Welcome to “Learn with Eskwelabs!” This series is called “From the Notebook of Our Fellows” because you will be guided by our very own alumni through a mix of basic and advanced data science concepts. Every time you read from one of our Fellows’ notebooks, just imagine that you have a data BFF or lifelong learning friend who’ll hold your hand at every step.
Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s one thing I value deeply, it’s education. Education has allowed me to not only dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.
Outside of work and school, I love playing video games like Valorant and League of Legends. I also love listening to Broadway musicals (HAMILTON, DEH, TICK TICK BOOM ALL THE WAY!). Lastly, I LOVE watching Friends, New Girl, HIMYM, and The Big Bang Theory.
Now, let’s take a look at my notebook!
So you’ve decided to learn another tree-based model, huh? Don’t worry because Forrest Gump and I will help you learn Random Forest, no pun intended. Welcome back again to the series “Learn with Eskwelabs,” where I try to help you become a better data scientist! In this article, we’ll be learning about one of the most popular and commonly used algorithms across real-life projects: Random Forest! Let’s gooo!
What are Ensemble Techniques in Machine Learning?
Before we dive into the roots (again, no pun intended haha) of Random Forest, it’s important for us to know what ensemble techniques are because Random Forest is an ensemble machine learning algorithm itself.
Let’s say you want to buy a house for you and your family. Will you just buy the very first house you see? That’s very unlikely. More likely, you’d look at a few websites and check locations, number of bedrooms, facilities, price, and so on. You might even ask some friends who might know something. In other words, you won’t jump straight to a conclusion; you’ll make a decision after considering other information out there.
Ensemble techniques work the same way! An ensemble is basically the process of combining multiple models. More technically, it is a collection of models used together to make predictions instead of having only one model do the job, which usually improves overall performance. We won’t be tackling ensemble techniques fully; instead, here’s an additional resource that you can read about them.
What is Random Forest?
We now know that Random Forest is a model that uses an ensemble technique, but what actually is it as an algorithm? Technically, it’s an algorithm that combines many weak classifiers (recall my article on logistic regression) to provide answers to complex problems. And just like Logistic Regression, Random Forest is a supervised machine learning algorithm; it is widely used for both regression and classification problems.
“Cool! But why is it called Random Forest?” That’s because it consists of many decision trees.
Don’t worry if you haven’t gone through decision trees yet, because I’ll try to break this down as simply as I can for you. Think of a forest: rather than depending on one tree to make predictions, Random Forest takes the prediction from many different trees and bases the final output on the majority vote.
Here’s another real-life example of Random Forest so we can understand it better. Let’s say you are a college student entering your senior year, and you still haven’t decided what class to take for your free elective. So you go to your best friend and ask what they suggest, and your friend says to take the Game Development class because it’s fun. Your other friends also give you suggestions on other classes you can take. Finally, after talking to different people about what class to take, you decide to take the class that was suggested the most by your friends.
In Random Forest, we train a number of decision trees (these are essentially the friends you ask). For a classification problem, the answer that gets the most votes becomes the final result; for a regression problem, the final result is the average of the trees’ outputs. Hopefully that example gave you a clear understanding of how this algorithm works.
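To make the voting idea concrete, here is a minimal sketch in plain Python. The elective names other than Game Development and the numeric estimates are made up for illustration, and the helper names `majority_vote` and `average_prediction` are hypothetical, not part of any library.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Classification: return the label suggested most often."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: return the mean of the individual estimates."""
    return mean(predictions)

# Each entry plays the role of one tree's (or one friend's) answer.
friend_suggestions = [
    "Game Development",
    "Data Visualization",   # hypothetical elective
    "Game Development",
    "Photography",          # hypothetical elective
    "Game Development",
]
chosen_class = majority_vote(friend_suggestions)
print(chosen_class)  # Game Development

# For a regression problem, the trees output numbers instead of labels.
tree_estimates = [3.0, 4.0, 5.0]
averaged = average_prediction(tree_estimates)
print(averaged)  # 4.0
```

If two labels are tied, `Counter.most_common` returns the one that appeared first, so a real implementation may want an explicit tie-breaking rule.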
What are decision trees?
So I’ve been mentioning decision trees since the start of the article, but what exactly are they and why do you need to know them? To know how a Random Forest algorithm works, we need to know decision trees, which are also a supervised machine learning algorithm used for classification and regression problems.
Decision trees use a flowchart-like tree structure to arrive at predictions. A tree starts with a root node and ends with decisions made at the leaves. It’s basically a bunch of if/else statements.
- Root node - the very top node or where the tree starts dividing
- Decision node - a node that results from a split and is itself split further
- Terminal or leaf node - nodes that can no longer be split
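As a rough sketch of the "bunch of if/else statements" idea, the toy tree below is written by hand as nested dictionaries. The feature names and thresholds are assumptions chosen for illustration (loosely inspired by the Iris dataset), not values learned from data, and `predict` is a hypothetical helper.

```python
# A hand-built toy tree; thresholds are illustrative, not learned from data.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,      # root node
    "left": {"label": "setosa"},                       # leaf node
    "right": {                                         # decision node
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},               # leaf node
        "right": {"label": "virginica"},               # leaf node
    },
}

def predict(node, sample):
    """Walk from the root down to a leaf and return that leaf's label."""
    if "label" in node:  # leaf nodes can no longer be split
        return node["label"]
    # The if/else at each node: go left when the value is at or below the threshold.
    if sample[node["feature"]] <= node["threshold"]:
        return predict(node["left"], sample)
    return predict(node["right"], sample)

first = predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2})
second = predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0})
print(first)   # setosa
print(second)  # virginica
```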
“Okay, but how do you determine which feature should be our root node?” Excellent question! To answer your question, we need to know something called the Gini index.
To fully understand how this concept works, you will have to wear your mathematical goggles, but don’t worry because I won’t be introducing any math concepts here as I try to explain this.
We basically need to know the impurity of our dataset, and we’ll take as the root node the feature that gives us the lowest impurity, or the lowest Gini index.
Here’s an example: When we take feature 1 as our root node, we get a pure split. A pure split means each resulting node contains only one class, either all yes or all no. For feature 2, the split is not pure. How do we determine how impure these nodes are? That’s where the Gini index comes in!
The algorithm computes the Gini index of all possible splits and chooses the feature with the lowest Gini index as the root node. A low Gini index means low impurity.
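For readers who do want a peek under the hood, here is a small sketch of how the Gini index could be computed for the pure and impure splits described above. It assumes the common definition (one minus the sum of squared class proportions, with the two child nodes weighted by their size); the yes/no labels are made-up data, and the function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of one node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    """Size-weighted average of the impurity of the two child nodes."""
    total = len(left_labels) + len(right_labels)
    left_weight = len(left_labels) / total
    right_weight = len(right_labels) / total
    return left_weight * gini_impurity(left_labels) + right_weight * gini_impurity(right_labels)

# "Feature 1": every child node contains a single class (a pure split).
pure_split = gini_of_split(["yes", "yes", "yes"], ["no", "no", "no"])

# "Feature 2": both child nodes mix yes and no (an impure split).
impure_split = gini_of_split(["yes", "yes", "no"], ["yes", "no", "no"])

print(pure_split)    # 0.0
print(impure_split)  # roughly 0.444
```

Because the first split has the lower Gini index, it is the one a tree would prefer for its root node.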
We won’t be going deeper into decision trees since we are focusing on Random Forest, but here is an additional resource you can look into!
Steps in Random Forest
- Step 1 - We first make subsets of our original data by randomly sampling n records for each subset (typically with replacement).
- Step 2 - An individual decision tree is constructed for each subset.
- Step 3 - Each decision tree gives an output.
- Step 4 - The final output is determined by majority voting if it’s a classification problem and by averaging if it’s a regression problem.
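The four steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier` and NumPy. This is a simplified illustration under a few assumptions of mine: 25 trees, sampling with replacement, and `max_features="sqrt"` so each split looks at a random subset of features. It is not how you would normally build a forest in practice, and scikit-learn's own `RandomForestClassifier` combines trees by averaging their predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees = 25  # assumed value for this sketch

trees = []
for i in range(n_trees):
    # Step 1: draw a random subset of records (sampling with replacement).
    sample_idx = rng.integers(0, len(X), size=len(X))
    # Step 2: build one decision tree on that subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[sample_idx], y[sample_idx])
    trees.append(tree)

# Step 3: every tree gives its own output for every record.
all_votes = np.array([tree.predict(X) for tree in trees])  # shape: (n_trees, n_records)

# Step 4: majority vote per record (this is a classification problem).
forest_prediction = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T]
)

training_accuracy = (forest_prediction == y).mean()
print("Accuracy on the data the trees were built from:", training_accuracy)
```

Note that this accuracy is measured on the same data used to build the trees, so it is optimistic; a held-out test set gives a fairer picture.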
In this illustration we can see the dataset is divided into subsets. Each subset will then have its own decision tree and prediction. After getting the prediction of all decision trees, we then take the majority vote and have that as our final prediction.
Difference between Decision Tree and Random Forest
At this point, you might already be wondering what the difference is between a decision tree and a random forest algorithm.
|Decision Tree|Random Forest|
|---|---|
|A decision tree normally suffers from overfitting if it’s allowed to grow to its maximum depth.|Each tree is built on a random subset of the dataset, and the final output is based on majority voting, so overfitting is greatly reduced.|
|In terms of computation, a single decision tree is faster.|Since it uses more than one decision tree, it is comparatively slower.|
|A decision tree formulates a set of rules to make predictions.|It doesn’t rely on a single set of rules; it randomly selects observations, builds many decision trees, and combines their results (majority vote or average).|
Just by looking at the differences between the two, we can already say that, generally, Random Forest is a better option than a single decision tree.
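One way to check the comparison above for yourself is to cross-validate both models on the same data. The sketch below is an illustration under my own assumptions: it uses scikit-learn's built-in breast cancer dataset, 5-fold cross-validation, and default settings apart from the random seed. The exact scores depend on the dataset and library version, so treat the outcome as indicative rather than a guarantee.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single, fully grown decision tree.
single_tree = DecisionTreeClassifier(random_state=0)
tree_scores = cross_val_score(single_tree, X, y, cv=5)

# A forest of 100 trees built on random subsets of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest_scores = cross_val_score(forest, X, y, cv=5)

tree_mean = tree_scores.mean()
forest_mean = forest_scores.mean()
print(f"Single decision tree, mean accuracy: {tree_mean:.3f}")
print(f"Random forest, mean accuracy:        {forest_mean:.3f}")
```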
Where do we even get to use Random Forest? There are many domains where we can apply this algorithm, and here are some of its major applications in different sectors:
- Banking Industry - In the banking industry, Random Forest is mainly used for credit card fraud detection, customer segmentation, and predicting loan defaults.
- Healthcare Industry - In healthcare, we use the algorithm in predicting cardiovascular diseases, diabetes, and breast cancer.
- Stock Market Industry - In this industry, we use it to predict stock market movements, conduct sentiment analysis on the market, and predict Bitcoin prices.
- E-Commerce - In the e-commerce industry, we use it to do product recommendations, price optimizations, and search ranking.
Implementation
What better way to finish our learning after some theoretical concepts than with some hands-on coding! For this implementation, we’ll be using scikit-learn to make a random forest model, and for our dataset we will be using the famous Iris dataset!
```python
# Load the Iris dataset and the random forest classifier
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Set the random seed so the train/test split is reproducible
np.random.seed(0)

# Create a DataFrame from the iris data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

# Add a new column for the species name
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

# Randomly mark roughly 75% of the rows as training data
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df.head()

# Create dataframes with the training rows and the test rows
train, test = df[df['is_train']], df[~df['is_train']]

# Show the number of observations in each dataframe
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))

# Create a list of the feature columns' names
features = df.columns[:4]
features

# Convert each species name into an integer code.
# pd.factorize returns (codes, uniques), so keep only the codes.
y = pd.factorize(train['species'])[0]
y

# Create and train the random forest classifier
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)

# Apply the trained classifier to the test data
clf.predict(test[features])

# View the predicted probabilities of the first 10 observations
clf.predict_proba(test[features])[0:10]

# Map each predicted class code back to its species name
preds = iris.target_names[clf.predict(test[features])]
preds[0:5]

# View the ACTUAL species for the first five observations
test['species'].head()

# Create a confusion matrix
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])
```
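As an optional follow-up to the walkthrough above, the sketch below summarizes the model with a single accuracy number and shows which features the forest relied on most. It is a separate, self-contained example that assumes a stratified 75/25 split via `train_test_split` instead of the random `is_train` column, so its numbers will not match the confusion matrix from the walkthrough exactly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 25% of the rows for testing, keeping the species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Share of test observations whose species was predicted correctly.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")

# Feature importances sum to 1; larger values mean the feature was more useful for splitting.
importances = dict(zip(iris.feature_names, clf.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```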
Wow, we just went by Random Forest like a breeze!
Here’s what we accomplished in this blog post:
- We first learned about what ensemble techniques are in machine learning where we even gave a real-life analogy to understand it.
- We also briefly discussed what decision trees are, and what role they play in the Random Forest algorithm.
- We then went through some real-life applications of Random Forest, where we found out that it’s actually being applied in various industries.
- Don’t forget, we also discussed the steps the algorithm takes, as well as how it differs from decision trees.
- And to top it all off, we even implemented Random Forest using Python!
There are still a lot of concepts and other algorithms you’ll encounter, and in this article we just looked at a very powerful machine learning algorithm, so give yourself a pat on the back.
If you’re interested in learning even more about Random Forest and other algorithms, I highly suggest joining the 12-week Data Science Fellowship, wherein you’ll be applying it to multiple projects across different industries!
Maybe you’re in the healthcare industry and you’re interested in predicting whether a cancer is malignant or benign? Or maybe you’re interested in music and making your own music recommendations? Wherever your interest lies, there is an application where you can leverage the power of Random Forest in the bootcamp!
Hopefully, this article was able to help you gain an understanding of what the Random Forest algorithm is about, and made machine learning a little less intimidating.
Never stop learning!
From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion
RECOMMENDED NEXT STEPS
Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.
- If you’re ready to dive in
- Enroll in the Data Science Fellowship via the sign up link here and take the assessment exam.
- Note: The assessment exam is a key part of your application. The deadline for the assessment is on August 21, 2022.