Welcome to “Learn with Eskwelabs!” This series is called “From the Notebook of Our Fellows” because you will be guided by our very own alumni through a mix of basic and advanced data science concepts. Every time you read from one of our Fellows’ notebooks, just imagine that you have a data BFF or lifelong learning friend who’ll hold your hand at every step.
Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s one thing I value deeply, it’s education. Education has allowed me not only to dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.
Outside of work and school, I love playing video games like Valorant and League of Legends. I also love listening to Broadway musicals (HAMILTON, DEH, TICK TICK BOOM ALL THE WAY!). Lastly, I LOVE watching Friends, New Girl, HIMYM, and The Big Bang Theory.
Now, let’s take a look at my notebook!
So you’ve decided to learn another tree-based model, huh? Don’t worry because Forrest Gump and I will help you learn Random Forest, no pun intended. Welcome back again to the series “Learn with Eskwelabs,” where I try to help you become a better data scientist! In this article, we’ll be learning about one of the most popular and commonly used algorithms across real-life projects: Random Forest! Let’s gooo!
Before we dive into the roots (again, no pun intended haha) of Random Forest, it’s important for us to know what ensemble techniques are because Random Forest is an ensemble machine learning algorithm itself.
Let’s say you want to buy a house for you and your family. Will you just buy the very first house you see? That’s very unlikely. What’s more likely is that you’d browse a few websites and check locations, the number of bedrooms, facilities, prices, etc. You might even ask some friends who might know something. In other words, you won’t jump straight to a conclusion; you’ll make a decision after considering other information out there.
Ensemble techniques work the same way! An ensemble is basically a combination of multiple models. In a more technical explanation, an ensemble technique uses a collection of models to make predictions instead of having only one model do the job, which usually increases overall performance. We won’t be tackling ensemble techniques fully here; instead, here’s an additional resource you can read about them.
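To make that idea concrete, here’s a minimal sketch of an ensemble using scikit-learn’s VotingClassifier. The dataset and the three models below are just placeholders I picked for illustration, not a prescription:

```python
# A minimal sketch of an ensemble: three different models vote on each prediction.
# The dataset and model choices here are placeholders, just for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each model votes; the ensemble's answer is the majority vote ("hard" voting).
ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=42)),
], voting="hard")
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```

Notice how the combined vote smooths out the mistakes any single model makes on its own; that’s the whole point of an ensemble.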
We now know that Random Forest is a model that uses an ensemble technique, but what exactly is it as an algorithm? Technically, it’s an algorithm that combines many weak classifiers (recall my article on logistic regression) to provide answers to complex problems. And just like Logistic Regression, Random Forest is a supervised machine learning algorithm that is widely used for regression and classification problems.
“Cool! But why is it called Random Forest?” That’s because it consists of many decision trees.
Don’t worry if you haven’t gone through decision trees yet, because I’ll break this down as simply as I can for you. Think of a forest: rather than depending on one tree to make predictions, Random Forest takes predictions from many different trees and produces the final output based on the majority vote.
Here’s another real-life example of Random Forest so we can understand it better. Let’s say you are a college student entering your senior year, and you still haven’t decided what class to take for your free elective. So you go to your best friend and ask what they suggest, and your friend says to take the Game Development class because it’s fun. At the same time, your other friends also give you suggestions on other classes you can take. Finally, after talking to different people about what class to take, you decide to take the class that was suggested the most by your friends.
In Random Forest, we train a number of decision trees (these are essentially the friends you ask), and the answer that gets the most votes becomes the final result for a classification problem, while the average of the trees’ outputs is the final result for a regression problem. Hopefully that example gave you a clear picture of how this algorithm works.
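Here’s a tiny sketch of that combining step. The tree outputs below are made-up numbers and labels, purely to show the voting and averaging:

```python
# Hypothetical outputs from five trees in a classification forest:
from collections import Counter

tree_votes = ["spam", "spam", "not spam", "spam", "not spam"]
final_class = Counter(tree_votes).most_common(1)[0][0]
print(final_class)  # "spam" -> the majority vote wins

# Hypothetical outputs from four trees in a regression forest:
tree_outputs = [3.1, 2.8, 3.4, 3.0]
final_value = sum(tree_outputs) / len(tree_outputs)
print(final_value)  # 3.075 -> the average of the trees' predictions
```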
So I’ve been mentioning decision trees since the start of the article, but what exactly are they, and why do you need to know them? To know how the Random Forest algorithm works, we need to know decision trees, which are also supervised machine learning algorithms used for classification and regression problems.
Decision trees use a flowchart-like tree structure to arrive at predictions. A tree starts with a root node and ends with decisions made at the leaf nodes. It’s basically a bunch of if/else statements.
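To see what I mean, here’s a hypothetical hand-written “tree.” The questions and thresholds are made up, purely to show the if/else structure:

```python
# A decision tree is essentially nested if/else questions about the features.
# This hand-written example (and its thresholds) is purely hypothetical.
def should_play_outside(weather: str, humidity: int) -> str:
    if weather == "sunny":          # root node: the first question we ask
        if humidity > 70:           # internal node: a follow-up question
            return "stay in"        # leaf node: a final decision
        return "play outside"      # leaf node
    if weather == "rainy":
        return "stay in"            # leaf node
    return "play outside"           # leaf node (e.g., overcast)

print(should_play_outside("sunny", 80))  # stay in
```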
“Okay, but how do you determine which feature should be our root node?” Excellent question! To answer your question, we need to know something called the Gini index.
To fully understand how this concept works, you’ll have to put on your mathematical goggles, but don’t worry, I’ll keep the math as light as possible as I explain this.
We basically need to measure the impurity of our data at each split, and we’ll take as the root node the feature that gives us the lowest impurity, or the lowest Gini index.
Here’s an example: when we take feature 1 as our root node, we get a pure split. A pure split means each resulting node contains only one class, all yes or all no. For feature 2, the split is not pure; the nodes contain a mix of yes and no. How do we determine how impure these nodes are? That’s where the Gini index comes in!
The algorithm will compute the Gini index of all possible splits and choose the feature with the lowest Gini index as the root node. The lower the Gini index, the lower the impurity.
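If you’re curious, the Gini index of a node is 1 minus the sum of the squared class proportions. Here’s a tiny sketch; the helper function is my own, written just for illustration (scikit-learn computes this internally):

```python
# Gini impurity of a node: 1 - sum of squared class proportions.
# An illustrative helper, not part of any library.
def gini(labels):
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p ** 2 for p in proportions)

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 -> a perfectly pure node
print(gini(["yes", "yes", "no", "no"]))    # 0.5 -> maximally impure (two classes)
```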
We won’t be going deeper into decision trees since we are focusing on Random Forest, but here is an additional resource you can look into!
In this illustration, we can see the dataset divided into random subsets (called bootstrap samples). Each subset gets its own decision tree and prediction. After collecting the predictions of all the decision trees, we take the majority vote as our final prediction.
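Here’s a rough, hand-rolled sketch of that process on a toy dataset. The number of trees, sampling scheme, and tree settings below are my own picks to show the idea, not what scikit-learn does internally:

```python
# A rough sketch of the Random Forest idea:
# train each tree on a bootstrap sample, then take the majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):  # 25 trees in our mini forest
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=42)
    trees.append(tree.fit(X[idx], y[idx]))

# Every tree predicts; the forest's answer is the majority vote per sample.
all_preds = np.array([tree.predict(X) for tree in trees])  # shape: (25, 300)
majority = np.array([np.bincount(col).argmax() for col in all_preds.T])
print("Mini-forest training accuracy:", (majority == y).mean())
```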
Now at this point, you might already be wondering what the difference is between a decision tree and the Random Forest algorithm.
Decision trees: a single tree trained on the entire dataset; fast to train and easy to interpret, but prone to overfitting.
Random Forest: many trees, each trained on a random subset of the rows and features; slower and harder to interpret, but much less prone to overfitting and usually more accurate.
Just by looking at the differences between the two, we can already say that, generally, Random Forest is a better option than a single decision tree.
Where do we even get to use Random Forest? There are a lot of domains where we can apply this algorithm, and some of the major ones are banking (e.g., predicting loan defaults), healthcare (e.g., predicting whether a patient has a certain disease), e-commerce (e.g., recommending products), and the stock market (e.g., identifying stock behavior).
What better way to cap off all those theoretical concepts than with some hands-on coding! For this implementation, we’ll be using scikit-learn to make a Random Forest model, and for our dataset we’ll use the famous Iris dataset!
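Here’s a minimal version of that implementation. The split ratio, number of trees, and random seed are my own choices, so feel free to tweak them:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the famous Iris dataset: 150 flowers, 4 features, 3 species.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train a Random Forest with 100 decision trees.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

And that’s it: a few lines of scikit-learn, and we have a working forest of 100 trees voting on each flower!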
Wow, we just went through Random Forest like a breeze!
Here’s what we accomplished in this blog post:
- Learned what ensemble techniques are and why combining models helps
- Understood what Random Forest is and why it’s called that
- Reviewed decision trees and how the Gini index picks the root node
- Compared decision trees with Random Forest
- Implemented a Random Forest model on the Iris dataset with scikit-learn
There are still a lot of concepts and other algorithms you’ll encounter, but in this article you got through a very powerful machine learning algorithm, so give yourself a pat on the back.
If you’re interested in learning even more about Random Forest and other algorithms, I highly suggest joining the 12-week Data Science Fellowship, wherein you’ll be applying it to multiple projects across different industries!
Maybe you’re in the healthcare industry and you’re interested in predicting whether a tumor is malignant or benign? Or maybe you’re interested in music and making your own music recommendations? Wherever your interest lies, there is an application where you can leverage the power of Random Forest in the bootcamp!
Hopefully, this article was able to help you gain an understanding of what the Random Forest algorithm is about, and made machine learning a little less intimidating.
From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion
Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.