Welcome to “Learn with Eskwelabs!” This series is called “From the Notebook of Our Fellows” because you will be guided by our very own alumni through a mix of basic and advanced data science concepts. Every time you read from one of our Fellows’ notebooks, just imagine that you have a data BFF or lifelong learning friend who’ll hold your hand at every step.
Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s one thing I value deeply, it’s education. Education has allowed me not only to dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.
Outside of work and school, I love playing video games like Valorant and League of Legends. I also love listening to Broadway musicals (HAMILTON, DEH, TICK TICK BOOM ALL THE WAY!). Lastly, I LOVE watching Friends, New Girl, HIMYM, and The Big Bang Theory.
Now, let’s take a look at my notebook!
So you’ve decided to study Linear Regression? The only thing I can say about that is…LET’S GOOOO! Welcome back again to the series “Learn with Eskwelabs,” where I try to help you become a better data scientist! In this article, we’ll be learning about the famous—and perhaps the simplest algorithm out there (according to many)—Linear Regression!
Before we even go about learning what regression really is, it’s important that I first give you an idea of what predictions are in the world of machine learning. In machine learning, we train our model on a set of data, and after training, we test its knowledge on a new set of data that is similar to the original one. The results that our model gives us are what we call predictions.
To better understand this, imagine that you are teaching a child the multiplication table. After teaching it, you give the child a set of multiplication problems. The answers the child gives you are the predictions, because they can be either right or wrong since the child is still learning. Hopefully, that gave you a clearer understanding of what predictions are.
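The multiplication-table analogy maps neatly onto code. Below is a minimal sketch (using scikit-learn, which we’ll meet again later in this article, with made-up numbers): we “teach” a model the 7-times table, then “test” it on problems it hasn’t seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# "Teaching" phase: the 7-times table for 1 through 8
X_train = np.arange(1, 9).reshape(-1, 1)  # the questions: 1, 2, ..., 8
y_train = 7 * np.arange(1, 9)             # the answers: 7, 14, ..., 56

model = LinearRegression().fit(X_train, y_train)

# "Testing" phase: new problems the model hasn't seen
predictions = model.predict(np.array([[9], [12]]))
print(predictions)  # close to [63, 84]
```

The model’s answers on the unseen problems are its predictions, exactly like the child’s answers on the new worksheet.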
And so, what is regression? Regression is a predictive modeling technique that investigates the relationship between a dependent variable and one or more independent variables. If you look back at what I said about predictions, the same concept applies to regression, and that is why it’s a form of prediction.
- The dependent variable is the main thing that we are trying to predict.
- The independent variable is what we use to predict the dependent variable.
So in other words, regression is like asking, “I wonder how long it will take me to finish this apple if I slice it into pieces?” or “How much in sales will I make if I make my ice cream colder?”
“But Basty, are regression and Linear Regression different?” Well, I’m glad you asked, because as you continue learning, you’ll encounter different types of regression algorithms. So to answer your question: no, they are not different. Linear Regression is a type of regression.
Linear Regression is a regression algorithm used to find the linear relationship between a dependent variable and one or more independent variables. If you’re still a bit confused about what Linear Regression is, just think about what I said about slicing the apple into pieces. You’re basically trying to find out whether slicing the apple into pieces has any kind of relationship, be it positive or negative, with the amount of time it takes to finish the apple.
Types of regression
There are many types of regression analysis that you’ll be encountering in your journey as a data scientist, but in this article, I’ll only be talking about two of them.
- Simple Linear Regression
- Multiple Linear Regression
Simple Linear Regression

The same definition of Linear Regression still applies to this type of regression. The only thing you need to remember is that in simple linear regression, we use a single independent variable and a single dependent variable. In other words, we have something that we want to predict, and at the same time, we have something to predict it with. It’s that simple, no pun intended haha.
The core idea in Linear Regression is to obtain a line that best fits the data. The best fit line is the one for which the total prediction error (across all data points) is as small as possible. Error refers to the distance between a point and the regression line. Now that might sound a bit technical, so let’s visualize it!
- Red line - This is the best fit line, which represents the relationship between the two variables; it is also where our predicted values lie.
- Black lines - The black lines are what we call the residuals or errors: the distance from each actual point to its predicted point. The shorter the lines, the better the predictions.
- Black dots - These are the data points.
Now, there are a lot of different equations out there to represent the line of best fit, so for simplicity let’s just stick with the most common and, I think, the simplest one:

Y = b0 + b1X
- X and Y are the independent and dependent variables respectively
- b1 and b0 are the slope and y-intercept respectively
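For the curious (the article itself skips the math), these two coefficients can be computed directly from data with the standard least-squares formulas: b1 is the covariance of X and Y divided by the variance of X, and b0 then follows from the means. A sketch with made-up numbers:

```python
import numpy as np

# made-up data that is roughly Y = 2X (with a little noise)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# slope: covariance of x and y divided by variance of x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: forces the line through the point of means
b0 = y.mean() - b1 * x.mean()

print(b1, b0)  # roughly 1.93 and 0.23
```

Don’t worry if the formulas look intimidating; the model will handle all of this for us.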
b0 and b1 are what we call model coefficients. It’s important to get these values to get the best line. “But Basty, you keep on mentioning this line of best fit. How do we actually get the best line?” Excellent question!
How will the model obtain the best line?
As promised, we won’t be doing any computations, so let’s try to understand how this step works. First, our model will try a bunch of different straight lines based on our coefficients, and from those different lines, it will select the one that best predicts our data points.
Wow, pretty! Yes, it is pretty, but can you guess which one is the best line among those four lines? To identify the best one, we commonly use the MSE (Mean Squared Error), which comes from the least squares method:

MSE = (1/n) Σ (yi − ŷi)²

- yi is the actual value
- ŷi is the predicted value
We already know how the model selects the best fit line, but as I have highlighted in my previous article, we need to know how well the model performed. You need to evaluate whether the accuracy of your model is good or not.
The most common metric for evaluating the overall fit of a linear regression model is R-squared, or the coefficient of determination. It measures how much of the variance in the dependent variable is explained by the model.
- R-squared is between 0-1, and the higher the better because it means that more variance is explained by the model.
- If it's 1, then it's a perfect fit.
- If it's low, such as 0.2, then that would mean the actual values are far away from the regression line, or there are too many outliers.
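R-squared is simple enough to compute by hand: it’s 1 minus the ratio of the model’s squared errors to the squared errors you would get by always predicting the mean. A small sketch with made-up actual and predicted values:

```python
import numpy as np

# made-up actual values and a model's predictions for them
y = np.array([2.0, 4.0, 6.0, 8.0])
pred = np.array([2.2, 3.8, 6.1, 7.9])

ss_res = np.sum((y - pred) ** 2)      # error left over after the model
ss_tot = np.sum((y - y.mean()) ** 2)  # error of just guessing the mean

r2 = 1 - ss_res / ss_tot
print(r2)  # about 0.995 -- the model explains ~99.5% of the variance
```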
“But Basty, is there no optimal way of finding the best line?” You are on a roll! Optimization algorithms to the rescue! Optimization algorithms are used in machine learning to make computations faster. Kachow! In Linear Regression, we use Gradient Descent to find the most optimal value for our model coefficients. We won’t go in-depth with this, but here’s a great resource to understand Gradient Descent well.
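While we won’t derive gradient descent here, a bare-bones sketch shows its spirit: start with arbitrary coefficients and repeatedly nudge them in the direction that reduces the MSE. The data, learning rate, and iteration count below are all made up for illustration:

```python
import numpy as np

# made-up data that follows y = 3 + 2x exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3 + 2 * x

b0, b1 = 0.0, 0.0   # start with a deliberately bad line
lr = 0.02           # learning rate: how big each nudge is

for _ in range(20000):
    error = (b0 + b1 * x) - y
    # gradients of the MSE with respect to b0 and b1
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()

print(b0, b1)  # converges to roughly 3 and 2
```

Step by step, the coefficients slide toward the values of the true line, which is exactly what “finding the most optimal value for our model coefficients” means.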
Multiple Linear Regression

In simple linear regression, we only use one independent variable to predict the dependent variable. However, in multiple linear regression we use more than one independent variable to predict the dependent variable.
The equation for this kind of regression is a bit different from what I have already shown you, since we are dealing with more than one independent variable:

Y = b0 + b1X1 + b2X2 + ... + bnXn
- b0 is still the y-intercept
- b1, b2, b3,...bn are the slopes of the independent variables
- y is of course the dependent variable
Now you might be wondering why we even need to use multiple linear regression when we’re dealing with multiple variables. Why not just use simple linear regression on each independent variable separately? That would’ve been easier than knowing about another type of linear regression right? Well, not really.
Running separate simple linear regression models will lead to different outcomes if we use different variables at a time. Imagine that you are trying to see how TV, newspaper, and radio ads affect a company’s sales. We can definitely see how each of these mediums independently affect the sales, but we might miss the bigger picture that these three are actually working together to impact the company’s sales. Multiple linear regression solves this problem by taking all the variables in a single expression into account.
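Here’s a hedged sketch of that ad-spend scenario with simulated numbers: I generate fake TV, radio, and newspaper budgets, build sales from known weights plus noise, and let one multiple linear regression recover all three effects at once. (The weights 0.05, 0.10, and 0.01 are invented purely for illustration.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# fake monthly budgets: columns are TV, radio, newspaper
ads = rng.uniform(0, 100, size=(50, 3))

# sales built from known (invented) weights, plus some noise
sales = 5 + 0.05 * ads[:, 0] + 0.10 * ads[:, 1] + 0.01 * ads[:, 2] \
        + rng.normal(0, 0.5, 50)

model = LinearRegression().fit(ads, sales)
print(model.coef_)       # roughly [0.05, 0.10, 0.01]
print(model.intercept_)  # roughly 5
```

One model, one expression, and all three mediums are accounted for together.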
I won’t go in-depth on multiple linear regression here because it more or less functions the same as simple linear regression, and the model follows the same steps. If you want to understand multiple linear regression more, here is an article that can help you understand it better.
At this point, you might already be wondering, “Where is Linear Regression actually applied in the real-world? Aside from the ice cream and apple example.” My friend, Linear Regression is used in a wide variety of real-life situations across different industries.
- Show me the money — Companies often use Linear Regression to understand the relationship between ad spending and revenue, in the hope of increasing their sales.
- Under pressure — In medical research, they also use it to understand the relationship between drug dosage and blood pressure of patients.
- Grow, glow, and go — In the agriculture industry, the model is used to measure the effect of fertilizer and water on crop yields.
- Go the distance — And if you’re an avid sports fan, data scientists use regression to measure the effect of different training styles on the players’ performance.
In a nutshell, I’ve now explained what Linear Regression really is. But as with every learning algorithm you might have encountered, there are of course some assumptions that you need to know.
Linearity

The first assumption of linear regression is that there must be a linear relationship between the dependent and independent variables, because without a linear relationship, the predictions won’t be accurate.
One classic example of a linear relationship is when we’re trying to predict the test scores of students based on the number of hours they studied. Of course, studying a lot of hours doesn’t necessarily guarantee high scores, but the relationship is still a linear one.
Normality

Normality means that our residuals or errors should be normally distributed. Understanding what normality is can be a bit difficult, so let’s try to break it down.
Every sample or data point in our dataset has its own residual or error value that represents the distance between the actual and predicted values. Normality tells us that these residual values should be normally distributed. Now, what does normally distributed mean? It means the residuals roughly follow a bell curve centered around zero: most of them are small, and there are only a few extreme values or outliers.
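One quick, informal way to check this assumption is to fit a line, collect the residuals, and look at their shape: their mean should be essentially zero, and their skewness close to zero if the bell curve is symmetric. A sketch on simulated data where the assumption holds by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data: a true line plus normally distributed noise
x = rng.uniform(0, 10, 500)
y = 2 + 3 * x + rng.normal(0, 1, 500)

# fit a simple linear regression by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)

# with an intercept, least-squares residuals always average to zero;
# skewness near zero suggests a roughly symmetric, bell-shaped spread
skew = np.mean(((residuals - residuals.mean()) / residuals.std()) ** 3)
print(residuals.mean(), skew)
```

In practice you’d also plot a histogram of the residuals and eyeball the bell shape.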
Applying that to the same example, suppose the students also reported activities like studying, sleeping, and engaging in social media. All of these activities have a relationship with each other: if you study for a longer period, you sleep for less time, and similarly, extended hours of studying affect the time you engage in social media.
Multicollinearity

Another critical assumption of linear regression is that the independent variables should not be correlated with each other. This assumption mainly applies to multiple linear regression. Let’s use the same example to understand this assumption better, but instead of just using the number of hours the students studied, let’s also include the amount of time they spent on social media and the number of hours they slept. You might say that there could be a relationship among the variables, but there isn’t that much correlation.
Some students got a high mark, despite spending too much time on social media, while some students got a lower mark, despite spending more hours studying. If you think there is multicollinearity among your variables, the best solution is to remove some of them because we don’t want variables doing the same thing.
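A quick first check for multicollinearity is the correlation matrix of your independent variables. In the simulated sketch below, sleep is built to depend heavily on study time (an invented relationship for illustration) while social media time is independent, and the matrix makes the difference obvious:

```python
import numpy as np

rng = np.random.default_rng(1)

study = rng.uniform(0, 8, 100)                     # hours studied
sleep = 9 - 0.8 * study + rng.normal(0, 0.3, 100)  # built to depend on study
social = rng.uniform(0, 5, 100)                    # independent of the others

corr = np.corrcoef(np.vstack([study, sleep, social]))
print(corr.round(2))
# study vs sleep is strongly negative -> a multicollinearity warning sign;
# study vs social stays near zero
```

If two variables are this strongly correlated, dropping one of them is usually the simplest fix, as mentioned above.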
Homoscedasticity

Finally, the last assumption of a linear regression model is that there should be homoscedasticity in the data. Like normality, homoscedasticity is about the residual values. If the spread of your residuals or errors stays roughly constant across all your data points, the data is called homoscedastic; if the spread changes, for example if the errors grow larger as the predictions grow, it is called heteroscedastic.
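An informal check is to compare how spread-out the residuals are in different regions of the data. In this simulated sketch, one dataset has constant noise and one has noise that grows with x (both relationships invented for illustration); the ratio of residual spreads between the high-x and low-x halves tells them apart:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 400))

y_homo = 2 + 3 * x + rng.normal(0, 1, 400)           # constant noise
y_hetero = 2 + 3 * x + rng.normal(0, 0.2 * x + 0.1)  # noise grows with x

def spread_ratio(y):
    """Fit a line, then compare residual spread for large x vs small x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)
    return residuals[x > 5].std() / residuals[x <= 5].std()

print(spread_ratio(y_homo))    # near 1 -> homoscedastic
print(spread_ratio(y_hetero))  # well above 1 -> heteroscedastic
```

In practice, the usual visual check is a scatter plot of residuals against predicted values: a constant band suggests homoscedasticity, while a funnel shape suggests heteroscedasticity.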
We will be using Scikit-Learn to implement a simple linear regression model. Scikit-learn is a free Python library that contains tools for machine learning projects.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# creating a dummy dataset
np.random.seed(10)
X = np.random.rand(50, 1)
y = 3 + 3 * X + np.random.rand(50, 1)

# visualize linearity of data
plt.scatter(X, y, s=10)
plt.show()

# modelling
regressor = LinearRegression()
regressor.fit(X, y)
pred = regressor.predict(X)

# model evaluation
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)

# best fit line
plt.scatter(X, y)
plt.plot(X, pred, color='black', marker='o')
plt.show()

# results
print("Mean Squared Error: ", mse)
print("R-Squared: ", r2)
print("Y-intercept: ", regressor.intercept_)
print("Slope: ", regressor.coef_)
```
Wow, that was a long journey wasn’t it? We first learned about what regression is and its different types. Then we also talked about simple linear regression and multiple linear regression, where we introduced the linear regression equation and metrics.
Afterwards, we talked about the applications and assumptions of linear regression. By then, we were done with the theory and got our hands dirty by implementing the algorithm using Python!
To summarize: Linear Regression is a powerful predictive tool used to estimate the relationship between an independent variable and a dependent variable using a straight line.
There are many more skills you need to acquire in order to fully grasp how the linear regression algorithm works. One of the ways you can acquire more skills in understanding the algorithm is by joining the 12-week Data Science Fellowship, wherein you’ll be doing multiple projects using different technologies across different industries!
Perhaps you’re interested in predicting the number of votes of those who won the previous presidential and senatorial elections? Or maybe you want to predict the number of voters that actually voted using their demographics? Discover all those answers, and leverage the power of regression through the bootcamp.
Hopefully, this article was able to help you gain an understanding of what, why, and how Linear Regression works, and made machine learning a little less intimidating.
Never stop learning!

From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion
RECOMMENDED NEXT STEPS

Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.
- If you’re ready to dive in
- Enroll in the Data Science Fellowship via the sign up link here and take the assessment exam.
- Note: The assessment exam is a key part of your application. The deadline for the assessment is on August 21, 2022.
- If you want to know more
YOUR NEXT READ
- Bootcamp preparation
- Bootcamp payment options
- Other Bootcamp features