Welcome to “Learn with Eskwelabs!” This series is called “From the Notebook of Our Fellows” because you will be guided by our very own alumni through a mix of basic and advanced data science concepts. Every time you read from one of our Fellows’ notebooks, just imagine that you have a data BFF or lifelong learning friend who’ll hold your hand at every step.

Basty’s Notebook

Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s something I value so much, that is education. Education has allowed me to not only dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.

Outside of work and school, I love playing video games like Valorant and League of Legends. I also love listening to Broadway musicals (HAMILTON, DEH, TICK TICK BOOM ALL THE WAY!). Lastly, I LOVE watching Friends, New Girl, HIMYM, and The Big Bang Theory.

Now, let’s take a look at my notebook!

In this article, we’ll be learning about another famous and simple machine learning algorithm: Logistic Regression! There are a lot of articles out there about this algorithm, but most of them are purely mathematical, and this blog post isn’t one of them. Alright, LET’S GOOO!


In my previous post, we learned about Linear Regression. At this point you might already have questions about the difference between Linear Regression and Logistic Regression, because they both have the word “regression.” Are they the same thing? Are they both used for predicting something? Or perhaps they’re brothers… I’m sure you have a lot more questions, so let’s get started!

What is Classification?

Before we even go about learning the algorithm itself, let’s first recall what we learned in my two previous articles.
  • In the first article, we tackled K-Means, which is used for clustering problems.
  • In the second article we learned about Linear Regression, which is used for regression problems, or simply for predicting values.
Now, even though the algorithm we’ll be learning about has the word “regression” in it, that doesn’t mean it’s the same thing as Linear Regression. Actually, Logistic Regression is an algorithm that we use for classification problems.

In machine learning, classification is a supervised learning task in which a model learns from labeled examples how to sort data into classes.

To understand this better, let’s compare it to clustering. In clustering, our primary goal is to group the data into clusters based on their similarities. But in classification, our primary goal is to assign a label or class to the data, and then predict what class the given data points belong to. For example, you want to classify if an email is spam or not, or maybe you want to classify a certain movie’s genre.
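
To make the contrast concrete, here is a minimal sketch using scikit-learn. The six "emails" and their two features (number of links, number of exclamation marks) are made up purely for illustration, and so are the labels. The only point is that clustering receives no labels, while classification does.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up emails described by two features:
# [number of links, number of exclamation marks]
X = [[0, 0], [1, 0], [0, 1], [8, 9], [9, 7], [7, 8]]

# Clustering: no labels are given, the algorithm only groups similar rows
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("Cluster ids:", cluster_ids)  # the ids (0 or 1) carry no meaning by themselves

# Classification: we also provide the known label for each row
y = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]
clf = LogisticRegression()
clf.fit(X, y)

# Ask the trained model which class a new, unseen email belongs to
prediction = clf.predict([[6, 9]])[0]
print("Predicted class:", prediction)
```

Notice that K-Means only tells us which rows belong together, while the classifier answers with one of the class names we taught it.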

Think of regression, clustering, and classification as Thor, Captain America, and Iron Man of the Avengers. They are the main machine learning tasks you’ll be encountering in your career as a data scientist!

Logistic Regression

Now that we have an understanding of what classification is in machine learning, let’s try to understand what our main topic is really about. As I mentioned earlier, Logistic Regression is something we use for classification problems, so the same concept applies to it.

Logistic regression is a supervised learning algorithm used to predict a dependent categorical target variable. In other words, we use this algorithm to predict what category our dependent variable belongs to.

Always keep in mind, though, that the dependent variable we are trying to classify should be categorical, meaning each value is the label of a group (such as “spam” or “not spam”) rather than a number measured on a continuous scale.

I always try to apply a concept to an example to understand it better and to visualize how it works, so here’s an example of Logistic Regression. Imagine you were given a dog and an orange and you wanted to find out whether each of these items was an animal or not. The result of this task would be for the dog to end up classified as an animal, and for the orange to be classified as not an animal. So in this example, we are trying to predict whether the two objects are animals or not.

“Okay, now I know the primary goal of Logistic Regression is to classify whether something belongs to a certain class or not, but how does it actually determine if something belongs to that certain class?” Excellent question! Let’s dive a little deeper into how the algorithm would classify our given example. Just like in Linear Regression, we have independent variables that we choose to predict the dependent variable, and that applies here as well.

The model would have to know some characteristics of the items first before deciding where they belong. Since our example includes a dog and an orange, the characteristics could be the color, size, weight, shape, height, or even number of legs. In this way, knowing that an orange’s shape was a circle may help the algorithm to conclude that the orange was not an animal. Similarly, knowing that the orange had zero limbs would help as well.
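
Here is a rough sketch of what happens under the hood. Logistic Regression multiplies each characteristic by a weight, adds everything up, and passes the total through the sigmoid function to get a probability between 0 and 1. The feature names, the weights, and the bias below are hypothetical values I picked by hand for illustration; a real model learns them from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, chosen by hand (a real model learns them from data)
weights = {"number_of_legs": 1.5, "roundness": -3.0}
bias = -1.0

def probability_of_animal(item):
    # Weighted sum of the characteristics, then the sigmoid turns it into a probability
    z = bias + sum(weights[name] * item[name] for name in weights)
    return sigmoid(z)

dog = {"number_of_legs": 4, "roundness": 0.2}
orange = {"number_of_legs": 0, "roundness": 1.0}

p_dog = probability_of_animal(dog)
p_orange = probability_of_animal(orange)
print(f"P(dog is an animal) = {p_dog:.3f}")
print(f"P(orange is an animal) = {p_orange:.3f}")
```

With these made-up weights, having legs pushes the probability up and being round pushes it down, which is how the dog lands near 1 and the orange lands near 0.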

Types of Logistic Regression

“Basty, the example that you gave only used two labels: animal or not animal, but how about if I want to classify if it’s a dog, cat, or fish?” Your questions keep getting better and better, my friend!

There are three main types of Logistic Regression that you’ll be encountering: binary, multinomial, and ordinal. They all differ in terms of theory and execution. Let’s go through them one by one!

Binary Logistic Regression

You were already introduced to this type of Logistic Regression earlier, when we classified a dog and an orange as an animal or not. In this type of Logistic Regression, there are only two possible outcomes, which are typically represented as 1 and 0.

Here are some examples of Binary Logistic Regression:

  • Assessing cancer risk (high or low)
  • A basketball team winning a game (yes or no)
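
As a small sketch of the basketball example, the snippet below fits scikit-learn’s LogisticRegression on made-up numbers. The single feature (a team’s average point differential over its last five games) and all the values are assumptions for illustration only.

```python
from sklearn.linear_model import LogisticRegression

# Made-up data: average point differential over a team's last five games
X = [[-12], [-8], [-5], [-1], [2], [6], [9], [13]]
# 1 = the team won its next game, 0 = it lost
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each row
proba_strong = clf.predict_proba([[10]])[0]
# predict returns the more probable of the two classes
pred_strong = clf.predict([[10]])[0]
pred_weak = clf.predict([[-10]])[0]

print("P(lose), P(win) for a +10 team:", proba_strong)
print("Predicted outcome for a +10 team:", pred_strong)
print("Predicted outcome for a -10 team:", pred_weak)
```

The model first produces a probability, and the 1 or 0 we see at the end is simply the more likely of the two outcomes.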

Multinomial Logistic Regression

This type of Logistic Regression is used when an item can be classified as one of multiple classes. You need at least three classes to use it, and the classes have no particular order.

Here are some examples of Multinomial Logistic Regression:

  • Food texture (crunchy, crispy, mushy)
  • Hair color (blonde, brown, red, black)
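
One common way to handle three or more classes is to compute one score per class and turn the scores into probabilities with the softmax function, so that they add up to 1. Below is a minimal sketch of that step. The scores are hypothetical; a trained model would compute them from the item’s features.

```python
import math

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["crunchy", "crispy", "mushy"]
# Hypothetical scores for one food item
scores = [2.0, 1.0, -1.0]

probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]

for name, p in zip(classes, probabilities):
    print(f"{name}: {p:.3f}")
print("Predicted texture:", predicted)
```

The class with the highest probability becomes the prediction, just like picking between 1 and 0 in the binary case.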

Ordinal Logistic Regression

This type of Logistic Regression is similar to Multinomial Logistic Regression, but the classes must have a natural order.

Here are some examples of Ordinal Logistic Regression:

  • Customer rating (extremely dislike, dislike, neutral, like, extremely like)
  • Income level (low income, middle income, high income)
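
To show how the order can be used, here is a sketch of one common formulation, the cumulative-logit (also called proportional odds) model. It is not the only way to do ordinal classification. The cut points and the score below are hypothetical numbers chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probabilities(score, cutpoints):
    """Class probabilities under a cumulative-logit model.

    P(class <= k) = sigmoid(cutpoint_k - score), and each class probability
    is the difference between neighbouring cumulative probabilities.
    """
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

levels = ["low income", "middle income", "high income"]  # the order matters
cutpoints = [-1.0, 1.0]  # hypothetical thresholds, must be increasing
score = 2.0              # hypothetical weighted sum of one person's features

probs = ordinal_probabilities(score, cutpoints)
most_likely = levels[probs.index(max(probs))]

for name, p in zip(levels, probs):
    print(f"{name}: {p:.3f}")
print("Most likely level:", most_likely)
```

A single score slides along one scale, and the cut points decide where one level ends and the next begins. That is what makes the order of the classes matter here.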

Linear vs Logistic

We have now discussed the different types of Logistic Regression, and at this point, you might be wondering what the difference is between a Linear Regression model and a Logistic Regression model.
  • Difference #1: While they are both types of supervised learning, the first difference is that Logistic Regression predicts a categorical target, while Linear Regression predicts a numerical or continuous target. That means Logistic Regression is used for classification, and Linear Regression is used for predicting a value.
  • Difference #2: Linear Regression outputs a continuous value such as 10.2, 115.6, etc., while Logistic Regression outputs a probability between 0 and 1, which is then turned into a discrete class such as yes or no in the form of 1’s and 0’s.
  • Difference #3: In terms of how the model is fit, recall from my Linear Regression article that we discussed MSE, or mean squared error, for measuring the performance of our linear model. Logistic Regression instead relies on maximum likelihood estimation: in simple terms, it picks the coefficients that make the labels we actually observed as probable as possible.
There are still a few more differences between a Linear Regression model and a Logistic Regression model that are helpful to know about, and most of them are on the mathematical side. Here is an additional resource for you to check out!
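
Difference #3 is easier to see with numbers. The sketch below computes MSE for a few made-up regression predictions, and log loss for a few made-up predicted probabilities. Log loss is the average negative log-likelihood, so maximizing the likelihood is the same as minimizing log loss. All values here are invented for illustration.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared gap between true values and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred):
    """Average negative log-likelihood of the true 0/1 labels."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Linear Regression: compare predicted values with true values
mse = mean_squared_error([10.0, 12.0, 15.0], [11.0, 12.0, 13.0])
print(f"MSE: {mse:.3f}")

# Logistic Regression: compare predicted probabilities with true 0/1 labels
labels = [1, 0, 1, 1]
confident_and_right = [0.9, 0.1, 0.8, 0.95]
unsure = [0.6, 0.5, 0.5, 0.6]

loss_confident = log_loss(labels, confident_and_right)
loss_unsure = log_loss(labels, unsure)
print(f"Log loss (confident and right): {loss_confident:.3f}")
print(f"Log loss (unsure): {loss_unsure:.3f}")
```

Probabilities that sit close to the true labels give a lower log loss, which is exactly what the fitting procedure is trying to achieve.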

Applications

We already brought up a lot of examples where Logistic Regression is used, but where is it actually used in the real world? Let’s go through them, shall we?
  • Banking - Logistic regression may be used when predicting whether bank customers are likely to default on their loans. This is a calculation a bank makes when deciding whether it will lend to a customer, and when assessing the maximum amount it will lend to those it has already deemed creditworthy.
  • Medical Research - Logistic regression may be used to estimate the risk of cancer. To do so, researchers would have to assess a patient’s different factors such as age, race, weight, smoking status, drinking habits, exercise habits, medical history, family history of cancer, and many more.

Implementation

I know that up to this point we haven’t been technical in discussing Logistic Regression, but now we need to get our hands dirty with some code! Just like in our previous articles, we’ll be using scikit-learn! Scikit-learn is a free Python library that contains tools for machine learning projects.

# this is the dataset we will be using
dataset_url = "https://raw.githubusercontent.com/harika-bonthu/02-linear-regression-fish/master/datasets_229906_491820_Fish.csv"

# create a dataframe from the csv file
import pandas as pd
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
fish = pd.read_csv(dataset_url, on_bad_lines="skip")
fish.head()

# defining input and target variables
X = fish.iloc[:, 1:]
y = fish.loc[:, 'Species']

# label encode the target variable (species names become integers)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scaling the features: fit the scaler on the training set only,
# so that no information from the test set leaks into training
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# model building and training
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
# training the model
clf.fit(X_train, y_train)

# predicting the result
y_pred = clf.predict(X_test)

# computing the accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

# confusion matrix to check how well we did
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, y_pred)
plt.figure()
sns.heatmap(cf, annot=True)
plt.xlabel('Prediction')
plt.ylabel('Target')
plt.title('Confusion Matrix')
plt.show()
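
If the heatmap feels a bit abstract, here is a small sketch of what a confusion matrix counts and how accuracy falls out of it. The six target and predicted labels below are made up and are not results from the fish model above.

```python
# Made-up target and predicted labels for six fish
y_true = ["Bream", "Bream", "Perch", "Perch", "Perch", "Pike"]
y_pred = ["Bream", "Perch", "Perch", "Perch", "Pike", "Pike"]

labels = sorted(set(y_true) | set(y_pred))
index = {label: i for i, label in enumerate(labels)}

# Rows are the targets, columns are the predictions
matrix = [[0] * len(labels) for _ in labels]
for target, predicted in zip(y_true, y_pred):
    matrix[index[target]][index[predicted]] += 1

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true)

print("Labels:", labels)
for label, row in zip(labels, matrix):
    print(f"{label:>6}: {row}")
print(f"Accuracy: {accuracy * 100:.2f}%")
```

Every number off the diagonal is a mistake, and its row and column tell you which class was confused with which.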

Learn by Doing

In this article, we tackled Logistic Regression in a non-technical and non-mathematical way to build an understanding of it. However, a basic understanding of what it is isn’t enough. We still need to know how it technically works as an algorithm. Here is an additional resource that I think can help you get a better grasp of the algorithm in terms of its technicalities.

So just to recap what we learned today:

  1. We first learned about what classification is and how it relates to regression and clustering.
  2. We then learned about the main star of this article, Logistic Regression and its different types.
  3. After that, we discussed a bit on the difference between Linear Regression and Logistic Regression.
  4. Lastly, we looked at two applications of the algorithm in the real world, and finally got our hands dirty with some Python code!
To summarize: Logistic regression is a statistical analysis method used to predict a categorical outcome, most often a binary one such as yes or no, based on prior observations of a data set.

There is still a lot to learn about how Logistic Regression works as an algorithm, and this article was mainly meant to give you a head start. Aside from the resources that I gave you, one of the many ways you can learn the how’s of Logistic Regression is through the 12-week Data Science Fellowship, wherein you’ll be doing multiple projects using different technologies across different industries!

Maybe you’re interested in predicting which genres the songs in your favorite Spotify playlist fall under? Or, if you’re interested in the banking industry, maybe predicting whether a customer is likely to churn is what catches your interest. Discover all those answers and harness the power of classification through the bootcamp!

Hopefully this article was able to help you gain a basic understanding to boost your learning of how Logistic Regression works and made machine learning a little less intimidating.

Never stop learning!

From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion

RECOMMENDED NEXT STEPS

Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.

  • If you’re ready to dive in
    • Enroll in the Data Science Fellowship via the sign up link here and take the assessment exam.
      • Note: The assessment exam is a key part of your application. The deadline for the assessment is on August 21, 2022.