Welcome to “Learn with Eskwelabs!” This series is called “From the Notebook of Our Fellows” because you will be guided by our very own alumni through a mix of basic and advanced data science concepts. Every time you read from one of our Fellows’ notebooks, just imagine that you have a data BFF or lifelong learning friend who’ll hold your hand at every step.
Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s one thing I value deeply, it’s education. Education has allowed me to not only dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.
Outside of work and school, I love playing video games like Valorant and League of Legends. I also love listening to Broadway musicals (HAMILTON, DEH, TICK TICK BOOM ALL THE WAY!). Lastly, I LOVE watching Friends, New Girl, HIMYM, and The Big Bang Theory.
Now, let’s take a look at my notebook! IKUZO!! (Insert Naruto’s voice).
K-Means in the Data Science Fellowship
K-Means... that’s it. Right away, I want you to know that K does not stand for any word that starts with the letter K. The K in K-Means is the number of clusters (and therefore the number of centroids) that you ask the algorithm to find.
If you’re reading this article, it’s probably because you’ve heard of the terms Unsupervised Learning and Clustering, and you’ve probably already looked at Supervised Learning algorithms on the way here. Well, today is a good day to be reading this because in this article, I will share with you some practical knowledge that you’ll need in dealing with Clustering, specifically using the K-Means algorithm. Let’s go!
Before we even go about learning the K-Means algorithm, it’s essential to understand first what Unsupervised Learning is. Fortunately, there’s already this great article on what Machine Learning is, and it includes a simple-to-understand approach to the different types of learning. So I suggest you go check it out! But to keep the flow going in this article, here’s a brief description. Unsupervised learning is a term used to explain how we let the machine learn on its own, and let it uncover hidden patterns in the data for us.
The word clustering means grouping. You want to put your data into different groups depending on their properties or similarities. It’s actually one of the most common exploratory data analysis techniques that we can use to get a deeper understanding of the structure of our data.
This is probably the best practical example of clustering we can use to better understand it. In this photo, we can see that the different objects are separated into sub-groups based on their similarities, while the sub-groups themselves are different from one another.
And so, the main thing that you need to know from clustering is that we try to find sub-groups within the data, wherein the data within the sub-groups are similar to each other, while the sub-groups themselves are different from each other.
Now you might be wondering how clustering applies in the real-world setting. It’s actually used in various ways.
But how does all of this even relate to unsupervised learning? Here’s a reminder of the description we shared earlier in this article. “Unsupervised learning is a term used to explain how we let the machine learn on its own, and let it uncover hidden patterns in the data for us.” So with clustering, we’re letting the machine investigate the structure of our data by grouping them into different sub-groups!
Now that we’ve covered what clustering is, it is now time to learn a clustering algorithm.
Since K-Means is a clustering algorithm, then that means the same concepts apply to it!
The clusters that K-Means generates contain similar data points, while the clusters themselves are as different (far) as possible from each other, hence the word distinct.
How about the word non-overlapping, what does that mean? It means that each data point belongs to exactly one cluster, so the clusters generated by this algorithm do not share points or overlap each other.
Now the clusters that I have been talking about since the start of this article are each represented by a centre, or centroid. Always remember that in the basic algorithm these centroids are randomly initialized, unless we tell the model to choose the starting points more carefully. One technique for that is called K-Means++. We won’t be tackling that here, but make sure to take a look at this article by Sachit Mishra, which explains the application of K-Means++.
Back to the centroids! Each data point in our dataset will be included into a cluster whose centroid is closest to it. That is how the algorithm works.
And the ultimate goal of this algorithm is simply to find clusters in our dataset that minimize the Sum of Squared Errors (the sum of squared distances between each point and the centroid of its cluster).
I’ve already told you what K-Means algorithm is all about, but to be a great data scientist it’s better to understand how it actually works. Don’t worry, I won’t be including any Math in here....no computations at least.
Here is a photo I got from Alan Jeffares’ article on K-Means. From the left side, you can see that our data is scattered and there’s no way to tell which data points are similar to each other. By applying the K-Means algorithm, you can now distinguish which data points belong to which clusters.
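If you prefer code to pictures, here is a minimal sketch in plain Python of the standard assign-and-update loop behind K-Means: every point joins its nearest centroid, every centroid moves to the mean of its points, and the loop stops once nothing changes. The helper names (`squared_distance`, `assign`, `update`, `sse`, `k_means`) and the six toy points are my own assumptions for illustration; this is not how scikit-learn implements it internally.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous centroid in this sketch.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def sse(points, labels, centroids):
    """Sum of squared distances between each point and its cluster's centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical toy data: two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print("labels:", labels)
print("centroids:", centroids)
print("SSE:", sse(points, labels, centroids))
```

Which group gets label 0 and which gets label 1 depends on the random start, so only the grouping itself is meaningful.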
“Is there any way I can programmatically find the most optimal number of clusters?” and “How do I even measure the performance of a K-Means model if it just groups the data based on similarity?”
Those are probably questions that you’re already thinking about at this point, but thankfully, the answers are right here in this section. Now there are two main metrics that you need to know when dealing with the K-Means algorithm: inertia and the silhouette score.
Inertia is basically the sum of squared distances from each data point to its closest centroid. When interpreting inertia, lower is generally better, since this metric measures how close the data points are to their centroids. Keep in mind, though, that inertia keeps shrinking as you add more clusters, so it cannot pick the number of clusters on its own.
Remember the first question, “How can you find the most optimal number of clusters?” That’s where the Elbow method comes in.
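The Elbow method gives us an idea of what a good value of K would be based on the inertia. It’s called the Elbow method because the graph itself looks like a bent arm, and the bend (the elbow) marks the point where adding more clusters stops reducing the inertia by much.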
Now I want you to look at this graph and determine where the sum of squared errors starts to form an elbow. If you think that K=2 is the optimal number of clusters then you are right! However, keep in mind that it can still be hard to figure out which value of K is good, because sometimes the curve does not show a clear elbow.
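Here is a small sketch of how the numbers behind an elbow graph can be produced with scikit-learn. The dataset is synthetic (three blobs that I made up with `make_blobs`), so the elbow here lands at K=3 and not at the K=2 of the graph above. The `inertia_` attribute is where a fitted `KMeans` model stores its sum of squared distances.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed example data: 3 well-separated groups in 2 dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # sum of squared distances to closest centroid

for k, inertia in zip(ks, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `ks` against `inertias` (for example with matplotlib) gives the elbow curve; the big drop should stop right after the true number of groups.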
Identifying the number of clusters to use is just one thing, but knowing the implication of what that number means is also important. Model evaluation is such an essential step that we need to do because what if the value of inertia is good, but the clusters themselves are overlapping? That would be problematic and we don’t want any more problems than we already have.
Introducing the silhouette score! It’s a metric that tells you how much overlap exists between the clusters. For each data point, it compares how close the point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster. This metric ranges only between -1 and 1, and the closer it is to 1, the less the overlap.
Just like with inertia, we can also plot the silhouette score to find the best number of clusters to use.
This graph shows you that the number of clusters that gives the best size, density, and separation is K=5.
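To see the silhouette score in action, here is a hedged sketch using scikit-learn's `silhouette_score`. Again the data is synthetic (three made-up blobs), so the best K in this example is 3, which is different from the K=5 in the graph above. Note that the score needs at least two clusters, so the loop starts at K=2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed example data: 3 well-separated groups in 2 dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

scores = {}
for k in range(2, 9):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```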
In a nutshell, I’ve already explained what the K-Means algorithm really is. But as for every learning algorithm you might have already encountered, there are some assumptions that you need to know.
This means that K-Means works well in capturing the structure of the data if the clusters have a spherical-like structure. If the clusters have complicated geometric shapes, the algorithm will do a poor job of uncovering any patterns in the data.
The graph on the left can clearly give us an idea of how the data points will be clustered based on how they are scattered. But with the graph on the right, there’s no clear way to determine the clusters.
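You can check this assumption yourself with a quick experiment. The sketch below is my own illustration, not the data behind the graphs: it runs K-Means on two round blobs and on two interleaving half-moons (`make_moons`), then uses the adjusted Rand index to measure how well the clusters match the true groups (1.0 means a perfect match).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Assumed example data: round blobs vs. crescent-shaped groups.
blobs_X, blobs_y = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 8]], cluster_std=1.0, random_state=0
)
moons_X, moons_y = make_moons(n_samples=300, noise=0.05, random_state=0)


def agreement(X, y_true):
    """Cluster with K=2 and compare the result against the known groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, labels)


blobs_ari = agreement(blobs_X, blobs_y)
moons_ari = agreement(moons_X, moons_y)
print(f"Blobs (spherical):   ARI = {blobs_ari:.2f}")
print(f"Moons (non-spherical): ARI = {moons_ari:.2f}")
```

The blobs should be recovered almost perfectly, while the moons get cut by a straight boundary that ignores their shape.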
Since K-Means relies solely on distance, it is very sensitive to outliers: in every iteration, each centroid is recomputed as the mean of its points, and a mean is easily pulled toward outliers.
To solve this inevitable issue that you will encounter in the future, you can simply look at the distribution of your data, spot the outliers, and remove them.
Alternatively, you can also increase the number of clusters and hope that the outliers themselves form their own cluster.
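A tiny example shows how strong this pull can be. The four points and the single outlier below are made up for illustration, and `centroid` is just a helper name I chose for "the mean of each coordinate."

```python
from statistics import mean

# Hypothetical cluster of four nearby points, plus one far-away outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)


def centroid(points):
    """Centroid = mean of each coordinate across the points."""
    return tuple(mean(coords) for coords in zip(*points))


clean_centroid = centroid(cluster)
shifted_centroid = centroid(cluster + [outlier])

print("Without outlier:", clean_centroid)    # (1.5, 1.5)
print("With outlier:   ", shifted_centroid)  # (11.2, 11.2)
```

One outlier out of five points drags the centroid far away from where the other four actually sit.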
I cannot stress this enough! It’s essential that you scale your data first before applying the K-Means algorithm. Why, you ask? Imagine you have two features, A and B. The values in A range from -1 to 1 and B is from -100 to 100. When you apply the model to these values, B will contribute more to the sum of squared errors than A.
By normalizing or scaling your data, we put A and B on a comparable range so that B no longer dominates. In other words, we scale our data so that the difference in magnitude won’t affect the performance of the model, since it is distance-based.
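Here is a small worked example of the A and B situation in plain Python. The three rows are made up, and `standardize` is a hypothetical helper that subtracts each feature's mean and divides by its standard deviation (the same idea scikit-learn's `StandardScaler` applies).

```python
from statistics import mean, pstdev

# Hypothetical rows of (A, B): A spans -1..1, B spans -100..100.
rows = [(-1.0, -100.0), (0.0, 0.0), (1.0, 100.0)]


def standardize(rows):
    """Rescale every feature to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def squared_gaps(p, q):
    """Per-feature contribution to the squared distance between p and q."""
    return [(a - b) ** 2 for a, b in zip(p, q)]


raw_gaps = squared_gaps(rows[0], rows[-1])
scaled = standardize(rows)
scaled_gaps = squared_gaps(scaled[0], scaled[-1])

print("Raw contributions (A, B):   ", raw_gaps)     # [4.0, 40000.0]
print("Scaled contributions (A, B):", scaled_gaps)  # roughly equal
```

Before scaling, B contributes 10,000 times more than A to the squared distance; after scaling, both features contribute equally.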
We will be using scikit-learn to implement the K-Means algorithm. Scikit-learn is a free Python library that contains tools for machine learning projects.
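As a starting point, here is a minimal sketch of what a scikit-learn workflow could look like: scale the data, fit the model, then inspect the labels, centroids, and inertia. The dataset is synthetic (generated with `make_blobs`), and the choice of 3 clusters is an assumption that matches how the data was generated; with your own data you would pick K using the metrics above. Note that `init="k-means++"` is scikit-learn's default, and `init="random"` gives the purely random start instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Assumed example data: 300 points, 2 features, 3 generated groups.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Scale first, since K-Means is distance-based.
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 reruns the algorithm from 10 different starts and keeps the best.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = model.fit_predict(X_scaled)

print("First 10 labels:", labels[:10])
print("Centroids (in scaled units):")
print(model.cluster_centers_)
print("Inertia:", model.inertia_)

# Assign points to the nearest learned centroid (here, reusing 5 known rows).
print("Predicted clusters:", model.predict(X_scaled[:5]))
```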
To summarize, the K-Means algorithm is a powerful tool to explore a complex dataset and uncover any underlying patterns through the use of Euclidean distance, which is just a fancier way of saying straight-line distance. Woo! I told you there wouldn’t be any computations; I’ve got you, I’m here for you.
Congratulations, you’ve made it this far! You are now able to explain to your friends what Unsupervised Learning is, what clustering is all about, and where the K-Means algorithm fits into all this. It’s now time to apply what you’ve learned to your projects to gain a deeper understanding of the algorithm!
Speaking of projects, did you know that in the 12-week Data Science Fellowship, you’ll be doing multiple projects using different technologies across different industries? Why not bring your newfound knowledge of K-Means to the bootcamp, and apply it to a real-world dataset?
Maybe you’re interested in finding out the patterns of all candidates that won during the previous senatorial elections, or you simply want to find out the different attributes of a certain company’s customers? Discover all those hidden patterns, trends, and insights in the vast ocean of data you’ll be working with in the bootcamp.
So go out there and find different clusters! Try and learn different things, and find what ignites your passion in working with data. I hope this article helped you understand what the K-Means algorithm is, why it matters, and how it works, and made machine learning a little less intimidating.
Always remember that to be a great data scientist, it’s important to know the assumptions and theories behind an algorithm.
Never stop learning!
From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion
Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.