Welcome to “Learn with Eskwelabs!” This series is called “From the Notebook of Our Fellows” because you will be guided by our very own alumni through a mix of basic and advanced data science concepts. Every time you read from one of our Fellows’ notebooks, just imagine that you have a data BFF or lifelong learning friend who’ll hold your hand at every step.

Basty’s Notebook

Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s one thing I value above all, it’s education. Education has allowed me to not only dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.

Outside of work and school, I love playing video games like Valorant and League of Legends. I also love listening to Broadway musicals (HAMILTON, DEH, TICK TICK BOOM ALL THE WAY!). Lastly, I LOVE watching Friends, New Girl, HIMYM, and The Big Bang Theory.

Now, let’s take a look at my notebook! IKUZO!! (Insert Naruto’s voice).



K-Means.....that’s it. As early as now, I want you to know that K does not actually stand for any word that starts with the letter K. In fact, the K in K-Means stands for the number of clusters (and therefore the number of centroids) that you ask the algorithm to find.

If you’re reading this article, it’s probably because you’ve heard of the terms Unsupervised Learning and Clustering, and you’ve probably already looked at Supervised Learning algorithms on the way here. Well, today is a good day to be reading this because in this article, I will share with you some practical knowledge that you’ll need in dealing with Clustering, specifically using the K-Means algorithm. Let’s go!



Before we even go about learning the K-Means algorithm, it’s essential to first understand what Unsupervised Learning is. Fortunately, there’s already this great article on what Machine Learning is, and it includes a simple-to-understand approach to the different types of learning. So I suggest you go check it out! But to keep the flow going in this article, here’s a brief description. Unsupervised learning is a term used to explain how we let the machine learn on its own, without giving it any labels or correct answers, and let it uncover hidden patterns in the data for us.

So...clustering, what is it?

The word clustering means grouping. You want to put your data into different groups depending on their properties or similarities. It’s actually one of the most common exploratory data analysis techniques that we can use in order to get a deeper understanding about the structure of our data.

Credits to the owner

This is probably the best practical example of clustering we can use to better understand it. In this photo, we can see that the different objects are separated into sub-groups based on their similarities, while the sub-groups themselves are different from one another.

And so, the main thing that you need to know from clustering is that we try to find sub-groups within the data, wherein the data within the sub-groups are similar to each other, while the sub-groups themselves are different from each other.

Clustering applications

Now you might be wondering how clustering applies in the real-world setting. It’s actually used in various ways.
  • Getting to know you, getting to know all about you — It’s used for market segmentation, wherein companies try to find customers that are similar to each other based on any given properties or attributes they have.
  • Please don’t fake it till you make it — With fake information becoming prolific, a lot of companies, most notably social media ones, use clustering algorithms to identify fake news based on the content.
  • Spam belongs on the breakfast table, not in your inbox — And you know those annoying emails you get that are often used for getting your personal data? Email companies use clustering to create spam filters to flag those emails as spam or not.
  • Similar searches — Perhaps the most notable application of clustering is something you’ve already been using, and that is the search engine! You’ve probably noticed that whenever you’re searching something on Google, you get so many results that are similar to each other. That’s the result of clustering!

But how does all of this even relate to unsupervised learning? Here’s a reminder of the description we shared earlier in this article. “Unsupervised learning is a term used to explain how we let the machine learn on its own, and let it uncover hidden patterns in the data for us.” So with clustering, we’re letting the machine investigate the structure of our data by grouping them into different sub-groups!





Image Source

Now that we’ve covered what clustering is, it is now time to learn a clustering algorithm.

First of all, what is K-Means?

  • It is an iterative algorithm that groups the dataset into a pre-defined number (K) of distinct, non-overlapping subgroups or clusters.

Since K-Means is a clustering algorithm, then that means the same concepts apply to it!

The clusters that K-Means generates contain data points that are similar to each other, while the clusters themselves are as different (far) as possible from each other, hence the word distinct.

How about the word non-overlapping, what does that mean? It means that each data point belongs to exactly one cluster, so no point is shared between two clusters.

Now the clusters that I have been talking about since the start of this article are each represented by a centre or centroid. Always remember that these centroids are randomly initialized, unless we tell our model that we don’t want them to be “just random.” For that, there’s a technique called K-Means++, which picks starting centroids that are spread out from each other. We won’t be tackling that here, but make sure to take a look at this article by Sachit Mishra where she explains the application of K-Means++.
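If you’re curious what switching between the two looks like in code, here’s a minimal sketch. It assumes scikit-learn is installed and uses made-up data from `make_blobs`; the numbers it prints are only for this toy dataset, so don’t read too much into them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data: 300 points scattered around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

results = {}
for init in ("random", "k-means++"):
    # n_init=1 means a single run, so the starting centroids really matter.
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=0).fit(X)
    results[init] = (km.inertia_, km.n_iter_)
    print(init, "inertia:", round(km.inertia_, 2), "iterations:", km.n_iter_)
```

In scikit-learn, `init="k-means++"` is already the default, so you only need to set `init` when you want plain random starts.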

Goal of K-Means algorithm

Back to the centroids! Each data point in our dataset will be assigned to the cluster whose centroid is closest to it. That is how the algorithm works.

And the ultimate goal of this algorithm is to find clusters in our dataset that minimize the Sum of Squared Errors (the sum of squared distances between each centroid and the points assigned to it).
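To make the Sum of Squared Errors less abstract, here’s a tiny sketch that computes it by hand with NumPy. The six points and their cluster labels are made up for illustration.

```python
import numpy as np

# Six made-up 2-D points, already split into two clusters (0 and 1).
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

sse = 0.0
for cluster_id in np.unique(labels):
    members = points[labels == cluster_id]
    centroid = members.mean(axis=0)           # the centroid is the mean of its members
    sse += ((members - centroid) ** 2).sum()  # squared distances to that centroid

print(round(sse, 4))  # 2.6667
```

The tighter the points sit around their centroids, the smaller this number gets, and that is exactly what K-Means tries to achieve.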

Workflow of K-Means algorithm

I’ve already told you what K-Means algorithm is all about, but to be a great data scientist it’s better to understand how it actually works. Don’t worry, I won’t be including any Math in here....no computations at least.
  1. You first specify the number of clusters or K.
  2. The algorithm randomly places K centroids in the dataset.
  3. The clusters are formed by assigning each data point to the centroid it is closest to.
  4. Within those clusters, new centroids are computed as the mean of all the data points in that respective cluster.
  5. Steps 3 and 4 then repeat until the clusters are optimized. This just means that the centroids no longer change in the next iteration.
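The five steps above can be sketched in a few lines of NumPy. This is a simplified teaching version, not how scikit-learn implements it; the function name `simple_kmeans` and the toy data are my own for illustration.

```python
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated made-up blobs, around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = simple_kmeans(X, k=2)
print(centroids)
```

On data this clean, the two centroids should end up near the middle of each blob.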

Image Source


Here is a photo I got from Alan Jeffares’ article on K-Means. From the left side, you can see that our data is scattered and there’s no way to tell which data points are similar to each other. By applying the K-Means algorithm, you can now distinguish which data points belong to which clusters.

Measuring model performance

“Is there any way I can programmatically find the optimal number of clusters?” and “How do I even measure the performance of a K-Means model if it just groups the data based on similarity?”

Those are probably questions that you’re already thinking about at this point, but thankfully, the answers are right here in this section. Now there are two main metrics that you need to know when dealing with K-Means algorithm - Inertia and Silhouette.

Inertia is basically the sum of squared distances from each data point to the centroid of its cluster (the same Sum of Squared Errors from earlier). In general, the lower the inertia, the better, since this metric measures how close the data points are to their centroids. Just keep in mind that inertia keeps dropping as you add more clusters, so the lowest value is not automatically the best K.

Remember the first question, “How can you find the optimal number of clusters?” That’s where the Elbow method comes in.

The Elbow method gives us an idea of what a good number of K would be based on the inertia. You plot the inertia for several values of K and look for the point where the curve bends and starts to flatten out. It’s called the elbow method because the graph itself looks like an elbow.


Image Source

Now I want you to look at this graph and determine where the sum of squared errors starts to form an elbow. If you think that K=2 is the optimal number of clusters then you are right!

However, keep in mind that sometimes it’s still hard to figure out which number of K is good because the curve sometimes may not show any elbow.
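To see the elbow idea in numbers, here’s a small sketch on made-up data that I deliberately generated with three clearly separated groups. It assumes scikit-learn is installed; real data is rarely this tidy.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with three clearly separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 1))

# How much inertia improves when going from K=2 to 3, and from K=3 to 4.
drop_2_to_3 = inertias[2] - inertias[3]
drop_3_to_4 = inertias[3] - inertias[4]
print(round(drop_2_to_3, 1), round(drop_3_to_4, 1))
```

The inertia falls sharply up to K=3 and only creeps down after that. That sudden change in the size of the drop is the elbow.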

Identifying the number of clusters to use is just one thing, but knowing the implication of what that number means is also important. Model evaluation is such an essential step that we need to do because what if the value of inertia is good, but the clusters themselves are overlapping? That would be problematic and we don’t want any more problems than we already have.

Introducing the silhouette score! It’s a metric that tells you how well separated the clusters are. For each data point, it compares how close the point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster. This metric ranges only between -1 and 1: the closer it is to 1, the less the overlap, while values near 0 mean the clusters overlap.

Just like with inertia, we can also plot the silhouette score to find the best number of clusters to use.
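Here’s a quick sketch of how the silhouette score reacts to overlap. It uses made-up blobs from scikit-learn, and the helper name `kmeans_silhouette` is my own; the only thing that changes between the two runs is how spread out the blobs are.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def kmeans_silhouette(cluster_std):
    # Same three centers every time; only the spread of the points changes.
    X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                      cluster_std=cluster_std, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

tight = kmeans_silhouette(0.5)  # compact, well-separated clusters
loose = kmeans_silhouette(2.5)  # wide clusters that bleed into each other

print("tight:", round(tight, 2))
print("loose:", round(loose, 2))
```

The compact blobs should score much closer to 1 than the wide, overlapping ones.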


Credits to the owner

This graph shows you that the number of clusters with the best size, density, and separation is k=5.


In a nutshell, I’ve already explained what the K-Means algorithm really is. But as with every learning algorithm you might have already encountered, there are some assumptions and limitations that you need to know.

K-Means cannot handle non-spherical-like structures

This means that K-Means works well in capturing the structure of the data if the clusters have a spherical-like structure. If the clusters have complicated geometric shapes, the algorithm will do a poor job of uncovering any patterns in the data.

Credits to the owner

The graph on the left can clearly give us an idea on how the data points will be clustered based on how it is scattered. But with the graph on the right, there’s no clear way to determine the clusters.
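You can try this yourself with a small sketch, assuming scikit-learn is installed. Because the data is generated, we happen to know the true groups, so we can check how well K-Means recovers them with the adjusted Rand score (1.0 means a perfect match). The helper name `agreement` is my own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Round blobs versus two interleaved half-moons, both made up.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                              cluster_std=0.8, random_state=0)
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

def agreement(X, y_true):
    # Cluster with K=2 and compare against the groups we generated.
    y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, y_pred)

blob_score = agreement(X_blobs, y_blobs)
moon_score = agreement(X_moons, y_moons)
print("blobs:", round(blob_score, 2))
print("moons:", round(moon_score, 2))
```

K-Means should nail the round blobs but cut straight across the two moons, because all it can do is split the space by distance to a centroid.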

K-Means is sensitive to outliers

Since K-Means relies solely on distance, it is very sensitive to outliers: in every iteration, each centroid is recomputed as the mean of its points, and a mean gets pulled toward extreme values.

To solve this issue, which you will inevitably encounter in the future, you can simply look at the distribution of your data, spot the outliers, and remove them.
Alternatively, you can increase the number of clusters and hope that the outliers form a cluster of their own.
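To see why a single outlier matters so much, here’s a tiny NumPy sketch with made-up points. It simply compares the centroid (mean) of a cluster with and without one far-away point.

```python
import numpy as np

# Four made-up points that sit close together, plus one far-away outlier.
cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
outlier = np.array([[50.0, 50.0]])

centroid_clean = cluster.mean(axis=0)
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)

print(centroid_clean)         # [1.5 1.5]
print(centroid_with_outlier)  # [11.2 11.2]
```

One point was enough to drag the centroid away from where the rest of the cluster actually is.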

Always scale your data

I cannot stress this enough! It’s essential that you scale your data first before applying the K-Means algorithm. Why, you ask? Imagine you have two features, A and B. The values in A range from -1 to 1 and B is from -100 to 100. When you apply the model to these values, B will contribute more to the sum of squared errors than A.

By normalizing or scaling your data, we put A and B on a comparable scale so that B no longer dominates. In other words, we scale our data so that differences in magnitude won’t affect the performance of the model, since it is distance-based.
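Here’s a small sketch of the A and B example, assuming NumPy and scikit-learn are installed. The values are randomly generated just to match the ranges described above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=200)      # small-range feature
B = rng.uniform(-100, 100, size=200)  # large-range feature
X = np.column_stack([A, B])

# Squared difference between the first two rows, feature by feature.
# Before scaling, B's share of the squared distance dwarfs A's.
diff_sq = (X[0] - X[1]) ** 2
print("before scaling:", diff_sq)

X_scaled = StandardScaler().fit_transform(X)
diff_sq_scaled = (X_scaled[0] - X_scaled[1]) ** 2
print("after scaling: ", diff_sq_scaled)

# After scaling, both features have mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))
```

After scaling, a one-unit difference means the same thing in both columns, so neither feature gets to decide the clusters on its own.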






We will be using scikit-learn to implement the K-Means algorithm. Scikit-learn is a free Python library that contains tools for machine learning projects.
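import pandas as pd

# Let's say df contains our data
df = pd.read_csv("file name of csv")
# Select the features that you want to use for clustering
# (pass a list of column names so that X stays two-dimensional)
X = df[["feature_1", "feature_2"]]  # placeholder names: use your own columns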

# We use StandardScaler to scale our data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Feature scaling
X = scaler.fit_transform(X.values)

# First, pick a starting value for K
from sklearn.cluster import KMeans
# Define number of K
kmeans = KMeans(n_clusters=5)
# Fit the model to our scaled data
kmeans.fit(X)
# these will serve as your labels to identify which samples belong to which cluster
y_pred = kmeans.predict(X) 

# Model Evaluation
from sklearn.metrics import silhouette_score

s_score = silhouette_score(X, y_pred)

print("Inertia: " + str(kmeans.inertia_))
print("Silhouette: " + str(s_score))

# Let's now use the elbow method and silhouette score
inertia = []
sil = []

# fitting the model through different number of clusters to see which does better
for k in range(3,20):
    km = KMeans(n_clusters=k, random_state=1, init='k-means++')
    km.fit(X)
    y_pred = km.predict(X)
    
    inertia.append((k, km.inertia_))
    sil.append((k, silhouette_score(X, y_pred)))

# Now let's plot the results
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,2, figsize=(14,6))

# plotting the elbow curve
x_iner = [x[0] for x in inertia]
y_iner = [x[1] for x in inertia]
ax[0].plot(x_iner, y_iner)
ax[0].set_xlabel('Number of Clusters')
ax[0].set_ylabel('Inertia')
ax[0].set_title("Inertia Score - AKA. 'Elbow Curve'")

# plotting the silhouette score
x_sil = [x[0] for x in sil]
y_sil = [x[1] for x in sil]
ax[1].plot(x_sil, y_sil)
ax[1].set_xlabel('Number of Clusters')
ax[1].set_ylabel('Silhouette Score')
ax[1].set_title('Silhouette Score Curve')

# After selecting the best number for K, re-train your model
# (7 is only an example; use the K suggested by your own curves)
kmeans = KMeans(n_clusters=7)
kmeans.fit(X)
labels = kmeans.predict(X)
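Once you have the labels, a natural next step is to attach them back to your table and see what each cluster looks like. The sketch below is not part of the notebook code above. It uses made-up data in place of `df`, and the column names are placeholders.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for df; the column names are placeholders.
values, _ = make_blobs(n_samples=200, centers=3, random_state=1)
df = pd.DataFrame(values, columns=["feature_1", "feature_2"])

X = StandardScaler().fit_transform(df.values)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Attach each row's cluster label to the original (unscaled) table.
df["cluster"] = kmeans.labels_

# How many rows ended up in each cluster, and what the average row looks like.
cluster_sizes = df["cluster"].value_counts().sort_index()
cluster_profile = df.groupby("cluster").mean()

print(cluster_sizes)
print(cluster_profile)
```

Reading the averages in the original units is usually easier than reading them in the scaled ones, which is why the labels go back onto `df` and not onto `X`.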



To summarize, the K-Means algorithm is a powerful tool to explore a complex dataset and uncover any underlying patterns through the use of Euclidean mathematics, which is a fancier way of saying distance. Woo! I told you there wouldn’t be any computations; I’ve got you, I’m here for you.

Congratulations, you’ve made it this far! You are now able to explain to your friends what Unsupervised Learning is, what clustering is all about, and where the K-Means algorithm fits into all this. It’s now time to apply what you’ve learned to your projects to gain a deeper understanding of the algorithm!

Speaking of projects, did you know that in the 12-week Data Science Fellowship, you’ll be doing multiple projects using different technologies across different industries? Why not bring your newfound knowledge of K-Means to the bootcamp, and apply it to a real-world dataset?

Maybe you’re interested in finding out the patterns of all candidates that won during the previous senatorial elections, or you simply want to find out the different attributes of a certain company’s customers? Discover all those hidden patterns, trends, and insights in the vast ocean of data you’ll be working with in the bootcamp.

So go out there and find different clusters! Try and learn different things, and find what ignites your passion in working with data. I hope this article helped you gain an understanding of what, why, and how the K-Means algorithm works, and made machine learning a little less intimidating.

Always remember that to be a great data scientist, it’s important to know the assumptions and theories behind an algorithm.

Never stop learning!

From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion



RECOMMENDED NEXT STEPS

Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.

  • If you’re ready to dive in
    • Enroll in the Data Science Fellowship via the sign up link here and take the assessment exam.
      • Note: The assessment exam is a key part of your application. The deadline for the assessment is on August 21, 2022.