Definitive Guide to K-Means Clustering with Scikit-Learn

Introduction

K-Means clustering is one of the most widely used unsupervised machine learning algorithms. It forms clusters of data based on the similarity between data instances.

In this guide, we will first take a look at a simple example to understand how the K-Means algorithm works before implementing it using Scikit-Learn. Then, we'll discuss how to determine the number of clusters (Ks) in K-Means, and also cover distance metrics, variance, and K-Means pros and cons.

Motivation

Imagine the following situation. One day, while walking around the neighborhood, you noticed there were 10 convenience stores and started to wonder which stores were similar - closest to each other in proximity. While searching for ways to answer that question, you came across an interesting approach that divides the stores into groups based on their coordinates on a map.

For instance, if one store was located 5 km West and 3 km North - you'd assign (5, 3) coordinates to it, and represent it in a graph. Let's plot this first point to visualize what's happening:

import matplotlib.pyplot as plt

plt.title("Store With Coordinates (5, 3)")
plt.scatter(x=5, y=3)

This is just the first point, so we can get an idea of how we can represent a store. Say we have already collected the coordinates of all 10 stores. After organizing them in a numpy array, we can also plot their locations:

import numpy as np

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

xs = points[:,0] # Selects all xs from the array
ys = points[:,1]  # Selects all ys from the array

plt.title("10 Stores Coordinates")
plt.scatter(x=xs, y=ys)

How to Manually Implement K-Means Algorithm

Now we can look at the 10 stores on a graph, and the main problem is to find whether there is a way they could be divided into different groups based on proximity. Just by taking a quick look at the graph, we'll probably notice two groups of stores - one is the lower points to the bottom-left, and the other one is the upper-right points. Perhaps we can even differentiate those two points in the middle as a separate group - therefore creating three different groups.

In this section, we'll go over the process of manually clustering points - dividing them into the given number of groups. That way, we'll carefully go over all the steps of the K-Means clustering algorithm. By the end of this section, you'll have both an intuitive and practical understanding of all the steps performed during K-Means clustering. After that, we'll delegate it to Scikit-Learn.

What would be the best way of determining if there are two or three groups of points? One simple way would be to choose a number of groups - for instance, two - and then try to group the points based on that choice.

Let's say we have decided there are two groups of our stores (points). Now, we need to find a way to understand which points belong to which group. This could be done by choosing one point to represent group 1 and one to represent group 2. Those points will be used as a reference when measuring the distance from all other points to each group.

In that manner, say point (5, 3) ends up belonging to group 1, and point (79, 60) to group 2. When trying to assign a new point (6, 3) to a group, we need to measure its distance to those two reference points. Since the point (6, 3) is closer to (5, 3), it belongs to the group represented by that point - group 1. This way, we can easily group all points into corresponding groups.
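
To make that comparison concrete, here's a tiny sketch of the assignment decision. The reference points (5, 3) and (79, 60) are the ones from the example above, and np.linalg.norm computes the straight-line (Euclidean) distance we'll formally introduce shortly:

new_point = np.array([6, 3])
ref_g1 = np.array([5, 3])    # reference point of group 1
ref_g2 = np.array([79, 60])  # reference point of group 2

# Straight-line distance from the new point to each reference point
dist_g1 = np.linalg.norm(new_point - ref_g1)  # 1.0
dist_g2 = np.linalg.norm(new_point - ref_g2)  # ~92.6
print('group 1' if dist_g1 < dist_g2 else 'group 2')  # group 1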

In this example, besides determining the number of groups (clusters) - we are also choosing some points to be a reference of distance for new points of each group.

That is the general idea to understand similarities between our stores. Let's put it into practice - we can first choose the two reference points at random. The reference point of group 1 will be (5, 3) and the reference point of group 2 will be (10, 15). We can select both points of our numpy array by [0] and [1] indexes and store them in g1 (group 1) and g2 (group 2) variables:

g1 = points[0]
g2 = points[1]

After doing this, we need to calculate the distance from all other points to those reference points. This raises an important question - how do we measure that distance? We can essentially use any distance measure, but, for the purpose of this guide, let's use the Euclidean distance.

Advice: If you want to learn more about Euclidean distance, you can read our "Calculating Euclidean Distances with NumPy" guide.

It can be useful to know that the Euclidean distance measure is based on Pythagoras' theorem:

$$
c^2 = a^2 + b^2
$$

When adapted to points in a plane - (a1, b1) and (a2, b2), the previous formula becomes:

$$
c^2 = (a2-a1)^2 + (b2-b1)^2
$$

The distance will be c - the square root of that sum of squares - so we can also write the formula as:

$$
euclidean_{dist} = \sqrt{(a2 - a1)^2 + (b2 - b1)^2}
$$

Note: You can also generalize the Euclidean distance formula for multi-dimensional points. For example, in a three-dimensional space, points have three coordinates - our formula reflects that in the following way:
$$
euclidean_{dist} = \sqrt{(a2 - a1)^2 + (b2 - b1)^2 + (c2 - c1)^2}
$$
The same principle is followed no matter the number of dimensions of the space we are operating in.
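
As a quick sanity check, here's a small sketch computing that formula in NumPy for the first two store points - both the manual version and NumPy's built-in Euclidean norm give the same value:

a = np.array([5, 3])    # (a1, b1)
b = np.array([10, 15])  # (a2, b2)

# Applying the formula directly
manual_dist = np.sqrt(np.sum((b - a)**2))
# NumPy's built-in Euclidean norm gives the same result
norm_dist = np.linalg.norm(b - a)
print(manual_dist, norm_dist)  # 13.0 13.0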

So far, we have picked the points to represent groups, and we know how to calculate distances. Now, let's put the distances and groups together by assigning each of our collected store points to a group.

To better visualize that, we will declare three lists. The first one to store points of the first group - points_in_g1. The second one to store points from the group 2 - points_in_g2, and the last one - group, to label the points as either 1 (belongs to group 1) or 2 (belongs to group 2):

points_in_g1 = []
points_in_g2 = []
group = []

We can now iterate through our points and calculate the Euclidean distance between them and each of our group references. Each point will be closer to one of two groups - based on which group is closest, we'll assign each point to the corresponding list, while also adding 1 or 2 to the group list:

for p in points:
    x1, y1 = p[0], p[1]
    # Euclidean distance from the current point to each group's reference point
    euclidean_distance_g1 = np.sqrt((g1[0] - x1)**2 + (g1[1] - y1)**2)
    euclidean_distance_g2 = np.sqrt((g2[0] - x1)**2 + (g2[1] - y1)**2)
    # Assign the point to whichever group is closer
    if euclidean_distance_g1 < euclidean_distance_g2:
        points_in_g1.append(p)
        group.append(1)
    else:
        points_in_g2.append(p)
        group.append(2)

Let's look at the results of this iteration to see what happened:

print(f'points_in_g1:{points_in_g1}\n \
\npoints_in_g2:{points_in_g2}\n \
\ngroup:{group}')

Which results in:

points_in_g1:[array([5, 3])]
 
points_in_g2:[array([10, 15]), array([15, 12]), 
              array([24, 10]), array([30, 45]), 
              array([85, 70]), array([71, 80]),
              array([60, 78]), array([55, 52]), 
              array([80, 91])]
 
group:[1, 2, 2, 2, 2, 2, 2, 2, 2, 2] 

We can also plot the clustering result, with different colors based on the assigned groups, using Seaborn's scatterplot() with the group as a hue argument:

import seaborn as sns

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

It's clearly visible that only our first point is assigned to group 1, and all other points were assigned to group 2. That result differs from what we had envisioned in the beginning. Considering the difference between our results and our initial expectations - is there a way we could change that? It seems there is!

One approach is to repeat the process and choose different points to be the references of the groups. This will change our results, hopefully making them more in line with what we envisioned in the beginning. This second time, we could choose them not at random as we previously did, but by taking the mean of all our already grouped points. That way, those new points could be positioned in the middle of their corresponding groups.

For instance, if the second group had only the points (10, 15) and (30, 45), the new central point would be ((10 + 30)/2, (15 + 45)/2) - which is equal to (20, 30).
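
That's just the coordinate-wise mean, which NumPy computes directly - a quick check of the numbers above:

two_points = np.array([[10, 15], [30, 45]])
print(two_points.mean(axis=0))  # [20. 30.]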

Since we have put our results in lists, we can convert them first to numpy arrays, select their xs, ys and then obtain the mean:

g1_center = [np.array(points_in_g1)[:, 0].mean(), np.array(points_in_g1)[:, 1].mean()]
g2_center = [np.array(points_in_g2)[:, 0].mean(), np.array(points_in_g2)[:, 1].mean()]
g1_center, g2_center

Advice: Try to use NumPy and NumPy arrays as much as possible. They are optimized for performance and simplify many linear algebra operations. Whenever you are trying to solve a linear algebra problem, you should definitely take a look at the NumPy documentation to check if there is a method designed to solve your problem. The chances are that there is!

To help repeat the process with our new center points, let's transform our previous code into a function, execute it and see if there were any changes in how the points are grouped:

def assigns_points_to_two_groups(g1_center, g2_center):
    points_in_g1 = []
    points_in_g2 = []
    group = []

    for p in points:
        x1, y1 = p[0], p[1]
        euclidean_distance_g1 = np.sqrt((g1_center[0] - x1)**2 + (g1_center[1] - y1)**2)
        euclidean_distance_g2 = np.sqrt((g2_center[0] - x1)**2 + (g2_center[1] - y1)**2)
        if euclidean_distance_g1 < euclidean_distance_g2:
            points_in_g1.append(p)
            group.append(1)
        else:
            points_in_g2.append(p)
            group.append(2)
    return points_in_g1, points_in_g2, group

Note: If you notice you keep repeating the same code over and over again, you should wrap that code into a separate function. It is considered a best practice to organize code into functions, especially because it facilitates testing. It is easier to test an isolated piece of code than code that hasn't been broken into functions.

Let's call the function and store its results in points_in_g1, points_in_g2, and group variables:

points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
points_in_g1, points_in_g2, group

And also plot the scatter plot with the colored points to visualize the groups division:

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

It seems the clustering of our points is getting better. But still, there are two points in the middle of the graph that could be assigned to either group when considering their proximity to both groups. The algorithm we've developed so far assigns both of those points to the second group.

This means we can probably repeat the process once more by taking the means of the Xs and Ys, creating two new central points (centroids) for our groups, and re-assigning the points based on distance.

Let's also create a function to update the centroids. The whole process now can be reduced to multiple calls of that function:

def updates_centroids(points_in_g1, points_in_g2):
    g1_center = np.array(points_in_g1)[:, 0].mean(), np.array(points_in_g1)[:, 1].mean()
    g2_center = np.array(points_in_g2)[:, 0].mean(), np.array(points_in_g2)[:, 1].mean()
    return g1_center, g2_center

g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

Notice that after this third iteration, each of the two middle points now belongs to a different cluster. It seems the results are getting better - let's do it once again. Now, on to the fourth iteration of our method:

g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

This fourth time we got the same result as the previous one. So it seems our points won't change groups anymore; our result has reached some kind of stability - it has settled into an unchangeable state, or converged. Besides that, we have exactly the same result as we had envisioned for the 2 groups. We can also check whether this final division makes sense.
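
If we wanted to automate those repetitions instead of calling the functions by hand, one way (a minimal sketch reusing the two functions defined above) is to loop until the assignments stop changing:

# Start from the same initial references as before and iterate until convergence
g1_center, g2_center = points[0], points[1]
group = []

while True:
    points_in_g1, points_in_g2, new_group = assigns_points_to_two_groups(g1_center, g2_center)
    if new_group == group:  # no point changed groups - the process has converged
        break
    group = new_group
    g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)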

Let's just quickly recap what we've done so far. We've divided our 10 stores geographically into two sections - ones in the lower southwest regions and others in the northeast. It can be interesting to gather more data besides what we already have - revenue, the daily number of customers, and many more. That way we can conduct a richer analysis and possibly generate more interesting results.

Clustering studies like this can be conducted when an already established brand wants to pick an area to open a new store. In that case, there are many more variables taken into consideration besides location.

What Does All This Have To Do With K-Means Algorithm?

While following these steps you might have wondered what they have to do with the K-Means algorithm. The process we've conducted so far is the K-Means algorithm. In short, we've determined the number of groups/clusters, randomly chosen initial points, and updated centroids in each iteration until clusters converged. We've basically performed the entire algorithm by hand - carefully conducting each step.

The K in K-Means comes from the number of clusters that need to be set prior to starting the iteration process. In our case K = 2. This characteristic is sometimes seen as negative considering there are other clustering methods, such as Hierarchical Clustering, which don't need to have a fixed number of clusters beforehand.

Due to its use of means, K-Means is also sensitive to outliers and extreme values - they increase the variability and make it harder for the centroids to play their part. So, be conscious of the need to perform extreme value and outlier analysis before clustering with the K-Means algorithm.
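
As one simple (and by no means only) way to screen for such values, here's a sketch using z-scores on our points array - the threshold of 2 standard deviations is just an illustrative choice:

# Flag points more than 2 standard deviations from the mean on either axis
z_scores = np.abs((points - points.mean(axis=0)) / points.std(axis=0))
outliers = points[(z_scores > 2).any(axis=1)]
print(outliers)  # candidates to inspect before running K-Means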

Also, notice that our points are separated by straight boundaries - there are no curves between the clusters. That can also be a disadvantage of the K-Means algorithm.

Note: When you need clustering to be more flexible and adaptable to ellipses and other shapes, try using a Gaussian Mixture Model, which generalizes K-Means and can adapt to elliptical clusters.

K-Means also has many advantages! It performs well on large datasets which can become difficult to handle if you are using some types of hierarchical clustering algorithms. It also guarantees convergence, and can easily generalize and adapt. Besides that, it is probably the most used clustering algorithm.

Now that we've gone over all the steps performed in the K-Means algorithm, and understood all its pros and cons, we can finally implement K-Means using the Scikit-Learn library.

How to Implement K-Means Algorithm Using Scikit-Learn

To double check our result, let's do this process again, but now using 3 lines of code with sklearn:

from sklearn.cluster import KMeans

# The random_state needs to be the same number to get reproducible results
kmeans = KMeans(n_clusters=2, random_state=42) 
kmeans.fit(points)
kmeans.labels_

Here, the labels are the same as our previous groups. Let's just quickly plot the result:

sns.scatterplot(x = points[:,0], y = points[:,1], hue=kmeans.labels_)

The resulting plot is the same as the one from the previous section.


Note: Just looking at how we've performed the K-Means algorithm using Scikit-Learn might give you the impression that this is a no-brainer and that you don't need to worry too much about it. Just 3 lines of code perform all the steps we've discussed in the previous section when we've gone over the K-Means algorithm step-by-step. But, the devil is in the details in this case! If you don't understand all the steps and limitations of the algorithm, you'll most likely face the situation where the K-Means algorithm gives you results you were not expecting.

With Scikit-Learn, you can also initialize K-Means for faster convergence by setting the init='k-means++' argument. With K-Means++, only the first cluster center is chosen uniformly at random from the data points. Each subsequent center is then chosen from the remaining points with a probability proportional to its squared distance from the nearest already chosen center. This careful seeding tends to speed up convergence and improve results, which is especially helpful when dealing with very large datasets.
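
Enabling it explicitly is just a matter of passing the argument (a quick sketch - note that recent Scikit-Learn versions already use 'k-means++' as the default init):

kmeans_pp = KMeans(n_clusters=2, init='k-means++', random_state=42)
kmeans_pp.fit(points)
kmeans_pp.labels_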

Advice: You can learn more about K-Means++ details by reading the "K-Means++: The Advantages of Careful Seeding" paper, proposed in 2007 by David Arthur and Sergei Vassilvitskii.

The Elbow Method - Choosing the Best Number of Groups

So far, so good! We've clustered 10 stores based on the Euclidean distance between points and centroids. But what about those two points in the middle of the graph that are a little harder to cluster? Couldn't they form a separate group as well? Did we actually make a mistake by choosing K=2 groups? Maybe we actually had K=3 groups? We could even have more than three groups and not be aware of it.

The question being asked here is how to determine the number of groups (K) in K-Means. To answer that question, we need to understand if there would be a "better" cluster for a different value of K.

The naive way of finding that out is by clustering points with different values of K, so, for K=2, K=3, K=4, and so on:

for number_of_clusters in range(1, 11): 
    kmeans = KMeans(n_clusters = number_of_clusters, random_state = 42)
    kmeans.fit(points) 

But, clustering points for different Ks alone won't be enough to understand if we've chosen the ideal value for K. We need a way to evaluate the clustering quality for each K we've chosen.

Manually Calculating the Within Cluster Sum of Squares (WCSS)

Here is the ideal place to introduce a measure of how close our clustered points are to each other. It essentially describes how much variance we have inside a single cluster. This measure is called the Within Cluster Sum of Squares, or WCSS for short. The smaller the WCSS is, the closer our points are, and therefore the more well-formed our cluster is. The WCSS formula can be used for any number of clusters:

$$
WCSS = \sum_{i}(P_{i1} - Centroid_1)^2 + \cdots + \sum_{i}(P_{in} - Centroid_n)^2
$$

Note: In this guide, we are using the Euclidean distance to obtain the centroids, but other distance measures, such as Manhattan, could also be used.

Now, assuming we've opted for two clusters, let's implement the WCSS to understand better what it is and how to use it. As the formula states, we need to sum up the squared differences between each cluster point and its centroid. So, if the first point of the first group is (5, 3) and the last centroid (after convergence) of the first group is (16.8, 17.0), that point's contribution to the WCSS will be:

$$
WCSS = \sum((5,3) - (16.8, 17.0))^2
$$

$$
WCSS = (5 - 16.8)^2 + (3 - 17.0)^2
$$

$$
WCSS = (-11.8)^2 + (-14.0)^2
$$

$$
WCSS = 139.24 + 196.00
$$

$$
WCSS = 335.24
$$
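
We can double-check that arithmetic quickly in code, using the converged centroid values from above:

point = np.array([5, 3])
centroid = np.array([16.8, 17.0])
print(np.sum((point - centroid)**2))  # 335.24 (up to floating-point rounding)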

This example illustrates how we calculate the WCSS for one point of the cluster. But a cluster usually contains more than one point, and we need to take all of them into consideration when calculating the WCSS. We'll do that by defining a function that receives a cluster of points and its centroid, and returns the sum of squares:

def sum_of_squares(cluster, centroid):
    squares = []
    for p in cluster:
        # Squared difference between each point and the centroid, per coordinate
        squares.append((p - centroid)**2)
    # Sum over all points and both coordinates
    ss = np.array(squares).sum()
    return ss

Now we can get the sum of squares for each cluster:

g1 = sum_of_squares(points_in_g1, g1_center)
g2 = sum_of_squares(points_in_g2, g2_center)

And sum up the results to obtain the total WCSS:

g1 + g2

This results in:

2964.3999999999996

So, in our case, when K is equal to 2, the total WCSS is 2964.39. Now, we can switch Ks and calculate the WCSS for all of them. That way, we can get an insight into what K we should choose to make our clustering perform the best.

Calculating WCSS Using Scikit-Learn

Fortunately, we don't need to manually calculate the WCSS for each K. After performing the K-Means clustering for the given number of clusters, we can obtain its WCSS by using the inertia_ attribute. Now, we can go back to our K-Means for loop, use it to switch the number of clusters, and list corresponding WCSS values:

wcss = [] 
for number_of_clusters in range(1, 11): 
    kmeans = KMeans(n_clusters = number_of_clusters, random_state = 42)
    kmeans.fit(points) 
    wcss.append(kmeans.inertia_)
wcss

Notice that the second value in the list is exactly the same as the one we've calculated before for K=2:

[18272.9, # For k=1 
 2964.3999999999996, # For k=2
 1198.75, # For k=3
 861.75,
 570.5,
 337.5,
 175.83333333333334,
 79.5,
 17.0,
 0.0]

To visualize those results, let's plot our Ks along with the WCSS values:

ks = [1, 2, 3, 4, 5 , 6 , 7 , 8, 9, 10]
plt.plot(ks, wcss)

There is a sharp bend in the plot at x = 2, and another, gentler one at x = 3 - after that, the curve flattens out. Notice that the shape reminds us of an elbow. By plotting the Ks along with their WCSS values, we are using the Elbow Method to choose the number of clusters. The chosen K is exactly the elbow point - where adding more clusters stops reducing the WCSS substantially - so it would be 3 instead of 2 in our case:

ks = [1, 2, 3, 4, 5 , 6 , 7 , 8, 9, 10]
plt.plot(ks, wcss);
plt.axvline(3, linestyle='--', color='r')

We can run the K-Means clustering algorithm again, to see how our data would look with three clusters:

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(points)
sns.scatterplot(x = points[:,0], y = points[:,1], hue=kmeans.labels_)

We were already happy with two clusters, but according to the elbow method, three clusters would be a better fit for our data. In this case, we would have three kinds of stores instead of two. Before using the elbow method, we thought about southwest and northeast clusters of stores, now we also have stores in the center. Maybe that could be a good location to open another store since it would have less competition nearby.

Alternative Cluster Quality Measures

There are also other measures that can be used when evaluating cluster quality:

  • Silhouette Score - analyzes not only the distance between intra-cluster points but also between clusters themselves
  • Between Clusters Sum of Squares (BCSS) - metric complementary to the WCSS
  • Sum of Squares Error (SSE)
  • Maximum Radius - measures the largest distance from a point to its centroid
  • Average Radius - the sum of the largest distance from a point to its centroid divided by the number of clusters.

It's recommended to experiment and get to know each of them since depending on the problem, some of the alternatives can be more applicable than the most widely used metrics (WCSS and Silhouette Score).
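
For instance, the Silhouette Score is readily available in Scikit-Learn - here's a minimal sketch computing it for a 2-cluster model of our store points (it ranges from -1 to 1, with higher values meaning better-separated clusters):

from sklearn.metrics import silhouette_score

kmeans_2 = KMeans(n_clusters=2, random_state=42)
kmeans_2.fit(points)
print(silhouette_score(points, kmeans_2.labels_))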

In the end, as with many data science algorithms, we want to reduce the variance inside each cluster and maximize the variance between different clusters, so that the clusters are more defined and separable.

Applying K-Means on Another Dataset

Let's use what we have learned on another dataset. This time, we will try to find groups of similar wines.

Note: You can download the dataset here.

We begin by importing pandas to read the wine-clustering CSV (comma-separated values) file into a DataFrame structure:

import pandas as pd

df = pd.read_csv('wine-clustering.csv')

After loading it, let's take a peek at the first five records of data with the head() method:

df.head()

This results in:

    Alcohol 	Malic_Acid 	Ash 	Ash_Alcanity 	Magnesium 	Total_Phenols 	Flavonoids 	Nonflavanoid_Phenols 	Proanthocyanidins 	Color_Intensity 	Hue 	OD280 	Proline
0 	14.23 		1.71 		2.43 	15.6 			127 		2.80 			3.06 		0.28 					2.29 				5.64 				1.04 	3.92 	1065
1 	13.20 		1.78 		2.14 	11.2 			100 		2.65 			2.76 		0.26 					1.28 				4.38 				1.05 	3.40 	1050
2 	13.16 		2.36 		2.67 	18.6 			101 		2.80 			3.24 		0.30 					2.81 				5.68 				1.03 	3.17 	1185
3 	14.37 		1.95 		2.50 	16.8 			113 		3.85 			3.49 		0.24 					2.18 				7.80 				0.86 	3.45 	1480
4 	13.24 		2.59 		2.87 	21.0 			118 		2.80 			2.69 		0.39 					1.82 				4.32 				1.04 	2.93 	735

We have many measurements of substances present in wines. Here, we also won't need to transform categorical columns because all of them are numerical. Now, let's take a look at the descriptive statistics with the describe() method:

df.describe().T # T is for transposing the table

The describe table:

                         count 	mean 		std 		min 	25% 	50% 	75% 		max
Alcohol 				178.0 	13.000618 	0.811827 	11.03 	12.3625 13.050 	13.6775 	14.83
Malic_Acid 				178.0 	2.336348 	1.117146 	0.74 	1.6025 	1.865 	3.0825 		5.80
Ash 					178.0 	2.366517 	0.274344 	1.36 	2.2100 	2.360 	2.5575 		3.23
Ash_Alcanity 			178.0 	19.494944 	3.339564 	10.60 	17.2000 19.500 	21.5000 	30.00
Magnesium 				178.0 	99.741573 	14.282484 	70.00 	88.0000 98.000 	107.0000 	162.00
Total_Phenols 			178.0 	2.295112 	0.625851 	0.98 	1.7425 	2.355 	2.8000 		3.88
Flavonoids 				178.0 	2.029270 	0.998859 	0.34 	1.2050 	2.135 	2.8750 		5.08
Nonflavanoid_Phenols 	178.0 	0.361854 	0.124453 	0.13 	0.2700 	0.340 	0.4375 		0.66
Proanthocyanidins 		178.0 	1.590899 	0.572359 	0.41 	1.2500 	1.555 	1.9500 		3.58
Color_Intensity 		178.0 	5.058090 	2.318286 	1.28 	3.2200 	4.690 	6.2000 		13.00
Hue 					178.0 	0.957449 	0.228572 	0.48 	0.7825 	0.965 	1.1200 		1.71
OD280 					178.0 	2.611685 	0.709990 	1.27 	1.9375 	2.780 	3.1700 		4.00
Proline 				178.0 	746.893258 	314.907474 	278.00 	500.500 673.500 985.0000 	1680.00

By looking at the table it is clear that there is some variability in the data - for some columns, such as Proline, the values vary a lot, while for others, such as Ash, they vary much less. Now we can check if there are any null, or NaN values in our dataset:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64  
 5   Total_Phenols         178 non-null    float64
 6   Flavonoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanidins     178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64  
dtypes: float64(11), int64(2)
memory usage: 18.2 KB

There's no need to drop or impute data, considering there aren't any empty values in the dataset. We can use a Seaborn pairplot() to see the data distribution and to check if the dataset forms pairs of columns that could be interesting for clustering:

sns.pairplot(df)

By looking at the pair plot, two columns seem promising for clustering purposes - Alcohol and OD280 (a method for determining the protein concentration in wines). It seems that there are 3 distinct clusters in the plots that combine the two of them.

There are other columns that seem to be correlated as well - most notably Alcohol and Total_Phenols, and Alcohol and Flavonoids. They have strong linear relationships that can be observed in the pair plot.
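
We can quantify those relationships with a quick correlation check on just the columns mentioned above:

df[['Alcohol', 'Total_Phenols', 'Flavonoids', 'OD280']].corr()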

Since our focus is clustering with K-Means, let's choose one pair of columns, say Alcohol and OD280, and test the elbow method for this dataset.

Note: When using more columns of the dataset, you will need to either plot in 3 dimensions or reduce the data to its principal components (with PCA). This is a valid, and more common, approach - just make sure to choose the principal components based on how much variance they explain, and keep in mind that reducing the data dimensions loses some information - so the plot is an approximation of the real data, not a faithful representation of it.
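
As a rough illustration of that approach (a sketch only - we won't use it for the rest of this guide, and the choice of 3 clusters here is just illustrative): standardize the features, project them onto two principal components, and cluster the projection:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaled = StandardScaler().fit_transform(df)              # PCA is sensitive to feature scale
components = PCA(n_components=2).fit_transform(scaled)   # keep the two strongest directions
kmeans_pca = KMeans(n_clusters=3, random_state=42).fit(components)
sns.scatterplot(x=components[:, 0], y=components[:, 1], hue=kmeans_pca.labels_)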

Let's plot the scatter plot with those two columns set as its axes to take a closer look at the points we want to divide into groups:

sns.scatterplot(data=df, x='OD280', y='Alcohol')

Now we can define our columns and use the elbow method to determine the number of clusters. We will also initialize the algorithm with k-means++ to make sure it converges more quickly:

values = df[['OD280', 'Alcohol']]

wcss_wine = [] 
for i in range(1, 11): 
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(values) 
    wcss_wine.append(kmeans.inertia_)

We have calculated the WCSS, so we can plot the results:

clusters_wine = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(clusters_wine, wcss_wine)
plt.axvline(3, linestyle='--', color='r')

According to the elbow method, we should have 3 clusters here. For the final step, let's cluster our points into 3 clusters and plot those clusters identified by colors:

kmeans_wine = KMeans(n_clusters=3, random_state=42)
kmeans_wine.fit(values)
sns.scatterplot(x = values['OD280'], y = values['Alcohol'], hue=kmeans_wine.labels_)

We can see clusters 0, 1, and 2 in the graph. Based on our analysis, group 0 has wines with higher protein content and lower alcohol, group 1 has wines with higher alcohol content and low protein, and group 2 has both high protein and high alcohol in its wines.
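
One way to sanity-check that interpretation is to look at the cluster centers, which hold the mean OD280 and Alcohol values of each group:

print(kmeans_wine.cluster_centers_)  # one row per cluster, columns in the order [OD280, Alcohol]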

This is a very interesting dataset and I encourage you to go further into the analysis by clustering the data after normalization and PCA - also by interpreting the results and finding new connections.

Conclusion

K-Means clustering is a simple yet very effective unsupervised machine learning algorithm for data clustering. It clusters data based on the Euclidean distance between data points. The K-Means clustering algorithm has many uses, such as grouping text documents, images, videos, and much more.
