K-Means Clustering Algorithm Introduction


Hey everyone welcome back! Today we are going to be getting a crash course in K-Means Clustering, an Unsupervised Learning algorithm. In the bigger picture, this method will be used to identify archetypes/playstyles and group players by them based on NBA Advanced statistics. However, today we are going to be using a sample dataset just to go over the basics.

What is K-Means Clustering?

K-Means clustering is a grouping algorithm that seeks to minimize the total variance within each cluster (often called the within-cluster sum of squares, or inertia). For those not familiar with how to calculate it, the variance contribution of each data point is as follows:

variance = |x - m|^2

with

x = data point

m = mean value (centroid) of the cluster the point belongs to

or in English, the squared distance between the data point and the mean of its cluster.

Then the variance for each data point in the cluster is summed to obtain a variance value for the cluster as a whole. This summed value is what the K-Means algorithm seeks to minimize for all clusters. This algorithm will continue to alter and reshape each cluster until the summed variance is minimized for ALL clusters involved.
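
To make that concrete, here is a minimal sketch (using made-up numbers, not part of the original walkthrough) of how the summed variance for a single cluster could be computed with numpy:

import numpy as np

# A made-up cluster of three 2-D points
cluster = np.array([[1.0, 2.0],
                    [1.5, 1.8],
                    [1.2, 2.2]])

m = cluster.mean(axis=0)  # mean (centroid) of the cluster

# squared distance of each point from the cluster mean
point_variances = np.sum((cluster - m) ** 2, axis=1)

# summed variance for the cluster -- this is what K-Means minimizes across all clusters
print(point_variances.sum())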

On to the code.

First thing's first, we have our import statements:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

I don't believe we have used anything from matplotlib or sklearn yet, so if you do not have those installed on your system, you can use the following pip install statements to install them:

pip install scikit-learn
pip install matplotlib

Next we will be generating our dummy dataset using the make_blobs function. This function will generate our data in distinct, predefined clusters, using the parameters we specify. This is how we will cross-check our clustering results to verify the algorithm is working correctly.

X, y = make_blobs(n_samples=200, centers=4, cluster_std=.5, random_state=0)

Now this statement is assigning the (x, y) coordinates of each data point to the “X” variable, and the assigned blob label for each data point to the “y” variable. Next, we will want to go ahead and graph our points to get a visual understanding of what we are working with.
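
Before plotting, a quick optional sanity check (not in the original walkthrough) on what make_blobs returned:

print(X.shape)  # (200, 2) -- 200 points, each with an (x, y) coordinate
print(y.shape)  # (200,)   -- one blob label per point
print(y[:10])   # first ten blob assignments (values 0 through 3)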

plt.scatter(X[:,0], X[:,1])
(Output: scatter plot of the generated data points.)

We can clearly see here that the data is already split into 4 pretty clear clusters, as we specified 4 centers in our make_blobs function. This will make the analysis easy to understand from a high level without worrying too much about how many clusters to try to create.
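
If you want to make those groups explicit, one optional extra (not in the original walkthrough) is to color each point by its true blob label:

plt.scatter(X[:, 0], X[:, 1], c=y)  # color by the blob label from make_blobs
plt.show()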

After this dataset is generated and we have reviewed the data we have to work with, we will begin the process of utilizing the KMeans algorithm. While in this specific instance, we already know how many distinct clusters are present in the data, normally we would not know this going in. So we are going to operate as if we do not know.

This requires us to run the KMeans algorithm several times, once for each candidate number of clusters, and record the variance for each run for comparison. We will decide how many clusters to create by utilizing what is called the elbow method. We will go further into detail on these methods later on, but in short, the elbow method is basically just graphing the variance against the number of clusters and identifying the biggest slope change. The elbow marks the cluster count that gives the most meaningful distinction between clusters before returns diminish.

To do this we will first be defining an empty list to hold the variance values from each iteration.

variance = []

Next we will establish our for loop over however many iterations we want; for now we will do 10 iterations to evaluate cluster counts 1 through 10.

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    variance.append(kmeans.inertia_)

Some parameters explained here:

  • n_clusters=i

    • This simply sets the number of clusters for this run to the loop value we are currently on.

  • max_iter=300

    • The process of K-Means clustering iterates repeatedly until it finds the ‘ideal’ cluster separations. This parameter enforces a maximum number of iterations so it doesn’t run indefinitely, ping-ponging back and forth, if there isn’t a clear ‘optimal’ break point.

  • n_init=10

    • Defines the number of times KMeans will be run with different starting centroids, keeping the best result.

  • random_state=0

    • This simply defines a random state. This ensures that we will get the same results despite the randomness in the algorithm. If we were to change the random state, the randomized aspect would be altered and slightly different results might ensue. The concept of the random_state is a little more complex, but it’s not terribly relevant for this scenario, so we won’t go further into it now (see the short sketch after this list).
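
To illustrate that last point, here is a short sketch (my own addition, assuming the X data generated above) showing that the same random_state reproduces the same result, while a different seed may land on a slightly different one:

# Two runs with the same random_state produce identical results
a = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
b = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(a.inertia_ == b.inertia_)  # True

# A different random_state may converge to a slightly different solution
c = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(c.inertia_)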

Now that we have run our iterations and recorded our variance values, we can begin the process of analyzing the clusters and deciding the optimal number. To do this we are going to start with plotting the variance against the number of clusters.

plt.plot(range(1, 11), variance)
plt.xlabel('Number of clusters')
plt.ylabel('Variance')
plt.show()
(Output: elbow plot of variance versus number of clusters.)

While it may seem at first that the more clusters the better, because they will be more specific/accurate, at some point you begin getting diminishing returns. This is evident on the elbow plot where the slope of the line flattens out (a nearly straight line) between successive cluster counts. You CAN use more clusters, but the level of information gained is essentially meaningless.

To decide the ‘optimal’ number of clusters we are going to be utilizing the elbow method. This process is based on the change in slope in our variance vs. clusters graph. In this scenario we can pretty easily tell visually where the elbow occurs because the data is already separated into 4 clusters. However, the elbow method can be harder to implement if there is not a clear-cut answer. You could go about using it in a more analytical way by graphing the slope change vs. the number of clusters and looking at it that way. That is not necessary for this dataset, though, so we won’t bother with the added complexity.
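
If you did want to take that more analytical route, here is a rough sketch (my own addition, reusing the variance list we built above) of how you might graph the slope change:

# First difference: the drop in variance gained by adding one more cluster
drops = np.diff(variance)
# Second difference: how much that drop levels off from one cluster count to the next
slope_change = np.diff(drops)

plt.plot(range(2, 10), slope_change)
plt.xlabel('Number of clusters')
plt.ylabel('Change in slope')
plt.show()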

After we have decided how many clusters to use (4), we will then predict which cluster each point will fall into, and compare with the initial blob assignments to ensure a 1-1 relationship from blob to cluster.

So first off we will run our kmeans algorithm over the data, explicitly stating that we want to sort into 4 clusters.

kmeans = KMeans(n_clusters=4, max_iter=300, n_init=10, random_state=0)

Next, we will create a new variable, pred_y, to house the predicted cluster for each data point.

pred_y = kmeans.fit_predict(X)

Now, we will graph the centroid (Center) for each of our clusters over our existing data points to visually inspect.

plt.scatter(X[:,0], X[:,1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='orange')
plt.show()
(Output: scatter plot of the data points with the four cluster centroids overlaid in orange.)
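
As an optional variation (not in the original walkthrough), you can also color each point by its predicted cluster to see the groupings alongside the centroids:

plt.scatter(X[:, 0], X[:, 1], c=pred_y)  # color by predicted cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='orange')
plt.show()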

Now that we can visually review our clusters and confirm that they make sense, we can double-check and make sure the actual data points were placed into the correct cluster. To do this we will compare the pred_y variable with the y variable from our make_blobs function. First we will convert the pred_y variable to a list and create a dataframe with these two variables for review.

pred_y = list(pred_y)
df = pd.DataFrame([y,pred_y])
df.head()
0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 194 195 196 197 198 199
0 1 2 0 3 1 0 2 0 0 0 ... 1 1 0 3 2 2 2 0 3 1
1 2 0 3 1 2 3 0 3 3 3 ... 2 2 3 1 0 0 0 3 1 2

2 rows × 200 columns

As we can see, we need to transform this dataset a little bit for easier consumption. We will need to transpose it, converting it from a wide format to a tall format.

df = df.T
df.head()
0 1
0 1 2
1 2 0
2 0 3
3 3 1
4 1 2

Now that we can view the data a little easier, we need to isolate each y value and make sure the pred_y value is constant throughout. They won’t be the same number, but as long as every pred_y value is the same, then the clusters match the initial blobs. To analyze this we will just slice out the data from the dataframe.

df1 = df[df[0]==3]
df1
0 1
3 3 1
16 3 1
17 3 1
18 3 1
20 3 1
26 3 1
30 3 1
39 3 1
40 3 1
41 3 1
42 3 1
47 3 1
48 3 1
50 3 1
53 3 1
55 3 1
60 3 1
64 3 1
67 3 1
75 3 1
80 3 1
90 3 1
91 3 1
92 3 1
97 3 1
100 3 1
103 3 1
104 3 1
110 3 1
118 3 1
120 3 1
133 3 1
134 3 1
136 3 1
137 3 1
139 3 1
149 3 1
150 3 1
151 3 1
156 3 1
158 3 1
160 3 1
165 3 1
173 3 1
174 3 1
175 3 1
187 3 1
189 3 1
193 3 1
198 3 1

We can see here that every pred_y value is a 1 where the y value is a 3. This process can be repeated for each y value to verify.
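
If you would rather check all four blobs at once instead of slicing each y value by hand, one option (my own addition, not part of the original walkthrough) is a pandas cross-tabulation; a clean 1-to-1 mapping shows up as exactly one non-zero count per row:

# Cross-tabulate true blob labels against predicted clusters
print(pd.crosstab(y, pred_y))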

That’s K-Means clustering in a nutshell; it can get more complicated than this using real-world data, but these are the basics. Stay tuned for application to NBA stats for use in DFS!
