Identifying NBA Player Archetypes Using K-Means Clustering — Part One

Example Dataset for Statistics

Jupyter Notebook File to follow along with

Hey everyone, welcome back! Today we are going to be utilizing and expanding upon the basics of K-Means clustering to identify player groupings based on playstyle.

This is a pretty big topic, so we are going to break it into two parts for easier consumption. The overall process is going to be as follows:

  1. Part One

    1. Acquire data — Dataset provided for repeatability purposes here.

    2. Import data into pandas dataframe.

    3. Create ‘features’ dataframe for data to use (not including player names).

    4. Standardize data by feature for uniformity.

    5. Perform Principal Component Analysis (PCA)

      1. Dimension Reduction for more efficient analysis.

      2. Define how many components to reduce to.

  2. Part Two

    1. Identify the optimal number of clusters.

      1. Using silhouette method.

    2. Run K-Means algorithm to create clusters.

    3. Review clusters.

      1. If unhappy with the results, rerun with a different number of components and clusters.


Acquire Data

For the sake of consistency and ease of use, I’m going to provide an example dataset so everyone can use the same data and not have to worry about scraping and cleaning it up themselves for now.

Example Dataset


Import Data Into Pandas Dataframe

First things first, we need to import the packages we are going to be using:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pickle

Next up, we are going to load our dataset into a pandas dataframe:

# Point the path at wherever you saved the example dataset; player names become the index and every stat column is cast to float.
df = pd.read_excel(r"FILEPATH\2019-20allStatsClean.xlsx").set_index('Player').astype(float)

Now, this dataframe already has the player column as the index, but later on we will need a copy of this dataframe with the player names as a normal column. So we are going to define a new dataframe with the reset_index function, and then verify that our initial dataframe remains unchanged.

dfPlayerCol = df.reset_index()
df.head()
Since the dataframe has 30 stat columns, the head() display truncates the middle columns; the first ten and last ten stat columns look like this:

| Player | Advanced ASTpercent | Advanced OREBpercent | Advanced DREBpercent | Advanced REBpercent | Advanced TSpercent | Advanced USGpercent | Advanced PACE | Advanced PIE | postupdefense Poss | postupdefense PPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Langston Galloway | 8.8 | 1.8 | 6.5 | 4.1 | 58.3 | 14.5 | 100.78 | 7.4 | 0.2 | 1.14 |
| Anfernee Simons | 10.2 | 1.6 | 8.4 | 5.0 | 51.7 | 18.1 | 103.10 | 6.1 | 0.4 | 0.82 |
| Bobby Portis | 11.9 | 5.9 | 18.1 | 11.8 | 51.3 | 20.2 | 101.75 | 10.3 | 0.7 | 0.89 |
| Coby White | 15.5 | 1.7 | 13.0 | 6.9 | 47.7 | 22.3 | 102.28 | 7.6 | 0.2 | 1.17 |
| Dorian Finney-Smith | 6.8 | 6.7 | 11.2 | 9.0 | 59.4 | 12.8 | 99.37 | 7.1 | 1.0 | 0.84 |

| Player | Scoring 2FGMpercentAST | Scoring 3FGMpercentAST | Scoring FGMpercentAST | spotupoffense Poss | spotupoffense PPP | touches Time_OfPoss | touches Avg_Sec_PerTouch | touches Avg_Drib_PerTouch | TransitionOffensive Poss | TransitionOffensive PPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Langston Galloway | 73.3 | 100.0 | 88.4 | 3.3 | 1.20 | 0.9 | 1.94 | 1.25 | 1.6 | 1.32 |
| Anfernee Simons | 35.0 | 68.8 | 46.5 | 2.4 | 0.89 | 2.1 | 3.80 | 3.25 | 1.4 | 0.93 |
| Bobby Portis | 52.3 | 98.2 | 64.7 | 2.7 | 0.99 | 1.1 | 1.68 | 0.69 | 1.0 | 1.29 |
| Coby White | 28.0 | 78.6 | 50.2 | 3.4 | 0.94 | 2.7 | 3.82 | 3.21 | 2.5 | 1.06 |
| Dorian Finney-Smith | 56.8 | 96.6 | 76.1 | 3.8 | 1.03 | 1.1 | 1.85 | 0.85 | 1.6 | 1.24 |

5 rows × 30 columns

Create ‘Features’ Dataframe For Data To Use (Not Including Player Names)

Now we need to separate out our ‘features’ for the K-Means algorithm. In machine learning, features typically refers to the data you want to run the algorithm on. So we need to pull out just the stat values: first we create a list of column names from our initial dataframe (this is important because the player name is the index, so it will not be returned as a column name), and then we grab the corresponding values as an array.

# Every remaining column is a stat we want to feed to the algorithm; grab the values as a NumPy array.
features = list(df.columns)
x = df.loc[:, features].values
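
If you want to double-check what ended up in x, a quick optional sanity check (not needed for anything later) could look like this:

# Optional sanity check: the array should be (number of players) x 30 stats,
# and 'Player' should not show up among the feature names since it is the index.
print(x.shape)
print('Player' in features)  # expect False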

Standardizing Data

Finally, we need to standardize this data. Standardization is important here because our features live on very different scales (percentages, possessions, pace, seconds per touch). StandardScaler rescales each feature to a mean of 0 and a standard deviation of 1, so no single stat dominates the analysis just because of its units.

x = StandardScaler().fit_transform(x)
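
If you are curious whether the scaler did its job, an optional check is to confirm that each column now sits at roughly mean 0 and standard deviation 1:

# Optional check: after StandardScaler, every feature column should have
# mean ~0 and standard deviation ~1 (up to floating point rounding).
print(np.round(x.mean(axis=0), 3))
print(np.round(x.std(axis=0), 3))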

Perform Principal Component Analysis (PCA)

Next up, we need to perform some dimension reduction to improve the analysis on this dataset. Dimension in this context refers to the number of stats we are using. The method we are going to use is Principal Component Analysis. While this isn’t a super simple concept, using NBA stats as an example makes it a little easier to understand. The first thing to cover here is correlation within the dataset. Certain features within a dataset are going to be correlated with other features; think 3PM, 3PA, and 3P%. While each is an individual stat you could use in an analysis, they are related to each other: someone with a high 3PA is probably going to have a high 3PM relative to someone who takes an average number of threes.
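
As a quick illustration with our own data (this is an optional aside, not part of the main workflow; the example dataset uses advanced and playtype stats rather than raw 3PM/3PA, but the same idea applies), pandas can compute the full correlation matrix, and the most correlated pairs are exactly the redundancy PCA is going to fold together:

# Illustration: find the most strongly correlated pairs of stats in df.
# Each pair shows up twice ((A, B) and (B, A)), which is fine for a quick look.
corr = df.corr().abs()
corr_pairs = (
    corr.where(~np.eye(len(corr), dtype=bool))  # blank out the diagonal (self-correlation)
        .unstack()
        .sort_values(ascending=False)
        .dropna()
)
print(corr_pairs.head(10))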

Dimension Reduction For More Efficient Analysis

Performing a dimensionality reduction helps deal with these correlations: PCA combines the original features into a smaller set of new, uncorrelated variables, each of which is a weighted combination of the original stats. These new variables are referred to as principal components, hence the name of the method; we will just call them components for shorthand going forward. Now we need to figure out how many components to reduce our data down to. This is done by measuring what’s called the explained variance, which in normal words means how much of the variance from our initial dataset is conveyed with that number of components. Our dataset contains 30 features, so while normally we would only be interested in fewer than 30 components, for the sake of analysis we are going to look at 2-30 components so we can see the diminishing returns of using the whole dataset.
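
Before running the full sweep, here is a minimal optional sketch of what a component actually looks like; the weights PCA assigns to each original stat live in pca.components_, and the 'PC1'/'PC2' labels are just names I am giving the first two components for readability:

# Sketch: fit a small 2-component PCA just to peek under the hood.
pca_demo = PCA(n_components = 2).fit(x)

# Each row of components_ is one component; each entry is the weight it puts
# on one of the original 30 stats. The largest absolute weights dominate it.
loadings = pd.DataFrame(pca_demo.components_, columns = features, index = ['PC1', 'PC2'])
print(loadings.loc['PC1'].abs().sort_values(ascending = False).head(5))

# And this is the explained variance ratio we are about to sweep over.
print(pca_demo.explained_variance_ratio_)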

To perform this analysis, we will first need to define an empty list to house the explained variance we get for each number of components.

variance_list = []

Next, we will loop over 2-30 components and calculate the explained variance for each PCA run with that number of components. To do this, we fit a PCA with that number of components on our standardized data and transform it to get the component values for each player, calculate the explained variance ratio (the share of the initial dataset’s variance that carries through in this PCA run), drop that value into our empty variance list, and print out the number of components and the explained variance ratio for reference.

for n_components in range(2, 31):
    pca = PCA(n_components = n_components)
    components = pca.fit_transform(x)
    pca_variance = sum(pca.explained_variance_ratio_)
    variance_list.append(pca_variance)
    print("For n_components = {}, explained variance ratio is {}".format(n_components, pca_variance))
For n_components = 2, explained variance ratio is 0.43925800571275253
For n_components = 3, explained variance ratio is 0.5676290453882779
For n_components = 4, explained variance ratio is 0.6268184169226134
For n_components = 5, explained variance ratio is 0.6755451751246611
For n_components = 6, explained variance ratio is 0.7169413050732253
For n_components = 7, explained variance ratio is 0.7490113047677626
For n_components = 8, explained variance ratio is 0.7782932819597761
For n_components = 9, explained variance ratio is 0.8053024983551436
For n_components = 10, explained variance ratio is 0.8287327218042008
For n_components = 11, explained variance ratio is 0.8506660305368813
For n_components = 12, explained variance ratio is 0.871763145207483
For n_components = 13, explained variance ratio is 0.8886322192886172
For n_components = 14, explained variance ratio is 0.9044923898506935
For n_components = 15, explained variance ratio is 0.9189769957564599
For n_components = 16, explained variance ratio is 0.931623538335299
For n_components = 17, explained variance ratio is 0.9432894291671697
For n_components = 18, explained variance ratio is 0.9535471373009715
For n_components = 19, explained variance ratio is 0.9620839200954308
For n_components = 20, explained variance ratio is 0.9700179494996609
For n_components = 21, explained variance ratio is 0.9770374173363002
For n_components = 22, explained variance ratio is 0.9833454693624457
For n_components = 23, explained variance ratio is 0.9889337515272374
For n_components = 24, explained variance ratio is 0.992841751423133
For n_components = 25, explained variance ratio is 0.9953131638537188
For n_components = 26, explained variance ratio is 0.9973323744199031
For n_components = 27, explained variance ratio is 0.9991920447481422
For n_components = 28, explained variance ratio is 0.9997028257674399
For n_components = 29, explained variance ratio is 0.9999118952543059
For n_components = 30, explained variance ratio is 0.9999999999999999

As we can see from that output, the explained variance goes up with each added component, which makes sense. However, as we get close to the full dataset, the variance starts going up by less and less with each iteration; these are the diminishing returns I was referring to earlier. Now that we have our explained variance for each number of components, we can begin analyzing to identify the optimal number of components to use.

Define How Many Components To Reduce To

The first thing we will want to do is visualize our previous output in a graph. We just need to define our variables, labels, and graph style, and matplotlib.pyplot will take care of the graphing itself.

plt.style.use('fivethirtyeight')

pca_fig, ax = plt.subplots()

ax.plot(range(2, 31), variance_list)

ax.set_xlabel('Component #')
ax.set_ylabel('Explained Variance Ratio')

ax.set_xticks(np.arange(2, 32, 2.0))

pca_fig.suptitle("Explained Variance Per Component #", weight = 'bold', size = 29)

Now that we can see the explained variance ratios visually, we are going to take a closer look at the change in slope of this curve from component to component to really get a feel for how much the ratio is changing.

# np is already imported, so we can use np.diff directly to get the change
# in explained variance from one component count to the next.
dx = 1
y = variance_list
dy = np.diff(y) / dx
print(dy)
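
If the raw array is hard to read, here is an optional way to line each slope value up with the component count it belongs to (same numbers as dy, just labeled):

# Optional: pair each component count with the marginal gain in explained
# variance it adds over the previous count.
for n, gain in zip(range(3, 31), dy):
    print("Adding component {}: +{:.4f} explained variance".format(n, gain))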

Next up, we are going to go ahead and graph the slope change for each component, similar to how we just graphed the explained variance ratio.

pca_deriv, ax = plt.subplots()

ax.plot(range(3, 31), dy)

ax.set_xlabel('Components')
ax.set_ylabel('dy / dx')

ax.set_xticks(np.arange(3, 32, 2.0))

pca_deriv.suptitle("Components Rate of Change", weight = 'bold', size = 29)

Now, this graph may look familiar to you if you have looked over the K-Means Clustering Algorithm Introduction post, where we used the elbow method to determine the number of clusters. While this is a completely different process, we are essentially looking for the same trend in the data. Unfortunately, this graph doesn’t have a clear-cut elbow to identify visually. However, there is a pretty stable slope from components 13-19, so we are going to use 13 components going forward. This is a subjective decision, and you are obviously free to play around with it in your own analysis, but we will stick with 13 for the rest of this tutorial.
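
If you want to lock that choice in and carry the reduced data into Part Two, a minimal sketch could look like the following; the 13 comes from the decision above, and the filename and dictionary layout are just examples I am picking here, not anything the dataset or notebook dictates:

# Fit the final PCA with our chosen 13 components on the standardized data.
pca_final = PCA(n_components = 13)
components_final = pca_final.fit_transform(x)

# Stash the reduced data (plus the player order) so Part Two can pick up here.
# The filename is just an example.
with open('pca_components_13.pkl', 'wb') as f:
    pickle.dump({'components': components_final, 'players': list(df.index)}, f)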

Alright everyone, that’s it for part one, and as long as it is, I think you will all agree that breaking this into two parts is definitely necessary! In part two we will get into defining the number of clusters we want to use, actually running the K-Means algorithm, and reviewing the clusters to see whether they make sense or not!
