Identifying NBA Player Archetypes Using K-Means Clustering — Part Two
Now that we’ve identified the number of principal components we’re going to use, we can move on to using those components to decide how many clusters to sort the players into.
If you have no idea what I’m talking about, check out Part One of this series to get caught up on what we’re working on.
Okay, now that you’re caught up, we’ll pick up right where we left off. Using 13 PCA components, we can get to work identifying our actual player clusters. The first step is to assign each player a score for each principal component. To do this we’ll use the pca.fit_transform method to project each player’s stats onto the principal components, giving us the values we’ll cluster on.
import pandas as pd
from sklearn.decomposition import PCA

# x (the scaled stat matrix) and dfPlayerCol carry over from Part One
pca = PCA(n_components=13)
components = pca.fit_transform(x)  # each row holds one player's 13 component scores

pc_cols = ['PC {}'.format(i) for i in range(1, 14)]
pca_df = pd.DataFrame(data=components, columns=pc_cols)
pca_df['Player'] = dfPlayerCol['Player']
pca_df = pca_df[['Player'] + pc_cols]  # move the player name to the front
pca_df.head()
| | Player | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 | PC 9 | PC 10 | PC 11 | PC 12 | PC 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Langston Galloway | -0.629264 | -1.488003 | -3.224213 | 0.429696 | -0.989677 | -0.307569 | -0.320236 | 0.388100 | -0.168554 | 0.233534 | 0.018514 | 0.162524 | -0.264568 |
| 1 | Anfernee Simons | -2.552319 | -0.620829 | 0.031621 | -0.291686 | -0.299234 | -0.845852 | 0.996665 | 0.410733 | 0.063045 | -0.192359 | -0.278499 | 0.476688 | -0.713373 |
| 2 | Bobby Portis | 1.905637 | 1.812714 | -1.245174 | -0.613802 | 1.465176 | -1.442102 | 0.345435 | -0.248346 | -0.164198 | -0.013879 | -0.502754 | -0.111269 | -0.157939 |
| 3 | Coby White | -2.888972 | 0.151796 | -0.401778 | 0.659674 | 0.342884 | 0.095026 | 0.584003 | -0.854632 | -0.088465 | 0.647163 | 0.234574 | 0.269422 | -0.047146 |
| 4 | Dorian Finney-Smith | 1.212326 | -0.040621 | -2.433643 | 0.881384 | -1.346852 | -0.250738 | -0.215710 | -1.336844 | 0.337995 | 0.558351 | -0.081923 | 0.766397 | -0.000821 |
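If you’re curious what each of these components actually captures, you can inspect the PCA loadings. Here’s a minimal sketch; it assumes the list of stat column names that were stacked into x in Part One is still available in a variable, which I’m calling stat_cols here (a hypothetical name):
# Hypothetical: stat_cols is the list of stat column names that went into x in Part One
loadings = pd.DataFrame(pca.components_.T, index=stat_cols, columns=pc_cols)

# The stats with the largest absolute loadings are the ones driving that component
print(loadings['PC 1'].abs().sort_values(ascending=False).head())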
Next, just for our own reference, we will take a look at the explained variance ratio of each of these components to see how each one contributes to the roughly 89% total explained variance:
print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))
[0.23260274 0.20665526 0.12837104 0.05918937 0.04872676 0.04139613
0.03207 0.02928198 0.02700922 0.02343022 0.02193331 0.02109711
0.01686907]
0.8886322192886177
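If you’d rather see the running total, a quick sketch with NumPy’s cumsum shows how the explained variance accumulates component by component; the last entry should match the ~0.8886 total above:
import numpy as np

# Cumulative explained variance across the 13 components
print(np.cumsum(pca.explained_variance_ratio_))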
Continuing our analysis, we are now ready to start evaluating how many clusters we want to break the player pool into. Our metric for this is going to be the silhouette score. In short, the silhouette score evaluates how similar an item is to the other members of its own cluster, and how different it is from the members of other clusters. The score ranges from -1 to 1, with 1 being a perfect score and -1 the worst possible. A score of 0 indicates that the clusters overlap pretty evenly, so at a minimum we want to be greater than 0, and ideally as close to 1 as possible.
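To build some intuition for the metric, here’s a small standalone sketch (not part of our player data) using scikit-learn’s make_blobs: tight, well-separated blobs score close to 1, while heavily overlapping blobs drift toward 0:
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated clusters -> silhouette score near 1
X_tight, labels_tight = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=99)
print(silhouette_score(X_tight, labels_tight))

# Three heavily overlapping clusters -> silhouette score near 0
X_loose, labels_loose = make_blobs(n_samples=300, centers=3, cluster_std=5.0, random_state=99)
print(silhouette_score(X_loose, labels_loose))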
In order to evaluate different cluster counts, we are going to follow a process very similar to the one we used to pick the number of principal components: we will loop over a range of candidate counts, run a K-means clustering with that many clusters, and review the silhouette score for each run.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stack the 13 component scores into a matrix for clustering
x = pca_df[pc_cols].to_numpy()

silhouette = []
for n_clusters in range(2, 20):
    kmeans = KMeans(n_clusters=n_clusters, random_state=99)
    cluster_labels = kmeans.fit_predict(x)
    score = silhouette_score(x, cluster_labels)
    silhouette.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, silhouette score is 0.215000869466846
For n_clusters = 3, silhouette score is 0.2169974785847639
For n_clusters = 4, silhouette score is 0.20584173489771507
For n_clusters = 5, silhouette score is 0.2000744279188773
For n_clusters = 6, silhouette score is 0.17902558778290453
For n_clusters = 7, silhouette score is 0.16956605623506713
For n_clusters = 8, silhouette score is 0.1569555934107893
For n_clusters = 9, silhouette score is 0.15991829037393543
For n_clusters = 10, silhouette score is 0.14333044540082515
For n_clusters = 11, silhouette score is 0.1349647907363549
For n_clusters = 12, silhouette score is 0.12399063929260486
For n_clusters = 13, silhouette score is 0.12660085033254045
For n_clusters = 14, silhouette score is 0.13546326325528482
For n_clusters = 15, silhouette score is 0.1358449718394935
For n_clusters = 16, silhouette score is 0.12811274408179157
For n_clusters = 17, silhouette score is 0.12896832387107143
For n_clusters = 18, silhouette score is 0.12408560098941573
For n_clusters = 19, silhouette score is 0.13023261356005114
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

# Plot the silhouette score against the number of clusters
silhouette_fig, ax = plt.subplots()
ax.plot(range(2, 20), silhouette)
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette score')
ax.set_xticks(np.arange(2, 21, 3.0))
silhouette_fig.suptitle("Identifying Optimal Cluster #", weight='bold', size=18)
silhouette_fig.savefig('silhouette-score.png', dpi=400, bbox_inches='tight')
Now, here is where you have to stop and think a little bit. Clearly the ‘best’ silhouette scores belong to a very small number of clusters, but you want to balance good separation against actionable information. We are trying to get more granular than the typical 5 positions here, so you may want to use a minimum of 6. Or you may decide you just want to break players into ball handlers, wings, and bigs, so 3 may be good. However, unless you are manually creating your clusters, you can’t be positive how the pool will be broken up. Personally, I like to have around 10 or more clusters, so I’m going to do a little more analysis here to find the best cluster count relative to the counts around it. To do this I’m going to look at the percent improvement from one cluster count to the next; in other words, how much of the remaining gap between the score and a perfect 1 each additional cluster closes. For example, going from 2 clusters (score 0.2150) to 3 clusters (score 0.2170) gives 1 - (1 - 0.2170) / (1 - 0.2150) ≈ 0.25%, which matches the first value printed below.
silhouette_diff = []
for i in range(1, len(silhouette)):
    # Relative shrinkage of the gap between the score and a perfect 1
    improvement = 1 - ((1 - silhouette[i]) / (1 - silhouette[i - 1]))
    silhouette_diff.append(improvement)
    # silhouette[i] is the score for i + 2 clusters, since our range started at 2
    print("For n_cluster = {}, percent improvement = {}".format(i + 2, improvement))
For n_cluster = 3, percent improvement = 0.0025434539227603414
For n_cluster = 4, percent improvement = -0.014247391779639962
For n_cluster = 5, percent improvement = -0.007262163264264432
For n_cluster = 6, percent improvement = -0.026313498243606848
For n_cluster = 7, percent improvement = -0.011522322020111941
For n_cluster = 8, percent improvement = -0.015185389420747653
For n_cluster = 9, percent improvement = 0.0035142833995336353
For n_cluster = 10, percent improvement = -0.019745513779241497
For n_cluster = 11, percent improvement = -0.009765322719311964
For n_cluster = 12, percent improvement = -0.012686363891582841
For n_cluster = 13, percent improvement = 0.0029796611280816787
For n_cluster = 14, percent improvement = 0.010147036353445826
For n_cluster = 15, percent improvement = 0.00044151806162218143
For n_cluster = 16, percent improvement = -0.008947732184306334
For n_cluster = 17, percent improvement = 0.000981296358528505
For n_cluster = 18, percent improvement = -0.005605677744529025
For n_cluster = 19, percent improvement = 0.007017823405550683
import matplotlib.ticker as mtick

plt.style.use('fivethirtyeight')

silhouette_imp_fig, ax = plt.subplots()
# silhouette_diff[0] is the improvement at 3 clusters, so the x range starts at 3
ax.plot(range(3, 20), silhouette_diff)
ax.set_xlabel('Number of clusters')
ax.set_ylabel('% silhouette improvement')
ax.set_xticks(np.arange(3, 20, 2.0))
# Format the y-axis ticks as percentages
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
silhouette_imp_fig.suptitle("Identifying Optimal Cluster #", weight='bold', size=17)
As we can see here, there are two notable spikes in percent improvement: one at 9 clusters and another in the 13-to-14 range. I personally like 13. It’s my lucky number, and it kind of makes sense that with 13 principal components, 13 clusters would be a decent guess, but that’s not the driver here. There is definitely room for interpretation in how many clusters you use and your reasons for doing so, and I encourage you to experiment and find what works for you!
Next, we will assign each player their cluster in a dataframe and export it to Excel, so we can review the results and think about how we want to tweak things.
# Fit the final model and grab each player's cluster label
kmeans = KMeans(n_clusters=13, random_state=1)
kmeans.fit(x)

df_cluster = pd.DataFrame()
df_cluster['Player'] = dfPlayerCol['Player']
df_cluster['Cluster'] = kmeans.labels_  # labels_ holds each player's cluster assignment
df_cluster.to_excel("playerClusterNew2020.xlsx")
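Before opening the spreadsheet, a quick sketch like this gives you a feel for how the players were distributed; the cluster number 0 below is just an arbitrary example:
# How many players landed in each cluster?
print(df_cluster['Cluster'].value_counts().sort_index())

# Peek at a few members of one cluster (0 chosen arbitrarily)
print(df_cluster.loc[df_cluster['Cluster'] == 0, 'Player'].head(10))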