High Ceilings and Floors Part Two
Now that we have reviewed what the standard deviation is, and how we can use it and its relationship to the mean to analyze the odds of a player scoring within a certain range of points, we can start to actually apply it.
First up, we are going to import the pandas package and pull in the dataset we have been using for most of our analysis: box scores for the 2019-2020 season, pulled in February of 2020.
import pandas as pd
df = pd.read_excel(r"C:\Users\nfwya\OneDrive\Code Backup\nba\boxScores\20200215_Pandas.xlsx")
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP
---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1
1 | Anthony Davis | LAL vs. PHX | 01/01/2020 | 38 | 26 | 11 | 2 | 0 | 1 | 3 | 42.2
2 | Anthony Tolliver | POR @ NYK | 01/01/2020 | 23 | 3 | 4 | 1 | 0 | 1 | 1 | 11.3
3 | Aron Baynes | PHX @ LAL | 01/01/2020 | 21 | 4 | 5 | 2 | 0 | 1 | 0 | 16.0
4 | Avery Bradley | LAL vs. PHX | 01/01/2020 | 30 | 18 | 5 | 1 | 3 | 0 | 1 | 33.5
Next we are going to review how we previously calculated each player's season average of fantasy points (at least up to that date). I'm not going to step through this with explanations since we have used it before, but here is the code as a refresher.
df2 = df[['Player', 'FP']]
df3= df2.groupby('Player',sort=False).mean().reset_index()
df3.head(5)
 | Player | FP
---|---|---
0 | Anfernee Simons | 14.944444
1 | Anthony Davis | 52.032609
2 | Anthony Tolliver | 10.405714
3 | Aron Baynes | 22.363636
4 | Avery Bradley | 14.587179
# Build a dictionary mapping each player's name to their season-average FP
seasonAveDict = {}
seasonAveDict.update(pd.Series(df3.FP.values, index=df3.Player).to_dict())
# Copy the player names into a new column, then replace each name with its average
df['SeasonAveOLD'] = df['Player']
df.replace({'SeasonAveOLD': seasonAveDict}, inplace=True)
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444
1 | Anthony Davis | LAL vs. PHX | 01/01/2020 | 38 | 26 | 11 | 2 | 0 | 1 | 3 | 42.2 | 52.032609
2 | Anthony Tolliver | POR @ NYK | 01/01/2020 | 23 | 3 | 4 | 1 | 0 | 1 | 1 | 11.3 | 10.405714
3 | Aron Baynes | PHX @ LAL | 01/01/2020 | 21 | 4 | 5 | 2 | 0 | 1 | 0 | 16.0 | 22.363636
4 | Avery Bradley | LAL vs. PHX | 01/01/2020 | 30 | 18 | 5 | 1 | 3 | 0 | 1 | 33.5 | 14.587179
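As an aside, if you don't need the intermediate dictionary, the same column can be built in a single line with groupby.transform, which broadcasts each group's mean back onto every row of that group. A minimal sketch, equivalent to the dictionary approach above:

# One-line equivalent: broadcast each player's mean FP onto all of their rows
df['SeasonAveOLD'] = df.groupby('Player')['FP'].transform('mean')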
Now we can see the mean in the 'SeasonAveOLD' field of our dataframe. There is another method we can use to calculate both the mean and standard deviation at once, but we will need a second field besides the player name to group by. For that we will use the team the player plays for. You may notice that we don't have a field specifically for the player's team, but we do have the matchup, and if you look closely, the player's team is always the first team listed. So we will use the same method we utilized when cleaning up our lineup outputs to pull out the team abbreviation.
We are going to use a lambda function to split the 'Match_Up' string on spaces and return the first segment of the resulting list.
teamFxn = lambda x: x['Match_Up'].split(" ")[0]
Next we will apply that function across the rows to create a new column titled 'Team'. This should leave just the player's team abbreviation in the 'Team' column.
df['Team'] = df.apply(teamFxn, axis=1)
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444 | POR
1 | Anthony Davis | LAL vs. PHX | 01/01/2020 | 38 | 26 | 11 | 2 | 0 | 1 | 3 | 42.2 | 52.032609 | LAL
2 | Anthony Tolliver | POR @ NYK | 01/01/2020 | 23 | 3 | 4 | 1 | 0 | 1 | 1 | 11.3 | 10.405714 | POR
3 | Aron Baynes | PHX @ LAL | 01/01/2020 | 21 | 4 | 5 | 2 | 0 | 1 | 0 | 16.0 | 22.363636 | PHX
4 | Avery Bradley | LAL vs. PHX | 01/01/2020 | 30 | 18 | 5 | 1 | 3 | 0 | 1 | 33.5 | 14.587179 | LAL
Doing a quick check here, we can see both matchup formats, '@' and 'vs.', and in both cases the correct team is returned.
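If you want something firmer than an eyeball check, looking at the unique values the split produced is a quick sanity test; with 30 NBA teams in the dataset, anything other than 30 three-letter abbreviations would be a red flag:

# A vectorized alternative to the lambda: df['Match_Up'].str.split(' ').str[0]
print(df['Team'].nunique())          # should be 30
print(sorted(df['Team'].unique()))   # should all be abbreviations like 'POR'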
Next up we are going to create a new dataframe called 'dfStat', indexed by player and team. We will aggregate the fantasy points for each player/team pair and return the mean and standard deviation as separate columns using the .agg function.
dfStat = df.groupby(['Player', 'Team']).FP.agg(['mean', 'std'])  # a list (rather than a set) keeps the column order predictable
dfStat.head(5)
Player | Team | mean | std
---|---|---|---
Aaron Gordon | ORL | 29.914286 | 9.884183
Aaron Holiday | IND | 19.195918 | 10.805399
Abdel Nader | OKC | 11.038235 | 6.681363
Al Horford | PHI | 30.150000 | 8.966906
Al-Farouq Aminu | ORL | 16.170588 | 9.469937
Now, you can tell that 'Player' and 'Team' form a two-level index because they appear as index levels rather than regular data columns. Unfortunately, none of the players in this preview appear in the preview of our previous dataframe, so we will combine this dataframe back into our original one using the merge function on the player/team index, ensuring the correct numbers are attributed to the correct player. After that we will just sort by game date for easier viewing.
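Before running the merge, we can also confirm the index structure programmatically rather than trusting the display:

print(dfStat.index.names)    # should show ['Player', 'Team']
print(dfStat.index.nlevels)  # should show 2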
df = df.merge(dfStat, left_on=['Player', 'Team'], right_index=True)
df = df.sort_values('Game_Date')
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444 | POR | 14.944444 | 7.667123
10 | CJ McCollum | POR @ NYK | 01/01/2020 | 32 | 17 | 4 | 4 | 0 | 0 | 0 | 27.8 | 33.944231 | POR | 33.944231 | 10.935020
46 | Kyle Korver | MIL vs. MIN | 01/01/2020 | 18 | 8 | 4 | 2 | 0 | 1 | 3 | 15.8 | 12.036957 | MIL | 12.036957 | 6.266112
47 | Kyle Kuzma | LAL vs. PHX | 01/01/2020 | 27 | 19 | 4 | 1 | 1 | 0 | 0 | 28.3 | 20.513636 | LAL | 20.513636 | 9.673642
48 | LeBron James | LAL vs. PHX | 01/01/2020 | 38 | 31 | 13 | 12 | 2 | 1 | 5 | 68.6 | 51.711765 | LAL | 51.711765 | 9.848099
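One detail worth noticing in this output: for most rows 'SeasonAveOLD' matches the new 'mean' column exactly, but for D'Angelo Russell (further down the data) they differ slightly, 38.160000 vs. 37.981818. That's because 'SeasonAveOLD' was grouped by player only, while 'mean' is grouped by player and team, so any player whose games are split across two teams, likely from a mid-season move, gets a different average in each grouping. A quick sketch to surface those rows if you're curious:

# Rows where the player-only average disagrees with the player/team average
mismatch = df[(df['SeasonAveOLD'].astype(float) - df['mean']).abs() > 1e-6]
print(mismatch['Player'].unique())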
Z-Score
We've talked previously about why the number of standard deviations away from the mean is relevant, but we've been looking at it in a pretty big-picture way. One way we can start to drill down and get a little more information is the z-score. The z-score is simply a measurement of exactly how many standard deviations away from the mean a score is. It is calculated pretty simply: take the score in question, subtract that player's mean, then divide the result by that player's standard deviation.
Z-Score = ( (Fantasy Points) - (Mean) ) / (Standard Deviation)
To take a look at this, we will go ahead and add a new column to our dataframe using the previous formula.
df['Z'] = (df['FP'] - df['mean'])/df['std']
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std | Z
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444 | POR | 14.944444 | 7.667123 | -0.240565
10 | CJ McCollum | POR @ NYK | 01/01/2020 | 32 | 17 | 4 | 4 | 0 | 0 | 0 | 27.8 | 33.944231 | POR | 33.944231 | 10.935020 | -0.561886
46 | Kyle Korver | MIL vs. MIN | 01/01/2020 | 18 | 8 | 4 | 2 | 0 | 1 | 3 | 15.8 | 12.036957 | MIL | 12.036957 | 6.266112 | 0.600539
47 | Kyle Kuzma | LAL vs. PHX | 01/01/2020 | 27 | 19 | 4 | 1 | 1 | 0 | 0 | 28.3 | 20.513636 | LAL | 20.513636 | 9.673642 | 0.804905
48 | LeBron James | LAL vs. PHX | 01/01/2020 | 38 | 31 | 13 | 12 | 2 | 1 | 5 | 68.6 | 51.711765 | LAL | 51.711765 | 9.848099 | 1.714873
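As a quick spot check, you can verify the LeBron James row by hand; plugging his numbers into the formula returns the same Z value pandas computed:

# (FP - mean) / std for LeBron's 01/01/2020 game
print((68.6 - 51.711765) / 9.848099)   # ≈ 1.714873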
One way to think of the Z-score graphically is that it is the X-value for any point plotted on the bell curve we used previously. If you normalize a player's scores so that the mean is 0 and the standard deviation is 1, the Z-score is the X coordinate of the point forming the curve.
We aren't going to go down that path, though, as I don't think it's terribly useful, and it would bring in a lot of unnecessary information that doesn't add much to our problem here. For now, let's just take the z-score for what it is: a one-stop shop for understanding how mathematically likely or unlikely a player's performance was given their body of work.
Relating Our Statistics Back To The Bell Curve
Now, just as a quick exercise to relate back to our earlier bell curve visualization, we are going to add one more column that groups our Z-scores into the buckets we established on the bell curve: within 1 standard deviation of the mean, between 1 and 2 standard deviations, and so on. Note that the assignments below run from the widest band to the narrowest on purpose, so each narrower bucket overwrites the wider ones (and anything beyond 4 standard deviations is left as NaN).
df.loc[abs(df['Z']) <= 4, 'whereBell'] = 4
df.loc[abs(df['Z']) <= 3, 'whereBell'] = 3
df.loc[abs(df['Z']) <= 2, 'whereBell'] = 2
df.loc[abs(df['Z']) <= 1, 'whereBell'] = 1
df.sort_values('whereBell', ascending=False).head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std | Z | whereBell
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
11695 | Duncan Robinson | MIA vs. ATL | 12/10/2019 | 43 | 34 | 3 | 4 | 0 | 1 | 0 | 46.6 | 20.439623 | MIA | 20.439623 | 8.244796 | 3.172956 | 4.0
5943 | Cody Zeller | CHA @ LAL | 10/27/2019 | 29 | 19 | 14 | 1 | 3 | 3 | 0 | 55.3 | 23.750000 | CHA | 23.750000 | 10.027355 | 3.146393 | 4.0
8145 | JaVale McGee | LAL vs. GSW | 11/13/2019 | 28 | 18 | 17 | 2 | 3 | 3 | 2 | 57.4 | 20.578431 | LAL | 20.578431 | 9.663091 | 3.810537 | 4.0
7434 | D'Angelo Russell | GSW @ MIN | 11/08/2019 | 40 | 52 | 9 | 5 | 3 | 2 | 3 | 82.3 | 38.160000 | GSW | 37.981818 | 13.449918 | 3.295052 | 4.0
10364 | Ben McLemore | HOU vs. ATL | 11/30/2019 | 34 | 24 | 13 | 1 | 1 | 0 | 1 | 43.1 | 16.598077 | HOU | 16.598077 | 8.460971 | 3.132255 | 4.0
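As an aside, pandas can do this same bucketing in one call with pd.cut. A minimal sketch, assuming the 'Z' column built above and writing to an illustrative separate column so the original stays untouched; note that pd.cut returns a categorical column, which can change groupby behavior downstream, so the .loc approach above is what the rest of this walkthrough uses:

# Bin |Z| into [0, 1], (1, 2], (2, 3], (3, 4] and label the buckets 1 through 4
df['whereBell2'] = pd.cut(df['Z'].abs(), bins=[0, 1, 2, 3, 4],
                          labels=[1, 2, 3, 4], include_lowest=True)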
Now, we are going to run a quick analysis to see how many instances actually fall into each bucket and how that compares to the bell curve numbers. Keep in mind that this is not a standardized dataset, so it likely won't match up exactly, but that doesn't mean we won't be able to learn something valuable from the data.
df2 = df[['Player', 'FP', 'whereBell']].groupby(['Player', 'whereBell']).count()
df2.head()
Player | whereBell | FP
---|---|---
Aaron Gordon | 1.0 | 32
 | 2.0 | 15
 | 3.0 | 2
Aaron Holiday | 1.0 | 30
 | 2.0 | 18
This is a multi-level dataframe. To make it a little easier to work with, we are going to 'unstack' it, which pivots the 'whereBell' index level out into columns.
df3 = df2.unstack()
df3.head()
Player | whereBell 1.0 | whereBell 2.0 | whereBell 3.0 | whereBell 4.0
---|---|---|---|---
Aaron Gordon | 32.0 | 15.0 | 2.0 | NaN
Aaron Holiday | 30.0 | 18.0 | 1.0 | NaN
Abdel Nader | 24.0 | 8.0 | 2.0 | NaN
Al Horford | 36.0 | 11.0 | 3.0 | NaN
Al-Farouq Aminu | 12.0 | 5.0 | NaN | NaN
Ah, much better. Now we can see how many games each player has in each bucket through roughly the first 50 games of the season. To take a big-picture look at the data, we are just going to run the simple .info() function and read the non-null counts for each bucket column.
df3.info()
<class 'pandas.core.frame.DataFrame'>
Index: 340 entries, Aaron Gordon to Zach LaVine
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (FP, 1.0) 340 non-null float64
1 (FP, 2.0) 339 non-null float64
2 (FP, 3.0) 285 non-null float64
3 (FP, 4.0) 34 non-null float64
dtypes: float64(4)
memory usage: 13.3+ KB
WOW. Across ~50 games for each team, with 8-ish players getting decent minutes each game, only 34 players had even a single game land between 3 and 4 standard deviations of their mean, and that's counting both the positive and negative directions. (One caveat: .info() reports non-null counts, so these are counts of players with at least one game in a bucket, not counts of individual games.) Let's do a quick breakdown of how the percentages per bucket work out compared to the bell curve percentages we looked at before. We're just going to total up the counts, then divide each bucket by that total and compare to the bell curve.
340+339+285+34
998
print(f'Within 1 STD of mean: {round(340/998*100,2)}%, bell curve = 68%')
print(f'Between 1 and 2 STD of mean: {round(339/998*100,2)}%, bell curve = 28%')
print(f'Between 2 and 3 STD of mean: {round(285/998*100,2)}%, bell curve = 4%')
print(f'Between 3 and 4 STD of mean: {round(34/998*100,2)}%, bell curve = 0.2%')
Within 1 STD of mean: 34.07%, bell curve = 68%
Between 1 and 2 STD of mean: 33.97%, bell curve = 28%
Between 2 and 3 STD of mean: 28.56%, bell curve = 4%
Between 3 and 4 STD of mean: 3.41%, bell curve = 0.2%
As we can see, the numbers are pretty far off across the board. That makes sense: this is not a standardized dataset, and on top of that we are comparing player counts rather than game counts, which overweights the rarer buckets. And that's OKAY; we can still learn quite a bit from this data. For instance, nearly 29% of players had at least one game land between 2 and 3 standard deviations of their mean, and that could be in a good way or a bad way, since we aren't separating positive from negative z-scores here.
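If you want the true per-game breakdown, counting games rather than players, value_counts gets it directly from the 'whereBell' column; I'll leave running it as an exercise:

# Share of all player-games landing in each bucket
print(df['whereBell'].value_counts(normalize=True).sort_index())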
Next up, we will do the exact same analysis for Aaron Gordon alone, and see if his numbers come closer to the bell curve breakdown than the league-wide ones did.
df2.head()
Player | whereBell | FP
---|---|---
Aaron Gordon | 1.0 | 32
 | 2.0 | 15
 | 3.0 | 2
Aaron Holiday | 1.0 | 30
 | 2.0 | 18
32+15+2
49
print(f'Within 1 STD of mean: {round(32/49*100,2)}%, bell curve = 68%')
print(f'Between 1 and 2 STD of mean: {round(15/49*100,2)}%, bell curve = 28%')
print(f'Between 2 and 3 STD of mean: {round(2/49*100,2)}%, bell curve = 4%')
print(f'Between 3 and 4 STD of mean: {round(0/49*100,2)}%, bell curve = 0.2%')
Within 1 STD of mean: 65.31%, bell curve = 68%
Between 1 and 2 STD of mean: 30.61%, bell curve = 28%
Between 2 and 3 STD of mean: 4.08%, bell curve = 4%
Between 3 and 4 STD of mean: 0.0%, bell curve = 0.2%
As we can see here, Aaron Gordon's numbers are MUCH closer to what we would 'expect' from a normal distribution (it helps that for a single player we are counting actual games, which is a fairer comparison to the bell curve than the league-wide player counts were). And you know what, that's not necessarily better or worse than being way off! The point of this exercise isn't to prove that the standard deviation/bell curve approach is the BEST or WORST way to help with lineup construction and the player pool, but rather to do some hands-on learning and understand HOW we can use it. In Aaron Gordon's case, we would need to investigate his stats a little further to figure out whether he fits better as a high-floor or high-ceiling player, and the first thing to look at is his actual standard deviation. If he has a fairly high standard deviation, then having about 65% of his games within 1 STD of his mean really doesn't give us that much information. Let's take a look and see.
df.sort_values('Player').head()
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std | Z | whereBell
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
8950 | Aaron Gordon | ORL @ TOR | 11/20/2019 | 12 | 2 | 2 | 0 | 0 | 0 | 0 | 4.4 | 29.914286 | ORL | 29.914286 | 9.884183 | -2.581325 | 3.0
2575 | Aaron Gordon | ORL vs. OKC | 01/22/2020 | 35 | 14 | 8 | 3 | 1 | 0 | 0 | 31.1 | 29.914286 | ORL | 29.914286 | 9.884183 | 0.119961 | 1.0
5751 | Aaron Gordon | ORL @ ATL | 10/26/2019 | 30 | 10 | 7 | 3 | 2 | 0 | 1 | 27.9 | 29.914286 | ORL | 29.914286 | 9.884183 | -0.203789 | 1.0
4247 | Aaron Gordon | ORL @ BOS | 02/05/2020 | 31 | 23 | 10 | 4 | 2 | 1 | 5 | 45.0 | 29.914286 | ORL | 29.914286 | 9.884183 | 1.526248 | 2.0
10705 | Aaron Gordon | ORL @ WAS | 12/03/2019 | 32 | 18 | 11 | 2 | 0 | 0 | 0 | 34.2 | 29.914286 | ORL | 29.914286 | 9.884183 | 0.433593 | 1.0
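If you'd rather pull his numbers directly instead of scanning a sorted head, a one-liner does it (same 'mean' and 'std' columns as above):

# Every Aaron Gordon row carries the same mean/std, so drop the duplicates
print(df.loc[df['Player'] == 'Aaron Gordon', ['mean', 'std']].drop_duplicates())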
In Conclusion
Now that you have a taste for digging into these statistics, that's going to be all for Part 2. We will dive deeper into how we can use this type of analysis for the player pool in Part 3! Do you think Aaron Gordon is better classified as a high-ceiling or high-floor player? Is 9.88 TOO high a standard deviation, or do you think that's pretty consistent? Let me know over in the YouTube comments.