High Ceilings and Floors Part Two
Now that we have reviewed what the standard deviation is, and how we can use it and its relationship to the mean to analyze the odds of a player scoring within a certain range of points, we can start to actually apply it.
First up, we are going to import the pandas package and pull in the dataset we have been using for most of our analysis: box scores for the 2019-2020 season, pulled in February of 2020.
import pandas as pd
df = pd.read_excel(r"C:\Users\nfwya\OneDrive\Code Backup\nba\boxScores\20200215_Pandas.xlsx")
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP
---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1
1 | Anthony Davis | LAL vs. PHX | 01/01/2020 | 38 | 26 | 11 | 2 | 0 | 1 | 3 | 42.2
2 | Anthony Tolliver | POR @ NYK | 01/01/2020 | 23 | 3 | 4 | 1 | 0 | 1 | 1 | 11.3
3 | Aron Baynes | PHX @ LAL | 01/01/2020 | 21 | 4 | 5 | 2 | 0 | 1 | 0 | 16.0
4 | Avery Bradley | LAL vs. PHX | 01/01/2020 | 30 | 18 | 5 | 1 | 3 | 0 | 1 | 33.5
Next we are going to review how we previously calculated each player's season average of fantasy points (at least up to that date). I'm not going to step through this with explanations since we have used it before, but here is the code as a refresher.
df2 = df[['Player', 'FP']]
df3= df2.groupby('Player',sort=False).mean().reset_index()
df3.head(5)
 | Player | FP
---|---|---
0 | Anfernee Simons | 14.944444
1 | Anthony Davis | 52.032609
2 | Anthony Tolliver | 10.405714
3 | Aron Baynes | 22.363636
4 | Avery Bradley | 14.587179
# Build a dictionary mapping each player's name to their season-average FP
seasonAveDict = {}
seasonAveDict.update(pd.Series(df3.FP.values, index=df3.Player).to_dict())
# Copy the player names into a new column, then replace each name with its average
df['SeasonAveOLD'] = df['Player']
df.replace({'SeasonAveOLD': seasonAveDict}, inplace=True)
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444
1 | Anthony Davis | LAL vs. PHX | 01/01/2020 | 38 | 26 | 11 | 2 | 0 | 1 | 3 | 42.2 | 52.032609
2 | Anthony Tolliver | POR @ NYK | 01/01/2020 | 23 | 3 | 4 | 1 | 0 | 1 | 1 | 11.3 | 10.405714
3 | Aron Baynes | PHX @ LAL | 01/01/2020 | 21 | 4 | 5 | 2 | 0 | 1 | 0 | 16.0 | 22.363636
4 | Avery Bradley | LAL vs. PHX | 01/01/2020 | 30 | 18 | 5 | 1 | 3 | 0 | 1 | 33.5 | 14.587179
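As an aside, if you don't need the intermediate dictionary, the same column can be built in a single line with groupby.transform, which broadcasts each group's mean back onto every row of that group. A minimal sketch, equivalent to the dictionary approach above:

# One-line equivalent: broadcast each player's mean FP onto all of their rows
df['SeasonAveOLD'] = df.groupby('Player')['FP'].transform('mean')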
Now we can see the mean in the 'SeasonAveOLD' field of our dataframe. There is another method we can use to calculate both the mean and standard deviation at once, but we will need a second field besides the player name to group by. For that we will use the team the player plays for. You may notice that we don't have a field specifically for the player's team, but we do have the matchup, and if you look closely, the player's team is always the first team listed. So we will use the same method we utilized when cleaning up our lineup outputs to pull out the team abbreviation.
We are going to use a lambda function to split the 'Match_Up' string on spaces and return the first segment of the resulting list.
teamFxn = lambda x: x['Match_Up'].split(" ")[0]
Next we will apply that function across the rows to create a new column titled 'Team'. This should leave just the player's team abbreviation in the 'Team' column.
df['Team'] = df.apply(teamFxn, axis=1)
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444 | POR
1 | Anthony Davis | LAL vs. PHX | 01/01/2020 | 38 | 26 | 11 | 2 | 0 | 1 | 3 | 42.2 | 52.032609 | LAL
2 | Anthony Tolliver | POR @ NYK | 01/01/2020 | 23 | 3 | 4 | 1 | 0 | 1 | 1 | 11.3 | 10.405714 | POR
3 | Aron Baynes | PHX @ LAL | 01/01/2020 | 21 | 4 | 5 | 2 | 0 | 1 | 0 | 16.0 | 22.363636 | PHX
4 | Avery Bradley | LAL vs. PHX | 01/01/2020 | 30 | 18 | 5 | 1 | 3 | 0 | 1 | 33.5 | 14.587179 | LAL
Doing a quick check here, we can see both matchup formats, '@' and 'vs.', and in both cases the correct team is returned.
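If you want something firmer than an eyeball check, looking at the unique values the split produced is a quick sanity test; with 30 NBA teams in the dataset, anything other than 30 three-letter abbreviations would be a red flag:

# A vectorized alternative to the lambda: df['Match_Up'].str.split(' ').str[0]
print(df['Team'].nunique())          # should be 30
print(sorted(df['Team'].unique()))   # should all be abbreviations like 'POR'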
Next up we are going to create a new dataframe called 'dfStat', indexed by player and team. We will aggregate the fantasy points for each player/team pair and return the mean and standard deviation as separate columns using the .agg function.
dfStat = df.groupby(['Player', 'Team']).FP.agg(['mean', 'std'])  # a list (rather than a set) keeps the column order predictable
dfStat.head(5)
Player | Team | mean | std
---|---|---|---
Aaron Gordon | ORL | 29.914286 | 9.884183
Aaron Holiday | IND | 19.195918 | 10.805399
Abdel Nader | OKC | 11.038235 | 6.681363
Al Horford | PHI | 30.150000 | 8.966906
Al-Farouq Aminu | ORL | 16.170588 | 9.469937
Now, you can tell that 'Player' and 'Team' form a two-level index because they appear as index levels rather than regular data columns. Unfortunately, none of the players in this preview appear in the preview of our previous dataframe, so we will combine this dataframe back into our original one using the merge function on the player/team index, ensuring the correct numbers are attributed to the correct player. After that we will just sort by game date for easier viewing.
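Before running the merge, we can also confirm the index structure programmatically rather than trusting the display:

print(dfStat.index.names)    # should show ['Player', 'Team']
print(dfStat.index.nlevels)  # should show 2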
df = df.merge(dfStat, left_on=['Player', 'Team'], right_index=True)
df = df.sort_values('Game_Date')
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444 | POR | 14.944444 | 7.667123
10 | CJ McCollum | POR @ NYK | 01/01/2020 | 32 | 17 | 4 | 4 | 0 | 0 | 0 | 27.8 | 33.944231 | POR | 33.944231 | 10.935020
46 | Kyle Korver | MIL vs. MIN | 01/01/2020 | 18 | 8 | 4 | 2 | 0 | 1 | 3 | 15.8 | 12.036957 | MIL | 12.036957 | 6.266112
47 | Kyle Kuzma | LAL vs. PHX | 01/01/2020 | 27 | 19 | 4 | 1 | 1 | 0 | 0 | 28.3 | 20.513636 | LAL | 20.513636 | 9.673642
48 | LeBron James | LAL vs. PHX | 01/01/2020 | 38 | 31 | 13 | 12 | 2 | 1 | 5 | 68.6 | 51.711765 | LAL | 51.711765 | 9.848099
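One detail worth noticing in this output: for most rows 'SeasonAveOLD' matches the new 'mean' column exactly, but for D'Angelo Russell (further down the data) they differ slightly, 38.160000 vs. 37.981818. That's because 'SeasonAveOLD' was grouped by player only, while 'mean' is grouped by player and team, so any player whose games are split across two teams, likely from a mid-season move, gets a different average in each grouping. A quick sketch to surface those rows if you're curious:

# Rows where the player-only average disagrees with the player/team average
mismatch = df[(df['SeasonAveOLD'].astype(float) - df['mean']).abs() > 1e-6]
print(mismatch['Player'].unique())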
Z-Score
We've talked previously about why the number of standard deviations away from the mean is relevant, but we've been looking at it in a pretty big-picture way. One way we can start to drill down and get a little more information is the z-score. The z-score is simply a measurement of exactly how many standard deviations away from the mean a score is. It is calculated pretty simply: take the score in question, subtract that player's mean, then divide the result by that player's standard deviation.
Z-Score = ( (Fantasy Points) - (Mean) ) / (Standard Deviation)
To take a look at this, we will go ahead and add a new column to our dataframe using the previous formula.
df['Z'] = (df['FP'] - df['mean'])/df['std']
df.head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std | Z
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Anfernee Simons | POR @ NYK | 01/01/2020 | 26 | 3 | 3 | 3 | 1 | 0 | 1 | 13.1 | 14.944444 | POR | 14.944444 | 7.667123 | -0.240565
10 | CJ McCollum | POR @ NYK | 01/01/2020 | 32 | 17 | 4 | 4 | 0 | 0 | 0 | 27.8 | 33.944231 | POR | 33.944231 | 10.935020 | -0.561886
46 | Kyle Korver | MIL vs. MIN | 01/01/2020 | 18 | 8 | 4 | 2 | 0 | 1 | 3 | 15.8 | 12.036957 | MIL | 12.036957 | 6.266112 | 0.600539
47 | Kyle Kuzma | LAL vs. PHX | 01/01/2020 | 27 | 19 | 4 | 1 | 1 | 0 | 0 | 28.3 | 20.513636 | LAL | 20.513636 | 9.673642 | 0.804905
48 | LeBron James | LAL vs. PHX | 01/01/2020 | 38 | 31 | 13 | 12 | 2 | 1 | 5 | 68.6 | 51.711765 | LAL | 51.711765 | 9.848099 | 1.714873
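As a quick spot check, you can verify the LeBron James row by hand; plugging his numbers into the formula returns the same Z value pandas computed:

# (FP - mean) / std for LeBron's 01/01/2020 game
print((68.6 - 51.711765) / 9.848099)   # ≈ 1.714873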
One way to think of the Z-score graphically is that it is the X-value for any point plotted on the bell curve we used previously. If you normalize a player's scores so that the mean is 0 and the standard deviation is 1, the Z-score is the X coordinate of the point forming the curve.
We aren't going to go down that path, though, as I don't think it's terribly useful, and it would bring in a lot of unnecessary information that doesn't add much to our problem here. For now, let's just take the z-score for what it is: a one-stop shop for understanding how mathematically likely or unlikely a player's performance was given their body of work.
Relating Our Statistics Back To The Bell Curve
Now, just as a quick exercise to relate back to our earlier bell curve visualization, we are going to add one more column that groups our Z-scores into the buckets we established on the bell curve: within 1 standard deviation of the mean, between 1 and 2 standard deviations, and so on. Note that the assignments below run from the widest band to the narrowest on purpose, so each narrower bucket overwrites the wider ones (and anything beyond 4 standard deviations is left as NaN).
df.loc[abs(df['Z']) <= 4, 'whereBell'] = 4
df.loc[abs(df['Z']) <= 3, 'whereBell'] = 3
df.loc[abs(df['Z']) <= 2, 'whereBell'] = 2
df.loc[abs(df['Z']) <= 1, 'whereBell'] = 1
df.sort_values('whereBell', ascending=False).head(5)
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std | Z | whereBell
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
11695 | Duncan Robinson | MIA vs. ATL | 12/10/2019 | 43 | 34 | 3 | 4 | 0 | 1 | 0 | 46.6 | 20.439623 | MIA | 20.439623 | 8.244796 | 3.172956 | 4.0
5943 | Cody Zeller | CHA @ LAL | 10/27/2019 | 29 | 19 | 14 | 1 | 3 | 3 | 0 | 55.3 | 23.750000 | CHA | 23.750000 | 10.027355 | 3.146393 | 4.0
8145 | JaVale McGee | LAL vs. GSW | 11/13/2019 | 28 | 18 | 17 | 2 | 3 | 3 | 2 | 57.4 | 20.578431 | LAL | 20.578431 | 9.663091 | 3.810537 | 4.0
7434 | D'Angelo Russell | GSW @ MIN | 11/08/2019 | 40 | 52 | 9 | 5 | 3 | 2 | 3 | 82.3 | 38.160000 | GSW | 37.981818 | 13.449918 | 3.295052 | 4.0
10364 | Ben McLemore | HOU vs. ATL | 11/30/2019 | 34 | 24 | 13 | 1 | 1 | 0 | 1 | 43.1 | 16.598077 | HOU | 16.598077 | 8.460971 | 3.132255 | 4.0
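As an aside, pandas can do this same bucketing in one call with pd.cut. A minimal sketch, assuming the 'Z' column built above and writing to an illustrative separate column so the original stays untouched; note that pd.cut returns a categorical column, which can change groupby behavior downstream, so the .loc approach above is what the rest of this walkthrough uses:

# Bin |Z| into [0, 1], (1, 2], (2, 3], (3, 4] and label the buckets 1 through 4
df['whereBell2'] = pd.cut(df['Z'].abs(), bins=[0, 1, 2, 3, 4],
                          labels=[1, 2, 3, 4], include_lowest=True)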
Now, we are going to run a quick analysis to see how many instances actually fall into each bucket and how that compares to the bell curve numbers. Keep in mind that this is not a standardized dataset, so it likely won't match up exactly, but that doesn't mean we won't be able to learn something valuable from the data.
df2 = df[['Player', 'FP', 'whereBell']].groupby(['Player', 'whereBell']).count()
df2.head()
Player | whereBell | FP
---|---|---
Aaron Gordon | 1.0 | 32
 | 2.0 | 15
 | 3.0 | 2
Aaron Holiday | 1.0 | 30
 | 2.0 | 18
This is a multi-level dataframe. To make it a little easier to work with, we are going to 'unstack' it, which pivots the 'whereBell' index level out into columns.
df3 = df2.unstack()
df3.head()
Player | whereBell 1.0 | whereBell 2.0 | whereBell 3.0 | whereBell 4.0
---|---|---|---|---
Aaron Gordon | 32.0 | 15.0 | 2.0 | NaN
Aaron Holiday | 30.0 | 18.0 | 1.0 | NaN
Abdel Nader | 24.0 | 8.0 | 2.0 | NaN
Al Horford | 36.0 | 11.0 | 3.0 | NaN
Al-Farouq Aminu | 12.0 | 5.0 | NaN | NaN
Ah, much better. Now we can see how many games each player has in each bucket through roughly the first 50 games of the season. To take a big-picture look at the data, we are just going to run the simple .info() function and read the non-null counts for each bucket column.
df3.info()
<class 'pandas.core.frame.DataFrame'>
Index: 340 entries, Aaron Gordon to Zach LaVine
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (FP, 1.0) 340 non-null float64
1 (FP, 2.0) 339 non-null float64
2 (FP, 3.0) 285 non-null float64
3 (FP, 4.0) 34 non-null float64
dtypes: float64(4)
memory usage: 13.3+ KB
WOW. Across ~50 games for each team, with 8-ish players getting decent minutes each game, only 34 players had even a single game land between 3 and 4 standard deviations of their mean, and that's counting both the positive and negative directions. (One caveat: .info() reports non-null counts, so these are counts of players with at least one game in a bucket, not counts of individual games.) Let's do a quick breakdown of how the percentages per bucket work out compared to the bell curve percentages we looked at before. We're just going to total up the counts, then divide each bucket by that total and compare to the bell curve.
340+339+285+34
998
print(f'Within 1 STD of mean: {round(340/998*100,2)}%, bell curve = 68%')
print(f'Between 1 and 2 STD of mean: {round(339/998*100,2)}%, bell curve = 28%')
print(f'Between 2 and 3 STD of mean: {round(285/998*100,2)}%, bell curve = 4%')
print(f'Between 3 and 4 STD of mean: {round(34/998*100,2)}%, bell curve = 0.2%')
Within 1 STD of mean: 34.07%, bell curve = 68%
Between 1 and 2 STD of mean: 33.97%, bell curve = 28%
Between 2 and 3 STD of mean: 28.56%, bell curve = 4%
Between 3 and 4 STD of mean: 3.41%, bell curve = 0.2%
As we can see, the numbers are pretty far off across the board. That makes sense: this is not a standardized dataset, and on top of that we are comparing player counts rather than game counts, which overweights the rarer buckets. And that's OKAY; we can still learn quite a bit from this data. For instance, nearly 29% of players had at least one game land between 2 and 3 standard deviations of their mean, and that could be in a good way or a bad way, since we aren't separating positive from negative z-scores here.
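If you want the true per-game breakdown, counting games rather than players, value_counts gets it directly from the 'whereBell' column; I'll leave running it as an exercise:

# Share of all player-games landing in each bucket
print(df['whereBell'].value_counts(normalize=True).sort_index())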
Next up, we will do the exact same analysis for Aaron Gordon alone, and see if his numbers come closer to the bell curve breakdown than the league-wide ones did.
df2.head()
Player | whereBell | FP
---|---|---
Aaron Gordon | 1.0 | 32
 | 2.0 | 15
 | 3.0 | 2
Aaron Holiday | 1.0 | 30
 | 2.0 | 18
32+15+2
49
print(f'Within 1 STD of mean: {round(32/49*100,2)}%, bell curve = 68%')
print(f'Between 1 and 2 STD of mean: {round(15/49*100,2)}%, bell curve = 28%')
print(f'Between 2 and 3 STD of mean: {round(2/49*100,2)}%, bell curve = 4%')
print(f'Between 3 and 4 STD of mean: {round(0/49*100,2)}%, bell curve = 0.2%')
Within 1 STD of mean: 65.31%, bell curve = 68%
Between 1 and 2 STD of mean: 30.61%, bell curve = 28%
Between 2 and 3 STD of mean: 4.08%, bell curve = 4%
Between 3 and 4 STD of mean: 0.0%, bell curve = 0.2%
As we can see here, Aaron Gordon's numbers are MUCH closer to what we would 'expect' from a normal distribution (it helps that for a single player we are counting actual games, which is a fairer comparison to the bell curve than the league-wide player counts were). And you know what, that's not necessarily better or worse than being way off! The point of this exercise isn't to prove that the standard deviation/bell curve approach is the BEST or WORST way to help with lineup construction and the player pool, but rather to do some hands-on learning and understand HOW we can use it. In Aaron Gordon's case, we would need to investigate his stats a little further to figure out whether he fits better as a high-floor or high-ceiling player, and the first thing to look at is his actual standard deviation. If he has a fairly high standard deviation, then having about 65% of his games within 1 STD of his mean really doesn't give us that much information. Let's take a look and see.
df.sort_values('Player').head()
 | Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | SeasonAveOLD | Team | mean | std | Z | whereBell
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
8950 | Aaron Gordon | ORL @ TOR | 11/20/2019 | 12 | 2 | 2 | 0 | 0 | 0 | 0 | 4.4 | 29.914286 | ORL | 29.914286 | 9.884183 | -2.581325 | 3.0
2575 | Aaron Gordon | ORL vs. OKC | 01/22/2020 | 35 | 14 | 8 | 3 | 1 | 0 | 0 | 31.1 | 29.914286 | ORL | 29.914286 | 9.884183 | 0.119961 | 1.0
5751 | Aaron Gordon | ORL @ ATL | 10/26/2019 | 30 | 10 | 7 | 3 | 2 | 0 | 1 | 27.9 | 29.914286 | ORL | 29.914286 | 9.884183 | -0.203789 | 1.0
4247 | Aaron Gordon | ORL @ BOS | 02/05/2020 | 31 | 23 | 10 | 4 | 2 | 1 | 5 | 45.0 | 29.914286 | ORL | 29.914286 | 9.884183 | 1.526248 | 2.0
10705 | Aaron Gordon | ORL @ WAS | 12/03/2019 | 32 | 18 | 11 | 2 | 0 | 0 | 0 | 34.2 | 29.914286 | ORL | 29.914286 | 9.884183 | 0.433593 | 1.0
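If you'd rather pull his numbers directly instead of scanning a sorted head, a one-liner does it (same 'mean' and 'std' columns as above):

# Every Aaron Gordon row carries the same mean/std, so drop the duplicates
print(df.loc[df['Player'] == 'Aaron Gordon', ['mean', 'std']].drop_duplicates())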
In Conclusion
Now that you have a taste for digging into these statistics, that's going to be all for Part 2. We will dive deeper into how we can use this type of analysis for the player pool in Part 3! Do you think Aaron Gordon is better classified as a high-ceiling or high-floor player? Is 9.88 TOO high a standard deviation, or do you think that's pretty consistent? Let me know over in the YouTube comments.