Analyzing Data with Pandas

Code as .ipnyb

Box Scores Dataset

Today we are going to be introducing how to calculate and use rolling averages in our Pandas Dataframes. To do this we are going to need a large dataset of box scores over at least a few weeks worth of games, if you do not have a dataset to accomplish this, go ahead and check out my webscraping tutorials for instructions on how to obtain one.

Web Scraping Playlist

Now that we have a dataset to use, I will be using a dataset containing all NBA games from the 2019-2020 season up until 02/13/2020. You don’t have to use the same dataset, but just be aware if you are using a different one then the numbers shown here will likely differ from your own.

Okay then, first things first we need to import pandas and load our dataset into a dataframe.

import pandas as pd
df = pd.read_excel(r"insert your dataset file here")

Next we are just going to use the head function to take a quick peek at our data and make sure it looks fine. Also, this is a dataset we previously used in a Pandas tutorial, so the fanduel fantasy points scored are already incorporated. If you are unsure about how to perform this calculation, please check out my previous Pandas tutorials for instructions on how to calculate this.

Previous Pandas Tutorial

df.head()
Player Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP
0 Terrance Ferguson OKC @ NOP 02/13/2020 20 3 1 2 1 0 0 10.2
1 Danilo Gallinari OKC @ NOP 02/13/2020 33 29 2 1 2 0 2 36.9
2 Gordon Hayward BOS vs. LAC 02/13/2020 48 21 13 4 1 1 0 48.6
3 Enes Kanter BOS vs. LAC 02/13/2020 20 8 8 2 0 1 2 21.6
4 Jaxson Hayes NOP vs. OKC 02/13/2020 8 4 2 3 0 0 0 10.9

Now that we’ve taken a look at the data, we are going to need to convert the Game_Date field to an actual datetime data type for Pandas to be able to sort the games by date properly. Pandas by default reads this field in as a generic string value, so when we try to sort the games to get a proper rolling average, the games will just be sorted January-December or December-January rather than October-February.

df['Game_Date'] = df['Game_Date'].astype('datetime64')

Due to the nature of the rolling function, we will want to resort our data first by player name, then by game date in order to get the correct results. This is because the rolling function will look to the index value to ‘order’ the player fp scores, so even if we are looking at the games in chronological order, if the index contradicts this, then our rolling average window will be inaccurate. To accomplish this we will make use of the ‘ignore_index’ parameter of the sort function. This will sort the data as we see fit, then reassign the index values based on this sorted function.

df = df.sort_values(['Player', 'Game_Date'], ascending= [True, True], ignore_index=True)

And once again, we will utilize the head function to take a peak at our data to make sure it is making sense, and doing what we expect it to do.

df.head(15)
Player Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP
0 Aaron Gordon ORL vs. CLE 2019-10-23 29 9 2 3 2 1 1 23.9
1 Aaron Gordon ORL @ ATL 2019-10-26 30 10 7 3 2 0 1 27.9
2 Aaron Gordon ORL @ TOR 2019-10-28 30 2 5 5 0 2 3 18.5
3 Aaron Gordon ORL vs. NYK 2019-10-30 28 15 11 1 1 1 0 35.7
4 Aaron Gordon ORL vs. MIL 2019-11-01 28 11 4 3 0 0 1 19.3
5 Aaron Gordon ORL vs. DEN 2019-11-02 32 21 9 3 0 0 0 36.3
6 Aaron Gordon ORL @ OKC 2019-11-05 29 15 7 3 0 0 2 25.9
7 Aaron Gordon ORL @ DAL 2019-11-06 37 23 4 1 2 0 4 31.3
8 Aaron Gordon ORL vs. MEM 2019-11-08 32 17 4 2 1 0 2 25.8
9 Aaron Gordon ORL vs. IND 2019-11-10 30 13 2 3 0 1 1 21.9
10 Aaron Gordon ORL vs. PHI 2019-11-13 36 18 13 7 1 1 2 48.1
11 Aaron Gordon ORL vs. SAS 2019-11-15 36 14 8 4 0 1 2 30.6
12 Aaron Gordon ORL vs. WAS 2019-11-17 37 13 9 2 0 0 0 26.8
13 Aaron Gordon ORL @ TOR 2019-11-20 12 2 2 0 0 0 0 4.4
14 Aaron Gordon ORL vs. TOR 2019-11-29 32 8 8 2 2 1 0 29.6

Now we can see that the sort function worked as expected, with the data being sorted first by player name, A->Z, then by game date, Old->New.

We are ready now to employ the rolling function. To do this we are going to build a little more complex of a command to accomplish this. First off we need to utilize the groupby function to group the boxscores by player to ensure we are only taking into account games that each player played in their respective rolling average calculations, the we will add the rolling function, specifying the number of games we want to incorporate as the window parameter, and the minimum periods necessary to generate a score. I personally like to use a min_period of 1 so that there is always a value, even if the player has played less than 3 games in the time frame being used in the dataset. Next we direct the rolling function to the FP column, and tell the function that the mean is what we want calculated in the rolling function. After this, we can reset the index and drop the extra ‘player’ column from grouping the data.

df['Last3']= df.groupby('Player',sort=False).rolling(window=3, min_periods=1).FP.mean().reset_index(drop='Player')

Once the rolling average for the past 3 games has been calculated we can go ahead and re-sort the data by game date from newest to oldest, set the index back to player name and take a look at the results using the head function.

df = df.sort_values('Game_Date', ascending=False)
df = df.set_index('Player')
df.head(15)
Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP Last3
Player
Jaxson Hayes NOP vs. OKC 2020-02-13 8 4 2 3 0 0 0 10.9 8.633333
Derrick Favors NOP vs. OKC 2020-02-13 20 8 9 2 0 0 1 20.8 23.466667
Lou Williams LAC @ BOS 2020-02-13 44 35 6 8 1 2 3 60.2 42.033333
Terrance Ferguson OKC @ NOP 2020-02-13 20 3 1 2 1 0 0 10.2 5.200000
Ivica Zubac LAC @ BOS 2020-02-13 14 7 4 1 0 1 1 15.3 22.566667
Nerlens Noel OKC @ NOP 2020-02-13 11 2 3 0 0 0 0 5.6 6.600000
Steven Adams OKC @ NOP 2020-02-13 28 11 11 2 1 3 1 38.2 31.800000
Kemba Walker BOS vs. LAC 2020-02-13 46 19 9 7 0 0 1 39.3 32.133333
Nicolo Melli NOP vs. OKC 2020-02-13 24 6 3 0 0 3 2 16.6 21.533333
E'Twaun Moore NOP vs. OKC 2020-02-13 20 2 4 2 1 1 1 14.8 12.600000
Danilo Gallinari OKC @ NOP 2020-02-13 33 29 2 1 2 0 2 36.9 32.500000
Josh Hart NOP vs. OKC 2020-02-13 27 7 8 3 0 0 1 20.1 23.566667
JJ Redick NOP vs. OKC 2020-02-13 31 24 2 3 2 0 2 34.9 33.033333
Kawhi Leonard LAC @ BOS 2020-02-13 46 28 11 4 1 1 3 50.2 48.066667
Daniel Theis BOS vs. LAC 2020-02-13 24 12 5 0 1 2 2 25.0 28.333333

Now that we can see the calculated rolling averages, lets go ahead and check a player on the list here, since it’s hard to verify without seeing multiple games from a single player. I will be choosing Terrance Ferguson from the above table for demonstration. You can use the .loc function to grab a subset of data from the dataframe.

dfT = df.loc['Terrance Ferguson']
dfT.head(15)
Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP Last3
Player
Terrance Ferguson OKC @ NOP 2020-02-13 20 3 1 2 1 0 0 10.2 5.200000
Terrance Ferguson OKC vs. SAS 2020-02-11 18 0 0 1 0 0 0 1.5 4.933333
Terrance Ferguson OKC vs. BOS 2020-02-09 26 0 2 1 0 0 0 3.9 4.833333
Terrance Ferguson OKC vs. DET 2020-02-07 14 1 2 2 1 0 0 9.4 3.533333
Terrance Ferguson OKC vs. CLE 2020-02-05 11 0 1 0 0 0 0 1.2 5.233333
Terrance Ferguson OKC vs. MIA 2020-01-17 7 0 0 0 0 0 0 0.0 11.200000
Terrance Ferguson OKC vs. TOR 2020-01-15 21 2 0 1 3 1 1 14.5 14.600000
Terrance Ferguson OKC @ MIN 2020-01-13 28 14 3 1 0 0 0 19.1 14.166667
Terrance Ferguson OKC vs. LAL 2020-01-11 24 0 1 2 1 1 0 10.2 11.300000
Terrance Ferguson OKC vs. HOU 2020-01-09 22 6 1 2 0 2 3 13.2 10.900000
Terrance Ferguson OKC @ BKN 2020-01-07 40 7 0 1 0 1 1 10.5 11.400000
Terrance Ferguson OKC @ PHI 2020-01-06 21 6 0 0 1 0 0 9.0 11.966667
Terrance Ferguson OKC @ CLE 2020-01-04 26 13 1 1 0 0 1 14.7 9.366667
Terrance Ferguson OKC @ SAS 2020-01-02 29 2 1 0 2 1 0 12.2 5.833333
Terrance Ferguson OKC vs. DAL 2019-12-31 18 0 1 0 0 0 0 1.2 3.766667

Reviewing the most recent 15 games the calculation appears to have worked accordingly, but lets double check the first 15 games as well to see if the min_periods=1 parameter worked as well.

dfT.tail(15)
Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP Last3
Player
Terrance Ferguson OKC @ POR 2019-11-27 25 9 1 0 2 0 1 15.2 14.866667
Terrance Ferguson OKC @ GSW 2019-11-25 27 6 0 1 0 2 0 13.5 12.700000
Terrance Ferguson OKC vs. LAL 2019-11-22 32 11 2 3 0 0 2 15.9 14.333333
Terrance Ferguson OKC @ LAL 2019-11-19 31 3 1 1 1 0 0 8.7 17.766667
Terrance Ferguson OKC @ LAC 2019-11-18 35 5 7 0 2 0 1 18.4 18.433333
Terrance Ferguson OKC vs. PHI 2019-11-15 39 19 1 2 1 0 0 26.2 16.433333
Terrance Ferguson OKC vs. MIL 2019-11-10 24 5 1 1 0 1 0 10.7 11.900000
Terrance Ferguson OKC vs. GSW 2019-11-09 20 8 2 2 0 0 1 12.4 10.500000
Terrance Ferguson OKC @ SAS 2019-11-07 21 6 3 0 0 1 0 12.6 9.900000
Terrance Ferguson OKC vs. ORL 2019-11-05 23 5 0 1 0 0 0 6.5 8.666667
Terrance Ferguson OKC vs. POR 2019-10-30 35 2 3 0 1 1 1 10.6 13.300000
Terrance Ferguson OKC @ HOU 2019-10-28 19 0 2 1 2 0 1 8.9 12.033333
Terrance Ferguson OKC vs. GSW 2019-10-27 30 13 2 4 0 0 1 20.4 10.700000
Terrance Ferguson OKC vs. WAS 2019-10-25 22 0 4 0 0 1 1 6.8 5.850000
Terrance Ferguson OKC @ UTA 2019-10-23 24 2 2 1 0 0 1 4.9 4.900000

Now looking at the first 15 games, everything seems to be working as expected, with the first game scoring the same as the Last3 variable as it should since it was the only game in that window, then the second game being an average of the first 2, and finally the 3 game average beginning after the 3rd game of the season.

Previous
Previous

High Ceilings And Floors Part One

Next
Next

Introduction to Pandas and Dataframes