Analyzing Data with Pandas
Today we are going to be introducing how to calculate and use rolling averages in our Pandas Dataframes. To do this we are going to need a large dataset of box scores over at least a few weeks worth of games, if you do not have a dataset to accomplish this, go ahead and check out my webscraping tutorials for instructions on how to obtain one.
Now that we have a dataset to use, I will be using a dataset containing all NBA games from the 2019-2020 season up until 02/13/2020. You don’t have to use the same dataset, but just be aware if you are using a different one then the numbers shown here will likely differ from your own.
Okay then, first things first we need to import pandas and load our dataset into a dataframe.
import pandas as pd
df = pd.read_excel(r"insert your dataset file here")
Next we are just going to use the head function to take a quick peek at our data and make sure it looks fine. Also, this is a dataset we previously used in a Pandas tutorial, so the fanduel fantasy points scored are already incorporated. If you are unsure about how to perform this calculation, please check out my previous Pandas tutorials for instructions on how to calculate this.
df.head()
Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Terrance Ferguson | OKC @ NOP | 02/13/2020 | 20 | 3 | 1 | 2 | 1 | 0 | 0 | 10.2 |
1 | Danilo Gallinari | OKC @ NOP | 02/13/2020 | 33 | 29 | 2 | 1 | 2 | 0 | 2 | 36.9 |
2 | Gordon Hayward | BOS vs. LAC | 02/13/2020 | 48 | 21 | 13 | 4 | 1 | 1 | 0 | 48.6 |
3 | Enes Kanter | BOS vs. LAC | 02/13/2020 | 20 | 8 | 8 | 2 | 0 | 1 | 2 | 21.6 |
4 | Jaxson Hayes | NOP vs. OKC | 02/13/2020 | 8 | 4 | 2 | 3 | 0 | 0 | 0 | 10.9 |
Now that we’ve taken a look at the data, we are going to need to convert the Game_Date field to an actual datetime data type for Pandas to be able to sort the games by date properly. Pandas by default reads this field in as a generic string value, so when we try to sort the games to get a proper rolling average, the games will just be sorted January-December or December-January rather than October-February.
df['Game_Date'] = df['Game_Date'].astype('datetime64')
Due to the nature of the rolling function, we will want to resort our data first by player name, then by game date in order to get the correct results. This is because the rolling function will look to the index value to ‘order’ the player fp scores, so even if we are looking at the games in chronological order, if the index contradicts this, then our rolling average window will be inaccurate. To accomplish this we will make use of the ‘ignore_index’ parameter of the sort function. This will sort the data as we see fit, then reassign the index values based on this sorted function.
df = df.sort_values(['Player', 'Game_Date'], ascending= [True, True], ignore_index=True)
And once again, we will utilize the head function to take a peak at our data to make sure it is making sense, and doing what we expect it to do.
df.head(15)
Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aaron Gordon | ORL vs. CLE | 2019-10-23 | 29 | 9 | 2 | 3 | 2 | 1 | 1 | 23.9 |
1 | Aaron Gordon | ORL @ ATL | 2019-10-26 | 30 | 10 | 7 | 3 | 2 | 0 | 1 | 27.9 |
2 | Aaron Gordon | ORL @ TOR | 2019-10-28 | 30 | 2 | 5 | 5 | 0 | 2 | 3 | 18.5 |
3 | Aaron Gordon | ORL vs. NYK | 2019-10-30 | 28 | 15 | 11 | 1 | 1 | 1 | 0 | 35.7 |
4 | Aaron Gordon | ORL vs. MIL | 2019-11-01 | 28 | 11 | 4 | 3 | 0 | 0 | 1 | 19.3 |
5 | Aaron Gordon | ORL vs. DEN | 2019-11-02 | 32 | 21 | 9 | 3 | 0 | 0 | 0 | 36.3 |
6 | Aaron Gordon | ORL @ OKC | 2019-11-05 | 29 | 15 | 7 | 3 | 0 | 0 | 2 | 25.9 |
7 | Aaron Gordon | ORL @ DAL | 2019-11-06 | 37 | 23 | 4 | 1 | 2 | 0 | 4 | 31.3 |
8 | Aaron Gordon | ORL vs. MEM | 2019-11-08 | 32 | 17 | 4 | 2 | 1 | 0 | 2 | 25.8 |
9 | Aaron Gordon | ORL vs. IND | 2019-11-10 | 30 | 13 | 2 | 3 | 0 | 1 | 1 | 21.9 |
10 | Aaron Gordon | ORL vs. PHI | 2019-11-13 | 36 | 18 | 13 | 7 | 1 | 1 | 2 | 48.1 |
11 | Aaron Gordon | ORL vs. SAS | 2019-11-15 | 36 | 14 | 8 | 4 | 0 | 1 | 2 | 30.6 |
12 | Aaron Gordon | ORL vs. WAS | 2019-11-17 | 37 | 13 | 9 | 2 | 0 | 0 | 0 | 26.8 |
13 | Aaron Gordon | ORL @ TOR | 2019-11-20 | 12 | 2 | 2 | 0 | 0 | 0 | 0 | 4.4 |
14 | Aaron Gordon | ORL vs. TOR | 2019-11-29 | 32 | 8 | 8 | 2 | 2 | 1 | 0 | 29.6 |
Now we can see that the sort function worked as expected, with the data being sorted first by player name, A->Z, then by game date, Old->New.
We are ready now to employ the rolling function. To do this we are going to build a little more complex of a command to accomplish this. First off we need to utilize the groupby function to group the boxscores by player to ensure we are only taking into account games that each player played in their respective rolling average calculations, the we will add the rolling function, specifying the number of games we want to incorporate as the window parameter, and the minimum periods necessary to generate a score. I personally like to use a min_period of 1 so that there is always a value, even if the player has played less than 3 games in the time frame being used in the dataset. Next we direct the rolling function to the FP column, and tell the function that the mean is what we want calculated in the rolling function. After this, we can reset the index and drop the extra ‘player’ column from grouping the data.
df['Last3']= df.groupby('Player',sort=False).rolling(window=3, min_periods=1).FP.mean().reset_index(drop='Player')
Once the rolling average for the past 3 games has been calculated we can go ahead and re-sort the data by game date from newest to oldest, set the index back to player name and take a look at the results using the head function.
df = df.sort_values('Game_Date', ascending=False)
df = df.set_index('Player')
df.head(15)
Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | Last3 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Player | |||||||||||
Jaxson Hayes | NOP vs. OKC | 2020-02-13 | 8 | 4 | 2 | 3 | 0 | 0 | 0 | 10.9 | 8.633333 |
Derrick Favors | NOP vs. OKC | 2020-02-13 | 20 | 8 | 9 | 2 | 0 | 0 | 1 | 20.8 | 23.466667 |
Lou Williams | LAC @ BOS | 2020-02-13 | 44 | 35 | 6 | 8 | 1 | 2 | 3 | 60.2 | 42.033333 |
Terrance Ferguson | OKC @ NOP | 2020-02-13 | 20 | 3 | 1 | 2 | 1 | 0 | 0 | 10.2 | 5.200000 |
Ivica Zubac | LAC @ BOS | 2020-02-13 | 14 | 7 | 4 | 1 | 0 | 1 | 1 | 15.3 | 22.566667 |
Nerlens Noel | OKC @ NOP | 2020-02-13 | 11 | 2 | 3 | 0 | 0 | 0 | 0 | 5.6 | 6.600000 |
Steven Adams | OKC @ NOP | 2020-02-13 | 28 | 11 | 11 | 2 | 1 | 3 | 1 | 38.2 | 31.800000 |
Kemba Walker | BOS vs. LAC | 2020-02-13 | 46 | 19 | 9 | 7 | 0 | 0 | 1 | 39.3 | 32.133333 |
Nicolo Melli | NOP vs. OKC | 2020-02-13 | 24 | 6 | 3 | 0 | 0 | 3 | 2 | 16.6 | 21.533333 |
E'Twaun Moore | NOP vs. OKC | 2020-02-13 | 20 | 2 | 4 | 2 | 1 | 1 | 1 | 14.8 | 12.600000 |
Danilo Gallinari | OKC @ NOP | 2020-02-13 | 33 | 29 | 2 | 1 | 2 | 0 | 2 | 36.9 | 32.500000 |
Josh Hart | NOP vs. OKC | 2020-02-13 | 27 | 7 | 8 | 3 | 0 | 0 | 1 | 20.1 | 23.566667 |
JJ Redick | NOP vs. OKC | 2020-02-13 | 31 | 24 | 2 | 3 | 2 | 0 | 2 | 34.9 | 33.033333 |
Kawhi Leonard | LAC @ BOS | 2020-02-13 | 46 | 28 | 11 | 4 | 1 | 1 | 3 | 50.2 | 48.066667 |
Daniel Theis | BOS vs. LAC | 2020-02-13 | 24 | 12 | 5 | 0 | 1 | 2 | 2 | 25.0 | 28.333333 |
Now that we can see the calculated rolling averages, lets go ahead and check a player on the list here, since it’s hard to verify without seeing multiple games from a single player. I will be choosing Terrance Ferguson from the above table for demonstration. You can use the .loc function to grab a subset of data from the dataframe.
dfT = df.loc['Terrance Ferguson']
dfT.head(15)
Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | Last3 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Player | |||||||||||
Terrance Ferguson | OKC @ NOP | 2020-02-13 | 20 | 3 | 1 | 2 | 1 | 0 | 0 | 10.2 | 5.200000 |
Terrance Ferguson | OKC vs. SAS | 2020-02-11 | 18 | 0 | 0 | 1 | 0 | 0 | 0 | 1.5 | 4.933333 |
Terrance Ferguson | OKC vs. BOS | 2020-02-09 | 26 | 0 | 2 | 1 | 0 | 0 | 0 | 3.9 | 4.833333 |
Terrance Ferguson | OKC vs. DET | 2020-02-07 | 14 | 1 | 2 | 2 | 1 | 0 | 0 | 9.4 | 3.533333 |
Terrance Ferguson | OKC vs. CLE | 2020-02-05 | 11 | 0 | 1 | 0 | 0 | 0 | 0 | 1.2 | 5.233333 |
Terrance Ferguson | OKC vs. MIA | 2020-01-17 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 11.200000 |
Terrance Ferguson | OKC vs. TOR | 2020-01-15 | 21 | 2 | 0 | 1 | 3 | 1 | 1 | 14.5 | 14.600000 |
Terrance Ferguson | OKC @ MIN | 2020-01-13 | 28 | 14 | 3 | 1 | 0 | 0 | 0 | 19.1 | 14.166667 |
Terrance Ferguson | OKC vs. LAL | 2020-01-11 | 24 | 0 | 1 | 2 | 1 | 1 | 0 | 10.2 | 11.300000 |
Terrance Ferguson | OKC vs. HOU | 2020-01-09 | 22 | 6 | 1 | 2 | 0 | 2 | 3 | 13.2 | 10.900000 |
Terrance Ferguson | OKC @ BKN | 2020-01-07 | 40 | 7 | 0 | 1 | 0 | 1 | 1 | 10.5 | 11.400000 |
Terrance Ferguson | OKC @ PHI | 2020-01-06 | 21 | 6 | 0 | 0 | 1 | 0 | 0 | 9.0 | 11.966667 |
Terrance Ferguson | OKC @ CLE | 2020-01-04 | 26 | 13 | 1 | 1 | 0 | 0 | 1 | 14.7 | 9.366667 |
Terrance Ferguson | OKC @ SAS | 2020-01-02 | 29 | 2 | 1 | 0 | 2 | 1 | 0 | 12.2 | 5.833333 |
Terrance Ferguson | OKC vs. DAL | 2019-12-31 | 18 | 0 | 1 | 0 | 0 | 0 | 0 | 1.2 | 3.766667 |
Reviewing the most recent 15 games the calculation appears to have worked accordingly, but lets double check the first 15 games as well to see if the min_periods=1 parameter worked as well.
dfT.tail(15)
Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | Last3 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Player | |||||||||||
Terrance Ferguson | OKC @ POR | 2019-11-27 | 25 | 9 | 1 | 0 | 2 | 0 | 1 | 15.2 | 14.866667 |
Terrance Ferguson | OKC @ GSW | 2019-11-25 | 27 | 6 | 0 | 1 | 0 | 2 | 0 | 13.5 | 12.700000 |
Terrance Ferguson | OKC vs. LAL | 2019-11-22 | 32 | 11 | 2 | 3 | 0 | 0 | 2 | 15.9 | 14.333333 |
Terrance Ferguson | OKC @ LAL | 2019-11-19 | 31 | 3 | 1 | 1 | 1 | 0 | 0 | 8.7 | 17.766667 |
Terrance Ferguson | OKC @ LAC | 2019-11-18 | 35 | 5 | 7 | 0 | 2 | 0 | 1 | 18.4 | 18.433333 |
Terrance Ferguson | OKC vs. PHI | 2019-11-15 | 39 | 19 | 1 | 2 | 1 | 0 | 0 | 26.2 | 16.433333 |
Terrance Ferguson | OKC vs. MIL | 2019-11-10 | 24 | 5 | 1 | 1 | 0 | 1 | 0 | 10.7 | 11.900000 |
Terrance Ferguson | OKC vs. GSW | 2019-11-09 | 20 | 8 | 2 | 2 | 0 | 0 | 1 | 12.4 | 10.500000 |
Terrance Ferguson | OKC @ SAS | 2019-11-07 | 21 | 6 | 3 | 0 | 0 | 1 | 0 | 12.6 | 9.900000 |
Terrance Ferguson | OKC vs. ORL | 2019-11-05 | 23 | 5 | 0 | 1 | 0 | 0 | 0 | 6.5 | 8.666667 |
Terrance Ferguson | OKC vs. POR | 2019-10-30 | 35 | 2 | 3 | 0 | 1 | 1 | 1 | 10.6 | 13.300000 |
Terrance Ferguson | OKC @ HOU | 2019-10-28 | 19 | 0 | 2 | 1 | 2 | 0 | 1 | 8.9 | 12.033333 |
Terrance Ferguson | OKC vs. GSW | 2019-10-27 | 30 | 13 | 2 | 4 | 0 | 0 | 1 | 20.4 | 10.700000 |
Terrance Ferguson | OKC vs. WAS | 2019-10-25 | 22 | 0 | 4 | 0 | 0 | 1 | 1 | 6.8 | 5.850000 |
Terrance Ferguson | OKC @ UTA | 2019-10-23 | 24 | 2 | 2 | 1 | 0 | 0 | 1 | 4.9 | 4.900000 |
Now looking at the first 15 games, everything seems to be working as expected, with the first game scoring the same as the Last3 variable as it should since it was the only game in that window, then the second game being an average of the first 2, and finally the 3 game average beginning after the 3rd game of the season.