Introduction to Pandas and Dataframes
Box Scores Dataset
Hey everybody welcome back! Today we’re going to be taking our first in-depth look at the pandas data analysis package for python. It’s going to be a pretty high level introductory video as we are not going to be doing anything terribly complex today, just dipping our toes in the water and learning the basics. I know that I’ve used pandas several times in previous videos without really explaining much about it other than the fact that you manage data with it. An easy way to think of pandas is that it is a more pythonic version of Microsoft Excel. You handle tabular data with it, can perform mathematical functions, break out different elements of your data into more manageable subsets, and visualize your data as well.
First things first, we are going to want to pull up the documentation for pandas to provide us with a strong foundation.
https://pandas.pydata.org/docs/
We are fortunate with pandas as they have VERY user friendly documentation.
What is pandas?
Pandas allows you to work within what’s known as dataframes. A dataframe is really just a table of information, very similar to an excel worksheet, the information stored can be text, numerical, formulas, etc. Your dataframe is broken into columns (Series), and rows. Each row will have an indexed value at the far left of your dataframe, the default is to use numerical indexes starting with 0, but you can redefine one of your existing columns as your index if you prefer. For example, you may want to have your data indexed by player, rather than an arbitrary numerical value to make it easier to navigate and slice your data up into more meaningful subsets.
What is pandas good for??
You may be watching and reading this thinking yeah but I’m already an excel whiz, I have all x,y,z processes set up in excel and I REALLY don’t want to migrate all of that to pandas, so why I should I understand it?? That is a VERY valid question, and at the end of the day, you should use whatever tools you can be the most efficient with. Personally, I like to use pandas over excel when possible, I find it more agile, efficient, and has a broader toolset to work with. However even at it’s most basic functionality, pandas is easily the best file conversion tool I have ever come across. Pandas can read in data from at least 19 different file types, probably more that’s just all I am aware of off the top of my head, and can export into any of those same data types. This functionality allows you to take data that doesn’t play nice with excel, and allows you to convert it into a nice tabular format and export straight into an .xlsx file format for future use. Incredibly convenient!
Okay, back to NBA stats!
To start using pandas we will need to read in our data using the pandas read function. There are quite a few optional parameters you can use as a part of this function, I recommend checking out the documentation for it if you are wanting a more targeted upload of data. Examples include specifying which columns to bring into pandas from the original source, auto-converting values to a specific data type, etc.
import pandas as pd
df = pd.read_excel(r"C:\Users\nfwya\Fanduel\BoxScores\20200808clean.xlsx")
Once we have the dataframe created and we can see our data is all here, there are a few functions within pandas you will want to get familiar with, especially for statistical analysis, but these can be very useful for dataset with varying data types as well. First up is the head and tail commands. These commands allow you to view the first x records in your dataframe using head, and the last x records in your dataframe using tail. This can be helpful if you are working with a large dataset and don’t want to have to render the entire thing each time you need to check and see the results of your work. An easy example of how this is beneficial is if you are sorting based on different variables, especially if it is a more complex sort. You can quickly check to make sure the rightful values are at the beginning and end of your sorted dataframe as a sanity check without viewing the entire thing.
df.head()
Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Timothe Luwawu-Cabarrot | BKN vs. SAC | 8/7/2020 | 27 | 10 | 6 | 0 | 0 | 0 | 0 |
1 | Lonzo Ball | NOP vs. WAS | 8/7/2020 | 31 | 2 | 9 | 5 | 1 | 3 | 3 |
2 | Caris LeVert | BKN vs. SAC | 8/7/2020 | 36 | 22 | 5 | 7 | 3 | 0 | 2 |
3 | Al Horford | PHI vs. ORL | 8/7/2020 | 30 | 21 | 9 | 2 | 0 | 1 | 1 |
4 | Terrence Ross | ORL @ PHI | 8/7/2020 | 26 | 0 | 3 | 2 | 2 | 1 | 0 |
Next up, you will want to get used to using the describe function. Checking the output of this function is often a first step I take in evaluating any numerical data source, as it can give me a single output that allows me to make sure that the number both make sense, and are what I was expecting. In addition to this, you can use the statistical values provided to aid in further analysis on a player by player or team by team level to give you a little more insight.
df.describe()
MIN | PTS | REB | AST | STL | BLK | TOV | |
---|---|---|---|---|---|---|---|
count | 19872.000000 | 19872.000000 | 19872.000000 | 19872.000000 | 19872.000000 | 19872.000000 | 19872.000000 |
mean | 24.548058 | 11.371679 | 4.548360 | 2.478915 | 0.777426 | 0.499396 | 1.407911 |
std | 8.836700 | 8.454256 | 3.461526 | 2.535612 | 0.974553 | 0.868078 | 1.434660 |
min | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 18.000000 | 5.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 25.000000 | 10.000000 | 4.000000 | 2.000000 | 1.000000 | 0.000000 | 1.000000 |
75% | 32.000000 | 16.000000 | 6.000000 | 4.000000 | 1.000000 | 1.000000 | 2.000000 |
max | 52.000000 | 61.000000 | 25.000000 | 19.000000 | 7.000000 | 10.000000 | 11.000000 |
Next up is the info function, the info function will give you a quick overview of the count of non-null values in that column, as well as the data type of that column. This isn’t terribly valuable right now, but if you are doing data prep or cleanup to feed into a model in the future, this will become an invaluable tool.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19872 entries, 0 to 19871
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Player 19872 non-null object
1 Match_Up 19872 non-null object
2 Game_Date 19872 non-null object
3 MIN 19872 non-null int64
4 PTS 19872 non-null int64
5 REB 19872 non-null int64
6 AST 19872 non-null int64
7 STL 19872 non-null int64
8 BLK 19872 non-null int64
9 TOV 19872 non-null int64
dtypes: int64(7), object(3)
memory usage: 1.5+ MB
In addition to these big picture analysis functions, it is also pretty easily to create a new dataframe of a subset of the data, for instance if you want to create a new datafame just to look at James Harden’s stats, you can simply define a new dataframe as a subset of the current dataframe as shown below:
HardenDF = df[df['Player'] == "James Harden"]
Now you can perform any functions as you performed on the full dataset, but just for James Harden by referencing the HardenDF dataframe instead of the df dataframe.
HardenDF.head()
Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | |
---|---|---|---|---|---|---|---|---|---|---|
143 | James Harden | HOU vs. LAL | 8/6/2020 | 36 | 39 | 8 | 12 | 3 | 1 | 10 |
410 | James Harden | HOU @ POR | 8/4/2020 | 33 | 23 | 6 | 9 | 0 | 0 | 4 |
629 | James Harden | HOU vs. MIL | 8/2/2020 | 34 | 24 | 7 | 7 | 6 | 0 | 3 |
907 | James Harden | HOU @ DAL | 7/31/2020 | 43 | 49 | 9 | 8 | 3 | 3 | 1 |
1141 | James Harden | HOU vs. ORL | 3/8/2020 | 32 | 23 | 7 | 6 | 3 | 0 | 4 |
HardenDF.describe()
MIN | PTS | REB | AST | STL | BLK | TOV | |
---|---|---|---|---|---|---|---|
count | 65.000000 | 65.000000 | 65.000000 | 65.000000 | 65.000000 | 65.000000 | 65.000000 |
mean | 36.676923 | 34.323077 | 6.430769 | 7.476923 | 1.815385 | 0.876923 | 4.476923 |
std | 4.362868 | 10.894534 | 2.783709 | 2.652285 | 1.309727 | 0.943857 | 2.278347 |
min | 26.000000 | 12.000000 | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 35.000000 | 28.000000 | 4.000000 | 6.000000 | 1.000000 | 0.000000 | 3.000000 |
50% | 37.000000 | 35.000000 | 6.000000 | 7.000000 | 2.000000 | 1.000000 | 4.000000 |
75% | 39.000000 | 41.000000 | 8.000000 | 9.000000 | 3.000000 | 1.000000 | 6.000000 |
max | 49.000000 | 60.000000 | 16.000000 | 14.000000 | 6.000000 | 3.000000 | 11.000000 |
It is also possible to create a new column and calculate a value based on the values in the existing dataframe. For instance if you wanted to add a column for fantasy points scored, you can create a new column, and add in the points variables for each statistic and sum it up. These scoring metrics were what I remembered off the top of my head while creating this video, so the points per stat may not be accurate. You can alter the below formula as necessary for your scoring purposes by adjusting the number values pretty easily though.
df['FP'] = df["PTS"]+1.2*df["REB"]+1.5*df["AST"]+3*df["STL"]+3*df["BLK"]-df["TOV"]
Now we can utilize the previously used head function to double check and make sure it worked as expected.
df.head()
Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Timothe Luwawu-Cabarrot | BKN vs. SAC | 8/7/2020 | 27 | 10 | 6 | 0 | 0 | 0 | 0 | 17.2 |
1 | Lonzo Ball | NOP vs. WAS | 8/7/2020 | 31 | 2 | 9 | 5 | 1 | 3 | 3 | 29.3 |
2 | Caris LeVert | BKN vs. SAC | 8/7/2020 | 36 | 22 | 5 | 7 | 3 | 0 | 2 | 45.5 |
3 | Al Horford | PHI vs. ORL | 8/7/2020 | 30 | 21 | 9 | 2 | 0 | 1 | 1 | 36.8 |
4 | Terrence Ross | ORL @ PHI | 8/7/2020 | 26 | 0 | 3 | 2 | 2 | 1 | 0 | 15.6 |
Now, very similarly to how we created the James Harden dataframe, we are going to create a new dataframe to include all instances where a player scored at least 50 fantasy points. Just like the other dataframes, all analysis functions can be performed against this dataframe as well.
df50 = df[df["FP"]>=50]
df50.describe()
MIN | PTS | REB | AST | STL | BLK | TOV | FP | |
---|---|---|---|---|---|---|---|---|
count | 879.000000 | 879.000000 | 879.000000 | 879.000000 | 879.000000 | 879.000000 | 879.000000 | 879.000000 |
mean | 36.043231 | 30.170648 | 9.874858 | 6.370876 | 1.742890 | 1.278726 | 2.754266 | 57.887372 |
std | 4.254484 | 8.473219 | 4.834650 | 3.719231 | 1.350733 | 1.456551 | 1.927774 | 7.127299 |
min | 20.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 |
25% | 34.000000 | 25.000000 | 6.000000 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 52.300000 |
50% | 36.000000 | 30.000000 | 10.000000 | 6.000000 | 2.000000 | 1.000000 | 2.000000 | 55.900000 |
75% | 39.000000 | 35.000000 | 13.000000 | 9.000000 | 3.000000 | 2.000000 | 4.000000 | 61.200000 |
max | 50.000000 | 61.000000 | 24.000000 | 18.000000 | 7.000000 | 10.000000 | 11.000000 | 88.800000 |
df50
Player | Match_Up | Game_Date | MIN | PTS | REB | AST | STL | BLK | TOV | FP | |
---|---|---|---|---|---|---|---|---|---|---|---|
123 | Michael Porter Jr. | DEN vs. POR | 8/6/2020 | 34 | 27 | 12 | 2 | 2 | 1 | 0 | 53.4 |
127 | Damian Lillard | POR @ DEN | 8/6/2020 | 41 | 45 | 4 | 12 | 3 | 0 | 2 | 74.8 |
139 | Deandre Ayton | PHX vs. IND | 8/6/2020 | 35 | 23 | 10 | 1 | 2 | 4 | 4 | 50.5 |
143 | James Harden | HOU vs. LAL | 8/6/2020 | 36 | 39 | 8 | 12 | 3 | 1 | 10 | 68.6 |
151 | Giannis Antetokounmpo | MIL vs. MIA | 8/6/2020 | 30 | 33 | 12 | 4 | 0 | 1 | 6 | 50.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19757 | Zach LaVine | CHI vs. IND | 1/10/2020 | 38 | 43 | 6 | 1 | 1 | 0 | 3 | 51.7 |
19766 | DeMar DeRozan | SAS @ MEM | 1/10/2020 | 38 | 36 | 9 | 9 | 1 | 0 | 1 | 62.3 |
19781 | Kawhi Leonard | LAC vs. GSW | 1/10/2020 | 38 | 36 | 9 | 5 | 3 | 1 | 5 | 61.3 |
19840 | Giannis Antetokounmpo | MIL vs. MIN | 1/1/2020 | 31 | 32 | 17 | 4 | 0 | 2 | 4 | 60.4 |
19851 | LeBron James | LAL vs. PHX | 1/1/2020 | 38 | 31 | 13 | 12 | 2 | 1 | 5 | 68.6 |
879 rows × 11 columns
That’ll do it for this tutorial. Remember to check out the corresponding youtube channel for more Daily Fantasy content, and subscribe to keep up-to-date with the most recent videos!