Introduction to Pandas and Dataframes

Code as .ipynb

Dataset Used

Box Scores Dataset

Hey everybody welcome back! Today we’re going to be taking our first in-depth look at the pandas data analysis package for python. It’s going to be a pretty high level introductory video as we are not going to be doing anything terribly complex today, just dipping our toes in the water and learning the basics. I know that I’ve used pandas several times in previous videos without really explaining much about it other than the fact that you manage data with it. An easy way to think of pandas is that it is a more pythonic version of Microsoft Excel. You handle tabular data with it, can perform mathematical functions, break out different elements of your data into more manageable subsets, and visualize your data as well.

First things first, we are going to want to pull up the documentation for pandas to provide us with a strong foundation.

https://pandas.pydata.org/docs/

We are fortunate with pandas as they have VERY user friendly documentation.

What is pandas?

Pandas allows you to work within what’s known as dataframes. A dataframe is really just a table of information, very similar to an excel worksheet, the information stored can be text, numerical, formulas, etc. Your dataframe is broken into columns (Series), and rows. Each row will have an indexed value at the far left of your dataframe, the default is to use numerical indexes starting with 0, but you can redefine one of your existing columns as your index if you prefer. For example, you may want to have your data indexed by player, rather than an arbitrary numerical value to make it easier to navigate and slice your data up into more meaningful subsets.

What is pandas good for??

You may be watching and reading this thinking yeah but I’m already an excel whiz, I have all x,y,z processes set up in excel and I REALLY don’t want to migrate all of that to pandas, so why I should I understand it?? That is a VERY valid question, and at the end of the day, you should use whatever tools you can be the most efficient with. Personally, I like to use pandas over excel when possible, I find it more agile, efficient, and has a broader toolset to work with. However even at it’s most basic functionality, pandas is easily the best file conversion tool I have ever come across. Pandas can read in data from at least 19 different file types, probably more that’s just all I am aware of off the top of my head, and can export into any of those same data types. This functionality allows you to take data that doesn’t play nice with excel, and allows you to convert it into a nice tabular format and export straight into an .xlsx file format for future use. Incredibly convenient!

Okay, back to NBA stats!

To start using pandas we will need to read in our data using the pandas read function. There are quite a few optional parameters you can use as a part of this function, I recommend checking out the documentation for it if you are wanting a more targeted upload of data. Examples include specifying which columns to bring into pandas from the original source, auto-converting values to a specific data type, etc.

import pandas as pd
df = pd.read_excel(r"C:\Users\nfwya\Fanduel\BoxScores\20200808clean.xlsx")

Once we have the dataframe created and we can see our data is all here, there are a few functions within pandas you will want to get familiar with, especially for statistical analysis, but these can be very useful for dataset with varying data types as well. First up is the head and tail commands. These commands allow you to view the first x records in your dataframe using head, and the last x records in your dataframe using tail. This can be helpful if you are working with a large dataset and don’t want to have to render the entire thing each time you need to check and see the results of your work. An easy example of how this is beneficial is if you are sorting based on different variables, especially if it is a more complex sort. You can quickly check to make sure the rightful values are at the beginning and end of your sorted dataframe as a sanity check without viewing the entire thing.

df.head()
Player Match_Up Game_Date MIN PTS REB AST STL BLK TOV
0 Timothe Luwawu-Cabarrot BKN vs. SAC 8/7/2020 27 10 6 0 0 0 0
1 Lonzo Ball NOP vs. WAS 8/7/2020 31 2 9 5 1 3 3
2 Caris LeVert BKN vs. SAC 8/7/2020 36 22 5 7 3 0 2
3 Al Horford PHI vs. ORL 8/7/2020 30 21 9 2 0 1 1
4 Terrence Ross ORL @ PHI 8/7/2020 26 0 3 2 2 1 0

Next up, you will want to get used to using the describe function. Checking the output of this function is often a first step I take in evaluating any numerical data source, as it can give me a single output that allows me to make sure that the number both make sense, and are what I was expecting. In addition to this, you can use the statistical values provided to aid in further analysis on a player by player or team by team level to give you a little more insight.

df.describe()
MIN PTS REB AST STL BLK TOV
count 19872.000000 19872.000000 19872.000000 19872.000000 19872.000000 19872.000000 19872.000000
mean 24.548058 11.371679 4.548360 2.478915 0.777426 0.499396 1.407911
std 8.836700 8.454256 3.461526 2.535612 0.974553 0.868078 1.434660
min 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 18.000000 5.000000 2.000000 1.000000 0.000000 0.000000 0.000000
50% 25.000000 10.000000 4.000000 2.000000 1.000000 0.000000 1.000000
75% 32.000000 16.000000 6.000000 4.000000 1.000000 1.000000 2.000000
max 52.000000 61.000000 25.000000 19.000000 7.000000 10.000000 11.000000

Next up is the info function, the info function will give you a quick overview of the count of non-null values in that column, as well as the data type of that column. This isn’t terribly valuable right now, but if you are doing data prep or cleanup to feed into a model in the future, this will become an invaluable tool.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19872 entries, 0 to 19871
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Player     19872 non-null  object
 1   Match_Up   19872 non-null  object
 2   Game_Date  19872 non-null  object
 3   MIN        19872 non-null  int64 
 4   PTS        19872 non-null  int64 
 5   REB        19872 non-null  int64 
 6   AST        19872 non-null  int64 
 7   STL        19872 non-null  int64 
 8   BLK        19872 non-null  int64 
 9   TOV        19872 non-null  int64 
dtypes: int64(7), object(3)
memory usage: 1.5+ MB

In addition to these big picture analysis functions, it is also pretty easily to create a new dataframe of a subset of the data, for instance if you want to create a new datafame just to look at James Harden’s stats, you can simply define a new dataframe as a subset of the current dataframe as shown below:

HardenDF = df[df['Player'] == "James Harden"]

Now you can perform any functions as you performed on the full dataset, but just for James Harden by referencing the HardenDF dataframe instead of the df dataframe.

HardenDF.head()
Player Match_Up Game_Date MIN PTS REB AST STL BLK TOV
143 James Harden HOU vs. LAL 8/6/2020 36 39 8 12 3 1 10
410 James Harden HOU @ POR 8/4/2020 33 23 6 9 0 0 4
629 James Harden HOU vs. MIL 8/2/2020 34 24 7 7 6 0 3
907 James Harden HOU @ DAL 7/31/2020 43 49 9 8 3 3 1
1141 James Harden HOU vs. ORL 3/8/2020 32 23 7 6 3 0 4
HardenDF.describe()
MIN PTS REB AST STL BLK TOV
count 65.000000 65.000000 65.000000 65.000000 65.000000 65.000000 65.000000
mean 36.676923 34.323077 6.430769 7.476923 1.815385 0.876923 4.476923
std 4.362868 10.894534 2.783709 2.652285 1.309727 0.943857 2.278347
min 26.000000 12.000000 1.000000 3.000000 0.000000 0.000000 0.000000
25% 35.000000 28.000000 4.000000 6.000000 1.000000 0.000000 3.000000
50% 37.000000 35.000000 6.000000 7.000000 2.000000 1.000000 4.000000
75% 39.000000 41.000000 8.000000 9.000000 3.000000 1.000000 6.000000
max 49.000000 60.000000 16.000000 14.000000 6.000000 3.000000 11.000000

It is also possible to create a new column and calculate a value based on the values in the existing dataframe. For instance if you wanted to add a column for fantasy points scored, you can create a new column, and add in the points variables for each statistic and sum it up. These scoring metrics were what I remembered off the top of my head while creating this video, so the points per stat may not be accurate. You can alter the below formula as necessary for your scoring purposes by adjusting the number values pretty easily though.

df['FP'] = df["PTS"]+1.2*df["REB"]+1.5*df["AST"]+3*df["STL"]+3*df["BLK"]-df["TOV"]

Now we can utilize the previously used head function to double check and make sure it worked as expected.

df.head()
Player Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP
0 Timothe Luwawu-Cabarrot BKN vs. SAC 8/7/2020 27 10 6 0 0 0 0 17.2
1 Lonzo Ball NOP vs. WAS 8/7/2020 31 2 9 5 1 3 3 29.3
2 Caris LeVert BKN vs. SAC 8/7/2020 36 22 5 7 3 0 2 45.5
3 Al Horford PHI vs. ORL 8/7/2020 30 21 9 2 0 1 1 36.8
4 Terrence Ross ORL @ PHI 8/7/2020 26 0 3 2 2 1 0 15.6

Now, very similarly to how we created the James Harden dataframe, we are going to create a new dataframe to include all instances where a player scored at least 50 fantasy points. Just like the other dataframes, all analysis functions can be performed against this dataframe as well.

df50 = df[df["FP"]>=50]
df50.describe()
MIN PTS REB AST STL BLK TOV FP
count 879.000000 879.000000 879.000000 879.000000 879.000000 879.000000 879.000000 879.000000
mean 36.043231 30.170648 9.874858 6.370876 1.742890 1.278726 2.754266 57.887372
std 4.254484 8.473219 4.834650 3.719231 1.350733 1.456551 1.927774 7.127299
min 20.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000
25% 34.000000 25.000000 6.000000 3.000000 1.000000 0.000000 1.000000 52.300000
50% 36.000000 30.000000 10.000000 6.000000 2.000000 1.000000 2.000000 55.900000
75% 39.000000 35.000000 13.000000 9.000000 3.000000 2.000000 4.000000 61.200000
max 50.000000 61.000000 24.000000 18.000000 7.000000 10.000000 11.000000 88.800000
df50
Player Match_Up Game_Date MIN PTS REB AST STL BLK TOV FP
123 Michael Porter Jr. DEN vs. POR 8/6/2020 34 27 12 2 2 1 0 53.4
127 Damian Lillard POR @ DEN 8/6/2020 41 45 4 12 3 0 2 74.8
139 Deandre Ayton PHX vs. IND 8/6/2020 35 23 10 1 2 4 4 50.5
143 James Harden HOU vs. LAL 8/6/2020 36 39 8 12 3 1 10 68.6
151 Giannis Antetokounmpo MIL vs. MIA 8/6/2020 30 33 12 4 0 1 6 50.4
... ... ... ... ... ... ... ... ... ... ... ...
19757 Zach LaVine CHI vs. IND 1/10/2020 38 43 6 1 1 0 3 51.7
19766 DeMar DeRozan SAS @ MEM 1/10/2020 38 36 9 9 1 0 1 62.3
19781 Kawhi Leonard LAC vs. GSW 1/10/2020 38 36 9 5 3 1 5 61.3
19840 Giannis Antetokounmpo MIL vs. MIN 1/1/2020 31 32 17 4 0 2 4 60.4
19851 LeBron James LAL vs. PHX 1/1/2020 38 31 13 12 2 1 5 68.6

879 rows × 11 columns

That’ll do it for this tutorial. Remember to check out the corresponding youtube channel for more Daily Fantasy content, and subscribe to keep up-to-date with the most recent videos!

Previous
Previous

Analyzing Data with Pandas