Box Scores Dataset

Hey everybody welcome back! Today we’re going to be taking our first in-depth look at the pandas data analysis package for python. It’s going to be a pretty high level introductory video as we are not going to be doing anything terribly complex today, just dipping our toes in the water and learning the basics. I know that I’ve used pandas several times in previous videos without really explaining much about it other than the fact that you manage data with it. An easy way to think of pandas is that it is a more pythonic version of Microsoft Excel. You handle tabular data with it, can perform mathematical functions, break out different elements of your data into more manageable subsets, and visualize your data as well.

First things first, we are going to want to pull up the documentation for pandas to provide us with a strong foundation.

https://pandas.pydata.org/docs/

We are fortunate with pandas as they have VERY user friendly documentation.

What is pandas?

Pandas allows you to work within what’s known as dataframes. A dataframe is really just a table of information, very similar to an excel worksheet, the information stored can be text, numerical, formulas, etc. Your dataframe is broken into columns (Series), and rows. Each row will have an indexed value at the far left of your dataframe, the default is to use numerical indexes starting with 0, but you can redefine one of your existing columns as your index if you prefer. For example, you may want to have your data indexed by player, rather than an arbitrary numerical value to make it easier to navigate and slice your data up into more meaningful subsets.

What is pandas good for??

You may be watching and reading this thinking yeah but I’m already an excel whiz, I have all x,y,z processes set up in excel and I REALLY don’t want to migrate all of that to pandas, so why I should I understand it?? That is a VERY valid question, and at the end of the day, you should use whatever tools you can be the most efficient with. Personally, I like to use pandas over excel when possible, I find it more agile, efficient, and has a broader toolset to work with. However even at it’s most basic functionality, pandas is easily the best file conversion tool I have ever come across. Pandas can read in data from at least 19 different file types, probably more that’s just all I am aware of off the top of my head, and can export into any of those same data types. This functionality allows you to take data that doesn’t play nice with excel, and allows you to convert it into a nice tabular format and export straight into an .xlsx file format for future use. Incredibly convenient!

Okay, back to NBA stats!

To start using pandas we will need to read in our data using the pandas read function. There are quite a few optional parameters you can use as a part of this function, I recommend checking out the documentation for it if you are wanting a more targeted upload of data. Examples include specifying which columns to bring into pandas from the original source, auto-converting values to a specific data type, etc.

import pandas as pd

df = pd.read_excel(r"C:\Users\nfwya\Fanduel\BoxScores\20200808clean.xlsx")

Once we have the dataframe created and we can see our data is all here, there are a few functions within pandas you will want to get familiar with, especially for statistical analysis, but these can be very useful for dataset with varying data types as well. First up is the head and tail commands. These commands allow you to view the first x records in your dataframe using head, and the last x records in your dataframe using tail. This can be helpful if you are working with a large dataset and don’t want to have to render the entire thing each time you need to check and see the results of your work. An easy example of how this is beneficial is if you are sorting based on different variables, especially if it is a more complex sort. You can quickly check to make sure the rightful values are at the beginning and end of your sorted dataframe as a sanity check without viewing the entire thing.

df.head()

	Player	Match_Up	Game_Date	MIN	PTS	REB	AST	STL	BLK	TOV
0	Timothe Luwawu-Cabarrot	BKN vs. SAC	8/7/2020	27	10	6	0	0	0	0
1	Lonzo Ball	NOP vs. WAS	8/7/2020	31	2	9	5	1	3	3
2	Caris LeVert	BKN vs. SAC	8/7/2020	36	22	5	7	3	0	2
3	Al Horford	PHI vs. ORL	8/7/2020	30	21	9	2	0	1	1
4	Terrence Ross	ORL @ PHI	8/7/2020	26	0	3	2	2	1	0

Next up, you will want to get used to using the describe function. Checking the output of this function is often a first step I take in evaluating any numerical data source, as it can give me a single output that allows me to make sure that the number both make sense, and are what I was expecting. In addition to this, you can use the statistical values provided to aid in further analysis on a player by player or team by team level to give you a little more insight.

df.describe()

	MIN	PTS	REB	AST	STL	BLK	TOV
count	19872.000000	19872.000000	19872.000000	19872.000000	19872.000000	19872.000000	19872.000000
mean	24.548058	11.371679	4.548360	2.478915	0.777426	0.499396	1.407911
std	8.836700	8.454256	3.461526	2.535612	0.974553	0.868078	1.434660
min	6.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	18.000000	5.000000	2.000000	1.000000	0.000000	0.000000	0.000000
50%	25.000000	10.000000	4.000000	2.000000	1.000000	0.000000	1.000000
75%	32.000000	16.000000	6.000000	4.000000	1.000000	1.000000	2.000000
max	52.000000	61.000000	25.000000	19.000000	7.000000	10.000000	11.000000

Next up is the info function, the info function will give you a quick overview of the count of non-null values in that column, as well as the data type of that column. This isn’t terribly valuable right now, but if you are doing data prep or cleanup to feed into a model in the future, this will become an invaluable tool.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19872 entries, 0 to 19871
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Player     19872 non-null  object
 1   Match_Up   19872 non-null  object
 2   Game_Date  19872 non-null  object
 3   MIN        19872 non-null  int64 
 4   PTS        19872 non-null  int64 
 5   REB        19872 non-null  int64 
 6   AST        19872 non-null  int64 
 7   STL        19872 non-null  int64 
 8   BLK        19872 non-null  int64 
 9   TOV        19872 non-null  int64 
dtypes: int64(7), object(3)
memory usage: 1.5+ MB

In addition to these big picture analysis functions, it is also pretty easily to create a new dataframe of a subset of the data, for instance if you want to create a new datafame just to look at James Harden’s stats, you can simply define a new dataframe as a subset of the current dataframe as shown below:

HardenDF = df[df['Player'] == "James Harden"]

Now you can perform any functions as you performed on the full dataset, but just for James Harden by referencing the HardenDF dataframe instead of the df dataframe.

HardenDF.head()

	Player	Match_Up	Game_Date	MIN	PTS	REB	AST	STL	BLK	TOV
143	James Harden	HOU vs. LAL	8/6/2020	36	39	8	12	3	1	10
410	James Harden	HOU @ POR	8/4/2020	33	23	6	9	0	0	4
629	James Harden	HOU vs. MIL	8/2/2020	34	24	7	7	6	0	3
907	James Harden	HOU @ DAL	7/31/2020	43	49	9	8	3	3	1
1141	James Harden	HOU vs. ORL	3/8/2020	32	23	7	6	3	0	4

HardenDF.describe()

	MIN	PTS	REB	AST	STL	BLK	TOV
count	65.000000	65.000000	65.000000	65.000000	65.000000	65.000000	65.000000
mean	36.676923	34.323077	6.430769	7.476923	1.815385	0.876923	4.476923
std	4.362868	10.894534	2.783709	2.652285	1.309727	0.943857	2.278347
min	26.000000	12.000000	1.000000	3.000000	0.000000	0.000000	0.000000
25%	35.000000	28.000000	4.000000	6.000000	1.000000	0.000000	3.000000
50%	37.000000	35.000000	6.000000	7.000000	2.000000	1.000000	4.000000
75%	39.000000	41.000000	8.000000	9.000000	3.000000	1.000000	6.000000
max	49.000000	60.000000	16.000000	14.000000	6.000000	3.000000	11.000000

It is also possible to create a new column and calculate a value based on the values in the existing dataframe. For instance if you wanted to add a column for fantasy points scored, you can create a new column, and add in the points variables for each statistic and sum it up. These scoring metrics were what I remembered off the top of my head while creating this video, so the points per stat may not be accurate. You can alter the below formula as necessary for your scoring purposes by adjusting the number values pretty easily though.

df['FP'] = df["PTS"]+1.2*df["REB"]+1.5*df["AST"]+3*df["STL"]+3*df["BLK"]-df["TOV"]

Now we can utilize the previously used head function to double check and make sure it worked as expected.

df.head()

	Player	Match_Up	Game_Date	MIN	PTS	REB	AST	STL	BLK	TOV	FP
0	Timothe Luwawu-Cabarrot	BKN vs. SAC	8/7/2020	27	10	6	0	0	0	0	17.2
1	Lonzo Ball	NOP vs. WAS	8/7/2020	31	2	9	5	1	3	3	29.3
2	Caris LeVert	BKN vs. SAC	8/7/2020	36	22	5	7	3	0	2	45.5
3	Al Horford	PHI vs. ORL	8/7/2020	30	21	9	2	0	1	1	36.8
4	Terrence Ross	ORL @ PHI	8/7/2020	26	0	3	2	2	1	0	15.6

Now, very similarly to how we created the James Harden dataframe, we are going to create a new dataframe to include all instances where a player scored at least 50 fantasy points. Just like the other dataframes, all analysis functions can be performed against this dataframe as well.

df50 = df[df["FP"]>=50]

df50.describe()

	MIN	PTS	REB	AST	STL	BLK	TOV	FP
count	879.000000	879.000000	879.000000	879.000000	879.000000	879.000000	879.000000	879.000000
mean	36.043231	30.170648	9.874858	6.370876	1.742890	1.278726	2.754266	57.887372
std	4.254484	8.473219	4.834650	3.719231	1.350733	1.456551	1.927774	7.127299
min	20.000000	5.000000	0.000000	0.000000	0.000000	0.000000	0.000000	50.000000
25%	34.000000	25.000000	6.000000	3.000000	1.000000	0.000000	1.000000	52.300000
50%	36.000000	30.000000	10.000000	6.000000	2.000000	1.000000	2.000000	55.900000
75%	39.000000	35.000000	13.000000	9.000000	3.000000	2.000000	4.000000	61.200000
max	50.000000	61.000000	24.000000	18.000000	7.000000	10.000000	11.000000	88.800000

df50

	Player	Match_Up	Game_Date	MIN	PTS	REB	AST	STL	BLK	TOV	FP
123	Michael Porter Jr.	DEN vs. POR	8/6/2020	34	27	12	2	2	1	0	53.4
127	Damian Lillard	POR @ DEN	8/6/2020	41	45	4	12	3	0	2	74.8
139	Deandre Ayton	PHX vs. IND	8/6/2020	35	23	10	1	2	4	4	50.5
143	James Harden	HOU vs. LAL	8/6/2020	36	39	8	12	3	1	10	68.6
151	Giannis Antetokounmpo	MIL vs. MIA	8/6/2020	30	33	12	4	0	1	6	50.4
...	...	...	...	...	...	...	...	...	...	...	...
19757	Zach LaVine	CHI vs. IND	1/10/2020	38	43	6	1	1	0	3	51.7
19766	DeMar DeRozan	SAS @ MEM	1/10/2020	38	36	9	9	1	0	1	62.3
19781	Kawhi Leonard	LAC vs. GSW	1/10/2020	38	36	9	5	3	1	5	61.3
19840	Giannis Antetokounmpo	MIL vs. MIN	1/1/2020	31	32	17	4	0	2	4	60.4
19851	LeBron James	LAL vs. PHX	1/1/2020	38	31	13	12	2	1	5	68.6

879 rows × 11 columns

That’ll do it for this tutorial. Remember to check out the corresponding youtube channel for more Daily Fantasy content, and subscribe to keep up-to-date with the most recent videos!

Introduction to Pandas and Dataframes

Analyzing Data with Pandas