We’re finally here, ready to start going over the basics of predicting fantasy sports scores!
In this tutorial we will run through a very basic example, a single decision tree regression, to cover some of the basics and terminology.

To start, we need to import all the modules we will be using today. As usual, if you have not used any of these modules before, you will need to pip install them first.

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeRegressor

Next up we will need to import our data into a pandas dataframe. My dataset is pretty basic: it uses the rolling averages from the last 3, 5, and 7 games from some point in the 2019-2020 season.

dataset = pd.read_excel("SampleDataForYT.xlsx")

dataset.head()

	Player	Match_Up	Game_Date	Last3	Last5	Last7	SeasonAve	FP
0	Aaron Gordon	ORL vs. CLE	2019-10-23	23.900000	23.900000	23.900000	29.914286	23.9
1	Aaron Gordon	ORL @ ATL	2019-10-26	25.900000	25.900000	25.900000	29.914286	27.9
2	Aaron Gordon	ORL @ TOR	2019-10-28	23.433333	23.433333	23.433333	29.914286	18.5
3	Aaron Gordon	ORL vs. NYK	2019-10-30	27.366667	26.500000	26.500000	29.914286	35.7
4	Aaron Gordon	ORL vs. MIL	2019-11-01	24.500000	25.060000	25.060000	29.914286	19.3

Now we need to pare this dataset down to the columns we are actually going to be using.

dataset = dataset[['Player', 'Last3', 'Last5', 'Last7', 'FP']]

Then we are going to take a peek at just the Aaron Gordon data and see if it came through correctly.

datasetAG = dataset[dataset['Player'] == 'Aaron Gordon']

datasetAG.head()

	Player	Last3	Last5	Last7	FP
0	Aaron Gordon	23.900000	23.900000	23.900000	23.9
1	Aaron Gordon	25.900000	25.900000	25.900000	27.9
2	Aaron Gordon	23.433333	23.433333	23.433333	18.5
3	Aaron Gordon	27.366667	26.500000	26.500000	35.7
4	Aaron Gordon	24.500000	25.060000	25.060000	19.3

datasetAG.tail()

	Player	Last3	Last5	Last7	FP
44	Aaron Gordon	41.800000	36.26	32.742857	45.0
45	Aaron Gordon	39.400000	36.02	32.628571	30.3
46	Aaron Gordon	34.533333	36.80	34.271429	28.3
47	Aaron Gordon	34.133333	38.06	36.028571	43.8
48	Aaron Gordon	42.133333	40.34	40.300000	54.3

Now a few terminology points here that you will want to become familiar with if you want to continue to learn any machine learning methods going forward:

Features
- Features will generally refer to the data points you want your algorithm to learn from. These can be stats, location, or anything repeatable that you can also provide for an event that has yet to happen, so that the model can make a prediction for it.
Labels
- Labels will generally refer to what you are trying to predict. You will need labels for any supervised learning algorithm, as the entire point is to have fully fleshed out examples for the algorithm to learn how to predict. This can be fantasy points, fantasy points per minute, or whatever you want to actually predict.

Now we are going to go ahead and define our feature columns and label column, and just look at Aaron Gordon’s dataset for each.

featureNames = ['Last3', 'Last5', 'Last7']
labelName = ['FP']
dfFeatures = datasetAG[featureNames]
dfFeatures.head()

	Last3	Last5	Last7
0	23.900000	23.900000	23.900000
1	25.900000	25.900000	25.900000
2	23.433333	23.433333	23.433333
3	27.366667	26.500000	26.500000
4	24.500000	25.060000	25.060000

dfLabels = datasetAG[labelName]

dfLabels.head()

	FP
0	23.9
1	27.9
2	18.5
3	35.7
4	19.3

Now we will need to convert these pandas dataframes into numpy arrays.

Array
- An array is essentially just a numerical dataframe. If you are familiar with mathematical terminology, another way to think of a two-dimensional array is as a matrix. For our models it needs to hold numbers only, so if you have text-based data you want to incorporate, you will need to one-hot encode it into numerical values first.
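
As a small illustration of encoding text data, here is a sketch using pandas `get_dummies`. The `Position` column and its values are invented for this example and are not part of the tutorial dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Position' is an invented text column, not in the tutorial dataset.
df = pd.DataFrame({
    'Position': ['PF', 'PG', 'C', 'PF'],
    'Last3': [23.9, 31.2, 18.4, 25.9],
})

# One-hot encode the text column: one 0/1 column per distinct value.
encoded = pd.get_dummies(df, columns=['Position'], dtype=float)

# Every column is now numeric, so the conversion yields a float array.
features_example = np.array(encoded)

print(encoded.columns.tolist())
print(features_example.shape)
```

Each distinct position gets its own column, so the array stays purely numerical.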

labels = np.array(dfLabels)
features = np.array(dfFeatures)

labels

array([[23.9],
       [27.9],
       [18.5],
       [35.7],
       [19.3],
       [36.3],
       [25.9],
       [31.3],
       [25.8],
       [21.9],
       [48.1],
       [30.6],
       [26.8],
       [ 4.4],
       [29.6],
       [19.4],
       [34.2],
       [47.5],
       [27. ],
       [31.6],
       [39.8],
       [33.2],
       [17.8],
       [14. ],
       [25.9],
       [33.2],
       [45.2],
       [35.7],
       [12.4],
       [29.1],
       [31.4],
       [29.3],
       [24.4],
       [35.8],
       [34.7],
       [27.5],
       [19.1],
       [25.7],
       [31.1],
       [16.8],
       [31.5],
       [24.4],
       [37.5],
       [42.9],
       [45. ],
       [30.3],
       [28.3],
       [43.8],
       [54.3]])

Once we have our data converted to numpy arrays, we are ready to start prepping for our machine learning algorithm. There are two core concepts here when it comes to supervised learning: the training dataset and the testing dataset.

Training dataset
- The training dataset is a subset of your total dataset; usually at least 50% of the data goes into it. This is how your algorithm ‘learns’. If the training dataset is too small, or not representative of the whole, you risk overfitting your model so that it is ONLY accurate on this training dataset.
Testing dataset
- The testing dataset is the dataset you test your model against to check for accuracy. It is composed of the remainder of the dataset not used in the training data. It is imperative not to mix these 2 datasets, or you will be stuck in an echo chamber of thinking your model is great because it is being tested against the same data it trained on.
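
The echo-chamber effect is easy to reproduce. The sketch below uses synthetic stand-in data (not the tutorial dataset) and scores a fully grown decision tree twice: once on the rows it trained on and once on held-out rows. The variable names and the noise level are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the tutorial dataset): 3 features, noisy target.
rng = np.random.default_rng(1)
X = rng.normal(30, 5, size=(60, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Scoring on the rows the tree was trained on looks misleadingly good.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
# Scoring on held-out rows gives the more honest picture.
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f'train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}')
```

An unconstrained tree can memorize its training rows, which is why only the held-out error says anything about future games.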

I like to use at MINIMUM a 60/40 split for train/test datasets. If each data point is likely to be very similar to the points around it, it is recommended to shuffle your data before splitting it. The following function does this by default (shuffle=True); with shuffle=False it would simply take the first 60% of data points to train and the last 40% to test.

train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=(0.4), random_state = 0)
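
A quick way to check how `train_test_split` orders the rows is to run it on numbered data. This sketch uses ten made-up rows standing in for games; the variable names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: rows numbered 0-9 so the ordering is easy to see.
rows = np.arange(10).reshape(-1, 1)
targets = np.arange(10)

# Default behaviour (shuffle=True): rows are shuffled before splitting.
shuf_train, shuf_test, _, _ = train_test_split(
    rows, targets, test_size=0.4, random_state=0)

# shuffle=False: the first 60% become the training set, the last 40% the test set.
seq_train, seq_test, _, _ = train_test_split(
    rows, targets, test_size=0.4, shuffle=False)

print('shuffled:  ', shuf_train.ravel(), shuf_test.ravel())
print('sequential:', seq_train.ravel(), seq_test.ravel())
```

Either way, every row lands in exactly one of the two sets, so nothing is repeated between train and test.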

Now, just for clarity I will display the train and test datasets so you can see the relative sizes and that they are not repeated.

train

array([[30.26666667, 29.88      , 33.01428571],
       [27.73333333, 22.88      , 27.58571429],
       [24.23333333, 25.9       , 25.15714286],
       [32.8       , 36.02      , 32.72857143],
       [31.13333333, 28.26      , 26.58571429],
       [27.66666667, 27.72      , 27.54285714],
       [20.6       , 26.36      , 26.98571429],
       [24.36666667, 24.82      , 27.92857143],
       [30.43333333, 27.54      , 26.93333333],
       [33.7       , 27.02      , 27.5       ],
       [32.66666667, 30.34      , 30.31428571],
       [20.26666667, 27.9       , 26.74285714],
       [25.3       , 27.62      , 28.32857143],
       [25.9       , 25.9       , 25.9       ],
       [35.16666667, 30.64      , 30.05714286],
       [34.93333333, 30.62      , 29.98571429],
       [19.23333333, 26.14      , 27.04285714],
       [27.16666667, 27.14      , 26.78571429],
       [21.66666667, 27.28      , 30.12857143],
       [27.1       , 28.3       , 28.88571429],
       [34.86666667, 35.82      , 33.24285714],
       [35.36666667, 31.94      , 27.67142857],
       [26.33333333, 28.24      , 28.02857143],
       [24.53333333, 24.04      , 27.24285714],
       [39.4       , 36.02      , 32.62857143],
       [27.36666667, 26.5       , 26.5       ],
       [23.9       , 23.9       , 23.9       ],
       [34.13333333, 38.06      , 36.02857143],
       [41.8       , 36.26      , 32.74285714]])

test

array([[25.73333333, 31.12      , 27.92857143],
       [24.5       , 25.06      , 25.06      ],
       [34.76666667, 27.22      , 29.87142857],
       [24.3       , 30.76      , 30.41428571],
       [28.36666667, 25.32      , 29.64285714],
       [24.1       , 28.56      , 28.07142857],
       [31.63333333, 31.12      , 28.15714286],
       [26.46666667, 24.84      , 26.62857143],
       [31.16666667, 29.7       , 27.84285714],
       [31.93333333, 30.6       , 29.8       ],
       [33.53333333, 31.54      , 31.41428571],
       [29.93333333, 27.58      , 30.9       ],
       [29.83333333, 30.        , 28.3       ],
       [38.03333333, 30.8       , 29.28571429],
       [42.13333333, 40.34      , 40.3       ],
       [23.43333333, 23.43333333, 23.43333333],
       [34.53333333, 36.8       , 34.27142857],
       [36.23333333, 31.54      , 26.98571429],
       [17.8       , 22.16      , 25.82857143],
       [31.1       , 30.48      , 26.31428571]])

trainLabels

array([[17.8],
       [34.2],
       [24.4],
       [39.8],
       [37.5],
       [25.8],
       [ 4.4],
       [33.2],
       [36.3],
       [47.5],
       [27.5],
       [29.6],
       [31.1],
       [27.9],
       [26.8],
       [42.9],
       [25.9],
       [25.9],
       [14. ],
       [19.1],
       [33.2],
       [31.6],
       [21.9],
       [16.8],
       [30.3],
       [35.7],
       [23.9],
       [43.8],
       [45. ]])

testLabels

array([[29.1],
       [19.3],
       [45.2],
       [31.4],
       [24.4],
       [25.7],
       [34.7],
       [31.5],
       [31.3],
       [48.1],
       [30.6],
       [29.3],
       [35.8],
       [35.7],
       [54.3],
       [18.5],
       [28.3],
       [27. ],
       [19.4],
       [12.4]])

Okay, now that we have segregated our datasets, we are ready to establish our decision tree.

tree = DecisionTreeRegressor(random_state=0)

Now that our decision tree is defined, we can go ahead and train it on our train data and take a look at the depth and number of nodes.

tree.fit(train, trainLabels)
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')

Decision tree has 57 nodes with maximum depth 8.

Now, I think it is important to note that this is not the ideal decision tree, but for the sake of this tutorial we will continue with it, while getting further into what makes a ‘good’ or ‘bad’ tree in future tutorials.
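
As a taste of what shaping a tree can look like, here is a minimal sketch on synthetic stand-in data (not the tutorial dataset). It compares an unconstrained tree with one limited by `max_depth` and `min_samples_leaf`; the specific values 3 and 3 are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the tutorial dataset): 49 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(30, 5, size=(49, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=49)

# An unconstrained tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# max_depth and min_samples_leaf are two common ways to limit tree growth.
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=3, random_state=0).fit(X, y)

for name, model in [('full', full_tree), ('small', small_tree)]:
    print(f'{name}: {model.tree_.node_count} nodes, depth {model.tree_.max_depth}')
```

A smaller tree has less room to memorize individual games, which is one of the levers for trading training accuracy against generalization.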

Moving forward, now that we have our decision tree, we are ready to take a look at the predictions it puts out and analyze the results.

train_predictions = tree.predict(train)
predictions = tree.predict(test)

df3 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])

df3.describe()

	Last3	Last5	Last7
count	20.000000	20.000000	20.000000
mean	29.976667	29.448667	29.022524
std	5.795775	4.288313	3.604194
min	17.800000	22.160000	23.433333
25%	25.425000	26.745000	26.896429
50%	30.516667	30.240000	28.228571
75%	33.783333	31.120000	30.007143
max	42.133333	40.340000	40.300000

df3['actual'] = testLabels
df3['predicted'] = predictions
df3['error'] = abs(df3['actual'] - df3['predicted'])

df3.head()

	Last3	Last5	Last7	actual	predicted	error
0	25.733333	31.12	27.928571	29.1	31.1	2.0
1	24.500000	25.06	25.060000	19.3	24.4	5.1
2	34.766667	27.22	29.871429	45.2	47.5	2.3
3	24.300000	30.76	30.414286	31.4	14.0	17.4
4	28.366667	25.32	29.642857	24.4	19.1	5.3

As we can see in the ‘error’ column above, there are some pretty big discrepancies between the actual score and the predicted score for particular games! And that’s OKAY! No model is perfect, and none ever will be. If everything could be predicted perfectly, what would be the point in living, right? Anyway, all that means is that we have some work to do, which we already knew before we saw those numbers.
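
To boil the ‘error’ column down to a single number, one common choice is the mean absolute error. The sketch below uses only the five rows shown in the table above, so the result describes those five games rather than the whole test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First five actual/predicted pairs copied from the table above.
actual = np.array([29.1, 19.3, 45.2, 31.4, 24.4])
predicted = np.array([31.1, 24.4, 47.5, 14.0, 19.1])

# Average of the absolute differences, i.e. the mean of the 'error' column.
mae = mean_absolute_error(actual, predicted)

print(f'Mean absolute error over these five games: {mae:.2f}')
```

This is the same quantity as `df3['error'].mean()`, just computed by scikit-learn.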

One thing we can try to increase the accuracy of our model is standardizing our input data prior to training. Standardizing rescales each feature to a mean of 0 and a standard deviation of 1, which makes for a smaller range of data to work with while still maintaining the relative differences between scores. This often helps models that are sensitive to the scale of their inputs, although a decision tree splits on thresholds within each feature, so it tends to be much less affected. Let’s give it a shot and see if it makes a difference.
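
To make the transformation concrete, this sketch standardizes the first five Aaron Gordon feature rows shown earlier and checks the result against the by-hand formula. Because it uses only five rows, the scaled values will not match the full-dataset output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five feature rows (Last3, Last5, Last7) from the data shown earlier.
sample = np.array([
    [23.900000, 23.900000, 23.900000],
    [25.900000, 25.900000, 25.900000],
    [23.433333, 23.433333, 23.433333],
    [27.366667, 26.500000, 26.500000],
    [24.500000, 25.060000, 25.060000],
])

scaled = StandardScaler().fit_transform(sample)

# The same result by hand: subtract each column's mean, divide by its standard deviation.
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

This tutorial scales the full dataset before splitting for simplicity. A stricter workflow fits the scaler on the training rows only and then applies it to the test rows, so no information from the test set leaks into training.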

x = features

x = StandardScaler().fit_transform(x)

x

array([[-0.96346636, -1.28665376, -1.59744282],
       [-0.61176645, -0.79589584, -0.95444901],
       [-1.04552968, -1.40116394, -1.74747471],
       [-0.35385318, -0.64866846, -0.76155087],
       [-0.85795639, -1.00201416, -1.22450641],
       [ 0.18542002, -0.39347434, -0.62223554],
       [-0.38902317, -0.49162593, -0.66969461],
       [ 0.31437666,  0.13654421, -0.32982645],
       [-0.30109819, -0.34930613, -0.42627553],
       [-0.5355648 , -0.22170907, -0.27011989],
       [ 0.44919496,  0.35738527,  0.29938891],
       [ 0.73055489,  0.5880415 ,  0.81837677],
       [ 1.01777649,  0.36720043,  0.38205955],
       [-1.54377122, -0.68302152, -0.60539523],
       [-1.60238787, -0.30513792, -0.68347305],
       [-2.0361511 , -1.71361315, -0.97741307],
       [-0.28937486, -1.5369403 , -0.41249709],
       [ 0.75986322, -0.5210714 , -0.44005396],
       [ 1.20534977,  0.5880415 , -0.60539523],
       [ 1.05294648,  0.68619308, -0.38494021],
       [ 0.60159825,  1.68733924,  1.24091556],
       [ 0.9650215 ,  1.63826344,  1.40625683],
       [ 0.1561117 ,  0.18071242,  1.33277182],
       [-1.35619793, -0.45727287,  0.40502361],
       [-1.7840995 , -0.73700489, -0.58702398],
       [-0.88140305, -1.06090511, -0.30226958],
       [ 0.9474365 , -0.47199561,  0.32235298],
       [ 1.5218797 ,  0.40646107,  0.13404765],
       [ 0.30265333,  0.3279398 , -0.82125744],
       [-0.64107477,  0.48498233, -0.30226958],
       [-0.89312638,  0.39664591,  0.49687987],
       [ 0.09749504, -0.38365918,  0.65303551],
       [-0.17800322, -0.93821563,  0.24886797],
       [ 0.07991005,  0.2101579 , -0.18285644],
       [ 0.39643997,  0.48498233, -0.22878457],
       [ 0.57815159,  0.29358674,  0.46473018],
       [-0.4007465 , -0.20698633,  0.00544889],
       [-0.92829637, -0.1431878 , -0.25634145],
       [-0.71727642, -0.37384403, -0.17367082],
       [-0.85209472, -1.2523007 , -0.5227246 ],
       [-0.51211814, -1.05599753, -0.72021555],
       [-0.90484971, -0.79589584, -1.19327528],
       [ 0.30851499, -0.21680149, -0.73399399],
       [ 0.97674483,  0.36229285,  0.35909548],
       [ 2.18424787,  1.74623019,  1.24550837],
       [ 1.76220797,  1.68733924,  1.20876587],
       [ 0.90640485,  1.87873482,  1.73693936],
       [ 0.83606486,  2.18791231,  2.30185534],
       [ 2.24286452,  2.74737634,  3.6751064 ]])

train, test, trainLabels, testLabels = train_test_split(x, labels, test_size=(0.4), random_state = 0)

reg = DecisionTreeRegressor(random_state=50)

reg.fit(train, trainLabels)

DecisionTreeRegressor(random_state=50)

train_predictions = reg.predict(train)
predictions = reg.predict(test)

df4 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df4['actual'] = testLabels
df4['predicted'] = predictions
df4['error'] = abs(df4['actual'] - df4['predicted'])

There’s a lot to digest in that code block, but it’s pretty much the exact same process as before, just with the standardization occurring at the beginning. So now we can compare the two outcomes and see how different methods may produce different results.

df3.describe()

	Last3	Last5	Last7	actual	predicted	error
count	20.000000	20.000000	20.000000	20.000000	20.000000	20.00000
mean	29.976667	29.448667	29.022524	30.600000	29.150000	7.96000
std	5.795775	4.288313	3.604194	10.178202	8.367953	5.58432
min	17.800000	22.160000	23.433333	12.400000	14.000000	1.90000
25%	25.425000	26.745000	26.896429	25.375000	25.450000	4.02500
50%	30.516667	30.240000	28.228571	29.950000	27.500000	5.40000
75%	33.783333	31.120000	30.007143	34.950000	31.225000	10.27500
max	42.133333	40.340000	40.300000	54.300000	47.500000	20.60000

df4.describe()

	Last3	Last5	Last7	actual	predicted	error
count	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000
mean	0.105115	0.074872	0.049433	30.600000	29.105000	8.095000
std	1.019187	1.052262	1.158737	10.178202	8.498203	5.387166
min	-2.036151	-1.713613	-1.747475	12.400000	14.000000	2.000000
25%	-0.695295	-0.588551	-0.634100	25.375000	25.450000	4.025000
50%	0.200074	0.269049	-0.205821	29.950000	27.500000	6.300000
75%	0.774517	0.484982	0.365985	34.950000	31.600000	10.375000
max	2.242865	2.747376	3.675106	54.300000	47.500000	20.600000

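As we can see here, in this specific instance, standardizing the data did not have a very large impact on our end result. Keep in mind that the second tree also used a different random_state (50 instead of 0), which may account for some of the small differences. Something you may want to try in the future is altering the random state parameter to see whether it produces a more accurate model.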
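
One way to experiment with the random state is to loop over a few seeds and compare the test error for each. This is a minimal sketch on synthetic stand-in data (not the tutorial dataset); here the seed is varied on the train/test split, and the range of five seeds is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the tutorial dataset).
rng = np.random.default_rng(0)
X = rng.normal(30, 5, size=(49, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=49)

results = {}
for seed in range(5):
    # The seed controls which rows land in the train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    results[seed] = mean_absolute_error(y_test, model.predict(X_test))

for seed, mae in results.items():
    print(f'random_state={seed}: MAE {mae:.2f}')
```

The spread across seeds gives a feel for how much of the error is down to the luck of the split. Simply keeping the seed with the lowest test error would overstate how good the model really is.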
