Intro to Predicting Fantasy Sports Scores

We’re finally here, ready to go over the basics of predicting fantasy sports scores!
In this tutorial we will run through a very basic example, a single decision tree regression, to introduce some of the core concepts and terminology.

To start, we need to import the modules we will be using today. As usual, if you have not used any of these modules before, you will need to pip install them first.

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeRegressor

Next, we need to load our data into a pandas dataframe. My dataset is a pretty basic one: each row holds a player's rolling fantasy point averages over the last 3, 5, and 7 games, taken from part of the 2019-2020 season.

dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset.head()
Player Match_Up Game_Date Last3 Last5 Last7 SeasonAve FP
0 Aaron Gordon ORL vs. CLE 2019-10-23 23.900000 23.900000 23.900000 29.914286 23.9
1 Aaron Gordon ORL @ ATL 2019-10-26 25.900000 25.900000 25.900000 29.914286 27.9
2 Aaron Gordon ORL @ TOR 2019-10-28 23.433333 23.433333 23.433333 29.914286 18.5
3 Aaron Gordon ORL vs. NYK 2019-10-30 27.366667 26.500000 26.500000 29.914286 35.7
4 Aaron Gordon ORL vs. MIL 2019-11-01 24.500000 25.060000 25.060000 29.914286 19.3
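
The spreadsheet arrives with the rolling columns already filled in, so this post does not show how they were built. As a rough sketch, the first five rows above are consistent with a per-player rolling mean that includes the current game and averages over fewer games early in the season. The `games` dataframe and the `Last3_prior` column are hypothetical names for illustration; only the five FP values come from the sample.

```python
import pandas as pd

# Hypothetical game log: the five FP values are copied from the sample above.
games = pd.DataFrame({
    "Player": ["Aaron Gordon"] * 5,
    "FP": [23.9, 27.9, 18.5, 35.7, 19.3],
})

# Rolling mean per player; min_periods=1 lets early games average over
# however many games exist so far instead of producing NaN.
for window in (3, 5, 7):
    games[f"Last{window}"] = (
        games.groupby("Player")["FP"]
        .transform(lambda s, w=window: s.rolling(w, min_periods=1).mean())
    )

# Variant that only uses games played BEFORE the current one, which is
# what is available when predicting a game that has not happened yet.
games["Last3_prior"] = (
    games.groupby("Player")["FP"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)

print(games.round(3))
```

Note the difference between the two variants: a window that includes the current game already contains part of the FP value being predicted, so the shifted version is likely the safer choice for real forecasts.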

Now we need to pare this dataset down to the columns we are actually going to be using.

dataset = dataset[['Player', 'Last3', 'Last5', 'Last7', 'FP']]

Then we are going to take a peek at just the Aaron Gordon data and see if it came through correctly.

datasetAG = dataset[dataset['Player'] == 'Aaron Gordon']
datasetAG.head()
Player Last3 Last5 Last7 FP
0 Aaron Gordon 23.900000 23.900000 23.900000 23.9
1 Aaron Gordon 25.900000 25.900000 25.900000 27.9
2 Aaron Gordon 23.433333 23.433333 23.433333 18.5
3 Aaron Gordon 27.366667 26.500000 26.500000 35.7
4 Aaron Gordon 24.500000 25.060000 25.060000 19.3
datasetAG.tail()
Player Last3 Last5 Last7 FP
44 Aaron Gordon 41.800000 36.26 32.742857 45.0
45 Aaron Gordon 39.400000 36.02 32.628571 30.3
46 Aaron Gordon 34.533333 36.80 34.271429 28.3
47 Aaron Gordon 34.133333 38.06 36.028571 43.8
48 Aaron Gordon 42.133333 40.34 40.300000 54.3

Now a few terminology points here that you will want to become familiar with if you want to continue to learn any machine learning methods going forward:

  • Features

    • Features will generally refer to the data points you want your algorithm to learn from. They can be stats, location, or anything repeatable that you can also provide for an event that has yet to happen, so the model has something to predict on.

  • Labels

    • Labels will generally refer to what you are trying to predict. You will need labels for any supervised learning algorithm, as the entire point is to have fully fleshed out examples for the algorithm to learn how to predict. This can be fantasy points, fantasy points per minute, or whatever you want to actually predict.

Now we are going to go ahead and define our feature columns and label column, and just look at Aaron Gordon’s dataset for each.

featureNames = ['Last3', 'Last5', 'Last7']
labelName = ['FP']
dfFeatures = datasetAG[featureNames]
dfFeatures.head()
Last3 Last5 Last7
0 23.900000 23.900000 23.900000
1 25.900000 25.900000 25.900000
2 23.433333 23.433333 23.433333
3 27.366667 26.500000 26.500000
4 24.500000 25.060000 25.060000
dfLabels = datasetAG[labelName]
dfLabels.head()
FP
0 23.9
1 27.9
2 18.5
3 35.7
4 19.3

Now we will need to convert these pandas dataframes into numpy arrays.

  • Array

    • An array is essentially just a numerical dataframe without the column names and index. If you are familiar with your mathematical terminology, another way to think of a two-dimensional array is as a matrix. For our purposes it must be numerical only, so if you have text-based data you want to incorporate, you will need to one-hot encode it into numerical values first.
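
As a minimal sketch of the encoding step mentioned above, pandas can turn a text column into 0/1 columns before the conversion to an array. The `Venue` column is invented for illustration and is not part of the sample spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: "Venue" is an invented text column for illustration.
df = pd.DataFrame({
    "Last3": [23.9, 25.9, 23.4],
    "Venue": ["home", "away", "away"],
})

# One 0/1 column per category; dtype=float keeps every column numeric.
encoded = pd.get_dummies(df, columns=["Venue"], dtype=float)

features_encoded = encoded.to_numpy()
print(encoded.columns.tolist())  # ['Last3', 'Venue_away', 'Venue_home']
print(features_encoded.dtype)    # float64
```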

labels = np.array(dfLabels)
features = np.array(dfFeatures)
labels
array([[23.9],
       [27.9],
       [18.5],
       [35.7],
       [19.3],
       [36.3],
       [25.9],
       [31.3],
       [25.8],
       [21.9],
       [48.1],
       [30.6],
       [26.8],
       [ 4.4],
       [29.6],
       [19.4],
       [34.2],
       [47.5],
       [27. ],
       [31.6],
       [39.8],
       [33.2],
       [17.8],
       [14. ],
       [25.9],
       [33.2],
       [45.2],
       [35.7],
       [12.4],
       [29.1],
       [31.4],
       [29.3],
       [24.4],
       [35.8],
       [34.7],
       [27.5],
       [19.1],
       [25.7],
       [31.1],
       [16.8],
       [31.5],
       [24.4],
       [37.5],
       [42.9],
       [45. ],
       [30.3],
       [28.3],
       [43.8],
       [54.3]])
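
Each label above sits in its own pair of brackets because a single-column dataframe converts to a two-dimensional column rather than a flat list. DecisionTreeRegressor accepts either shape, but some other scikit-learn estimators warn about column-shaped targets, so it can help to know how to flatten them. A small sketch using three of the FP values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"FP": [23.9, 27.9, 18.5]})

labels_2d = np.array(df[["FP"]])  # double brackets -> DataFrame -> shape (3, 1)
labels_1d = df["FP"].to_numpy()   # single brackets -> Series -> shape (3,)

print(labels_2d.shape, labels_1d.shape)

# ravel() flattens the column when an estimator expects 1-D targets.
flat = labels_2d.ravel()
print(flat.shape)
```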

Once we have our data converted to numpy arrays, we are ready to start prepping for our machine learning algorithm. There are two core concepts here when it comes to supervised learning: the training dataset and the testing dataset.

  • Training dataset

    • The training dataset is going to be a subset of your total dataset, usually at least 50% of the data will be incorporated into the training dataset. This is how your algorithm ‘learns’. It is important not to have a small dataset or a dataset not representative of the whole, or you will risk overfitting your model to ONLY be accurate with this training dataset.

  • Testing Dataset

    • The testing dataset is the dataset you test your model against to check for accuracy. This will be composed of the remainder of the dataset not used in the training data. It is imperative not to mix these 2 datasets, or you will be stuck in an echo chamber of thinking your model is great because it is being tested against the same data it trained on.

I like to use at MINIMUM a 60/40 split for train/test datasets. If each data point is likely to be very similar to the points around it, it is recommended to shuffle your data prior to splitting it. The following function, train_test_split, shuffles by default; with shuffle=False it would simply take the first 60% of data points to train and the last 40% to test.

train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=(0.4), random_state = 0)
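
To make the shuffling behaviour concrete, here is a small sketch on toy data (the integers 0 to 9 stand in for real rows). It contrasts the default shuffled split with an ordered split and checks that no row lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # toy features 0..9, stand-ins for real rows
y = np.arange(10)

# Default behaviour: rows are shuffled before the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
print(y_train, y_test)

# shuffle=False keeps the original order: first 60% train, last 40% test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, shuffle=False)
print(y_tr, y_te)

# Either way, no row should appear in both sets.
assert set(y_train).isdisjoint(y_test)
```

With game logs, an ordered split is sometimes preferred so the model is only tested on games that came after the ones it trained on; which option suits your data is a judgment call.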

Now, just for clarity I will display the train and test datasets so you can see the relative sizes and that they are not repeated.

train
array([[30.26666667, 29.88      , 33.01428571],
       [27.73333333, 22.88      , 27.58571429],
       [24.23333333, 25.9       , 25.15714286],
       [32.8       , 36.02      , 32.72857143],
       [31.13333333, 28.26      , 26.58571429],
       [27.66666667, 27.72      , 27.54285714],
       [20.6       , 26.36      , 26.98571429],
       [24.36666667, 24.82      , 27.92857143],
       [30.43333333, 27.54      , 26.93333333],
       [33.7       , 27.02      , 27.5       ],
       [32.66666667, 30.34      , 30.31428571],
       [20.26666667, 27.9       , 26.74285714],
       [25.3       , 27.62      , 28.32857143],
       [25.9       , 25.9       , 25.9       ],
       [35.16666667, 30.64      , 30.05714286],
       [34.93333333, 30.62      , 29.98571429],
       [19.23333333, 26.14      , 27.04285714],
       [27.16666667, 27.14      , 26.78571429],
       [21.66666667, 27.28      , 30.12857143],
       [27.1       , 28.3       , 28.88571429],
       [34.86666667, 35.82      , 33.24285714],
       [35.36666667, 31.94      , 27.67142857],
       [26.33333333, 28.24      , 28.02857143],
       [24.53333333, 24.04      , 27.24285714],
       [39.4       , 36.02      , 32.62857143],
       [27.36666667, 26.5       , 26.5       ],
       [23.9       , 23.9       , 23.9       ],
       [34.13333333, 38.06      , 36.02857143],
       [41.8       , 36.26      , 32.74285714]])
test
array([[25.73333333, 31.12      , 27.92857143],
       [24.5       , 25.06      , 25.06      ],
       [34.76666667, 27.22      , 29.87142857],
       [24.3       , 30.76      , 30.41428571],
       [28.36666667, 25.32      , 29.64285714],
       [24.1       , 28.56      , 28.07142857],
       [31.63333333, 31.12      , 28.15714286],
       [26.46666667, 24.84      , 26.62857143],
       [31.16666667, 29.7       , 27.84285714],
       [31.93333333, 30.6       , 29.8       ],
       [33.53333333, 31.54      , 31.41428571],
       [29.93333333, 27.58      , 30.9       ],
       [29.83333333, 30.        , 28.3       ],
       [38.03333333, 30.8       , 29.28571429],
       [42.13333333, 40.34      , 40.3       ],
       [23.43333333, 23.43333333, 23.43333333],
       [34.53333333, 36.8       , 34.27142857],
       [36.23333333, 31.54      , 26.98571429],
       [17.8       , 22.16      , 25.82857143],
       [31.1       , 30.48      , 26.31428571]])
trainLabels
array([[17.8],
       [34.2],
       [24.4],
       [39.8],
       [37.5],
       [25.8],
       [ 4.4],
       [33.2],
       [36.3],
       [47.5],
       [27.5],
       [29.6],
       [31.1],
       [27.9],
       [26.8],
       [42.9],
       [25.9],
       [25.9],
       [14. ],
       [19.1],
       [33.2],
       [31.6],
       [21.9],
       [16.8],
       [30.3],
       [35.7],
       [23.9],
       [43.8],
       [45. ]])
testLabels
array([[29.1],
       [19.3],
       [45.2],
       [31.4],
       [24.4],
       [25.7],
       [34.7],
       [31.5],
       [31.3],
       [48.1],
       [30.6],
       [29.3],
       [35.8],
       [35.7],
       [54.3],
       [18.5],
       [28.3],
       [27. ],
       [19.4],
       [12.4]])

Okay, now that we have segregated our datasets, we are ready to establish our decision tree.

tree = DecisionTreeRegressor(random_state=0)

Now that our decision tree is defined, we can go ahead and train it on our train data and take a look at the depth and number of nodes.

tree.fit(train, trainLabels)
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')
Decision tree has 57 nodes with maximum depth 8.

Now, I think it is important to note that this is not the ideal decision tree, but for the sake of this tutorial we will continue with it, while getting further into what makes a ‘good’ or ‘bad’ tree in future tutorials.
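
One hedged way to see why a deep tree is not automatically a good one is to compare an unrestricted tree with a shallow one on training versus test error. The sketch below uses synthetic stand-in data, not the spreadsheet from this post, and the depth limit of 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the article's spreadsheet): three noisy features.
rng = np.random.default_rng(0)
X = rng.normal(30, 6, size=(200, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

results = {}
for depth in (None, 3):  # None = keep splitting until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = {
        "depth": model.get_depth(),
        "leaves": model.get_n_leaves(),
        "train_mae": mean_absolute_error(y_train, model.predict(X_train)),
        "test_mae": mean_absolute_error(y_test, model.predict(X_test)),
    }
    print(depth, results[depth])
```

The unrestricted tree memorizes the training rows, so its training error drops to zero; the number that matters is the error on the test rows.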

Moving forward, now that we have our decision tree, we are ready to take a look at the predictions it puts out and analyze the results.

train_predictions = tree.predict(train)
predictions = tree.predict(test)
df3 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df3.describe()
Last3 Last5 Last7
count 20.000000 20.000000 20.000000
mean 29.976667 29.448667 29.022524
std 5.795775 4.288313 3.604194
min 17.800000 22.160000 23.433333
25% 25.425000 26.745000 26.896429
50% 30.516667 30.240000 28.228571
75% 33.783333 31.120000 30.007143
max 42.133333 40.340000 40.300000
df3['actual'] = testLabels
df3['predicted'] = predictions
df3['error'] = abs(df3['actual'] - df3['predicted'])
df3.head()
Last3 Last5 Last7 actual predicted error
0 25.733333 31.12 27.928571 29.1 31.1 2.0
1 24.500000 25.06 25.060000 19.3 24.4 5.1
2 34.766667 27.22 29.871429 45.2 47.5 2.3
3 24.300000 30.76 30.414286 31.4 14.0 17.4
4 28.366667 25.32 29.642857 24.4 19.1 5.3

As we can see in the ‘error’ column above, there are some pretty big discrepancies between the actual score and the predicted score for some of these games! And that’s OKAY! No model is perfect, and none ever will be. If everything could be predicted perfectly, what would be the point in living, right? Anyway, all it means is that we have some work to do, which we already knew before we saw those numbers.
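
Rather than eyeballing the error column row by row, the errors can be boiled down to a single number. As a sketch, here is the mean absolute error (MAE) and root mean squared error (RMSE) for just the five test rows displayed above, using scikit-learn's metrics module.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# The five test rows displayed above.
actual = np.array([29.1, 19.3, 45.2, 31.4, 24.4])
predicted = np.array([31.1, 24.4, 47.5, 14.0, 19.1])

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))

print(f"MAE:  {mae:.2f}")  # 6.42
print(f"RMSE: {rmse:.2f}")
```

RMSE penalizes large misses such as the 17.4 point error more heavily than MAE does, so the two together give a feel for how much of the error comes from a few bad games.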

One way to possibly increase the accuracy of a model is to standardize the input data before training. Standardizing rescales each feature to a mean of 0 and a standard deviation of 1, which puts everything on a smaller, comparable range while still maintaining the relative differences in scores. This helps many algorithms, although decision trees split on thresholds within each feature and tend to be much less sensitive to scaling. Let’s give it a shot and see if it makes a difference.

x = features
x = StandardScaler().fit_transform(x)
x
array([[-0.96346636, -1.28665376, -1.59744282],
       [-0.61176645, -0.79589584, -0.95444901],
       [-1.04552968, -1.40116394, -1.74747471],
       [-0.35385318, -0.64866846, -0.76155087],
       [-0.85795639, -1.00201416, -1.22450641],
       [ 0.18542002, -0.39347434, -0.62223554],
       [-0.38902317, -0.49162593, -0.66969461],
       [ 0.31437666,  0.13654421, -0.32982645],
       [-0.30109819, -0.34930613, -0.42627553],
       [-0.5355648 , -0.22170907, -0.27011989],
       [ 0.44919496,  0.35738527,  0.29938891],
       [ 0.73055489,  0.5880415 ,  0.81837677],
       [ 1.01777649,  0.36720043,  0.38205955],
       [-1.54377122, -0.68302152, -0.60539523],
       [-1.60238787, -0.30513792, -0.68347305],
       [-2.0361511 , -1.71361315, -0.97741307],
       [-0.28937486, -1.5369403 , -0.41249709],
       [ 0.75986322, -0.5210714 , -0.44005396],
       [ 1.20534977,  0.5880415 , -0.60539523],
       [ 1.05294648,  0.68619308, -0.38494021],
       [ 0.60159825,  1.68733924,  1.24091556],
       [ 0.9650215 ,  1.63826344,  1.40625683],
       [ 0.1561117 ,  0.18071242,  1.33277182],
       [-1.35619793, -0.45727287,  0.40502361],
       [-1.7840995 , -0.73700489, -0.58702398],
       [-0.88140305, -1.06090511, -0.30226958],
       [ 0.9474365 , -0.47199561,  0.32235298],
       [ 1.5218797 ,  0.40646107,  0.13404765],
       [ 0.30265333,  0.3279398 , -0.82125744],
       [-0.64107477,  0.48498233, -0.30226958],
       [-0.89312638,  0.39664591,  0.49687987],
       [ 0.09749504, -0.38365918,  0.65303551],
       [-0.17800322, -0.93821563,  0.24886797],
       [ 0.07991005,  0.2101579 , -0.18285644],
       [ 0.39643997,  0.48498233, -0.22878457],
       [ 0.57815159,  0.29358674,  0.46473018],
       [-0.4007465 , -0.20698633,  0.00544889],
       [-0.92829637, -0.1431878 , -0.25634145],
       [-0.71727642, -0.37384403, -0.17367082],
       [-0.85209472, -1.2523007 , -0.5227246 ],
       [-0.51211814, -1.05599753, -0.72021555],
       [-0.90484971, -0.79589584, -1.19327528],
       [ 0.30851499, -0.21680149, -0.73399399],
       [ 0.97674483,  0.36229285,  0.35909548],
       [ 2.18424787,  1.74623019,  1.24550837],
       [ 1.76220797,  1.68733924,  1.20876587],
       [ 0.90640485,  1.87873482,  1.73693936],
       [ 0.83606486,  2.18791231,  2.30185534],
       [ 2.24286452,  2.74737634,  3.6751064 ]])
train, test, trainLabels, testLabels = train_test_split(x, labels, test_size=(0.4), random_state = 0)
reg = DecisionTreeRegressor(random_state=50)
reg.fit(train, trainLabels)
DecisionTreeRegressor(random_state=50)
train_predictions = reg.predict(train)
predictions = reg.predict(test)
df4 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df4['actual'] = testLabels
df4['predicted'] = predictions
df4['error'] = abs(df4['actual'] - df4['predicted'])

There’s a lot to digest in that code block, but it’s pretty much the exact same process as before, just with the standardization occurring at the beginning. So now we can compare the two outcomes and see how different methods may produce different results.
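
A variation worth knowing about, sketched here on synthetic stand-in data rather than the spreadsheet from this post: fit the scaler on the training rows only, so the test rows do not influence the mean and standard deviation, and give both trees the same random_state so that scaling is the only difference between them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the article's spreadsheet).
rng = np.random.default_rng(0)
X = rng.normal(30, 6, size=(200, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Learn the mean and standard deviation from the training rows only,
# then apply that same transform to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Same random_state for both trees so the scaling is the only difference.
raw_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
std_tree = DecisionTreeRegressor(random_state=0).fit(X_train_std, y_train)

diff = np.abs(raw_tree.predict(X_test) - std_tree.predict(X_test_std))
print("largest prediction difference:", diff.max())
```

Because standardizing does not change the ordering of values within a feature, the two trees should typically choose equivalent splits and the printed difference should be at or near zero.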

df3.describe()
Last3 Last5 Last7 actual predicted error
count 20.000000 20.000000 20.000000 20.000000 20.000000 20.00000
mean 29.976667 29.448667 29.022524 30.600000 29.150000 7.96000
std 5.795775 4.288313 3.604194 10.178202 8.367953 5.58432
min 17.800000 22.160000 23.433333 12.400000 14.000000 1.90000
25% 25.425000 26.745000 26.896429 25.375000 25.450000 4.02500
50% 30.516667 30.240000 28.228571 29.950000 27.500000 5.40000
75% 33.783333 31.120000 30.007143 34.950000 31.225000 10.27500
max 42.133333 40.340000 40.300000 54.300000 47.500000 20.60000
df4.describe()
Last3 Last5 Last7 actual predicted error
count 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000
mean 0.105115 0.074872 0.049433 30.600000 29.105000 8.095000
std 1.019187 1.052262 1.158737 10.178202 8.498203 5.387166
min -2.036151 -1.713613 -1.747475 12.400000 14.000000 2.000000
25% -0.695295 -0.588551 -0.634100 25.375000 25.450000 4.025000
50% 0.200074 0.269049 -0.205821 29.950000 27.500000 6.300000
75% 0.774517 0.484982 0.365985 34.950000 31.600000 10.375000
max 2.242865 2.747376 3.675106 54.300000 47.500000 20.600000

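As we can see here, in this specific instance, standardizing the data did not have a very large impact on our end result. Note that the two trees were also built with different random_state values (0 and 50), so the small differences may come from that rather than from the scaling. Something you may want to try in the future is altering the random_state parameter to see whether it produces a more accurate model.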
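
A hedged sketch of that experiment, again on synthetic stand-in data rather than the spreadsheet from this post: loop over a handful of seeds, refit, and look at how much the test error moves. The choice of five seeds is arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the article's spreadsheet).
rng = np.random.default_rng(0)
X = rng.normal(30, 6, size=(200, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=200)

maes = []
for seed in range(5):
    # The seed decides which rows land in the test set and also drives
    # the tree's internal randomness.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed
    )
    model = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)
    maes.append(mean_absolute_error(y_test, model.predict(X_test)))

print("MAE per seed:", np.round(maes, 2))
print("mean:", np.mean(maes), "spread:", np.std(maes))
```

If the spread across seeds is about as large as the gap between two approaches, that gap is probably noise, and a seed that happens to score well on one split may not hold up on new games.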
