Intro to Predicting Fantasy Sports Scores
We’re finally here, ready to go over the basics of predicting fantasy sports scores!
In this tutorial we will run through a very basic example, a single decision tree regression, to introduce some of the fundamentals and terminology.
To start, we need to import the modules we will be using today. As usual, if you have not used any of these modules before, you will need to pip install them first.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
Next up, we need to load our data into a pandas dataframe. My dataset is pretty basic: it uses the rolling averages from the last 3, 5, and 7 games at some point in the 2019-2020 season.
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset.head()
| | Player | Match_Up | Game_Date | Last3 | Last5 | Last7 | SeasonAve | FP |
---|---|---|---|---|---|---|---|---|
0 | Aaron Gordon | ORL vs. CLE | 2019-10-23 | 23.900000 | 23.900000 | 23.900000 | 29.914286 | 23.9 |
1 | Aaron Gordon | ORL @ ATL | 2019-10-26 | 25.900000 | 25.900000 | 25.900000 | 29.914286 | 27.9 |
2 | Aaron Gordon | ORL @ TOR | 2019-10-28 | 23.433333 | 23.433333 | 23.433333 | 29.914286 | 18.5 |
3 | Aaron Gordon | ORL vs. NYK | 2019-10-30 | 27.366667 | 26.500000 | 26.500000 | 29.914286 | 35.7 |
4 | Aaron Gordon | ORL vs. MIL | 2019-11-01 | 24.500000 | 25.060000 | 25.060000 | 29.914286 | 19.3 |
Now we need to pare this dataset down to the columns we are actually going to be using.
dataset = dataset[['Player', 'Last3', 'Last5', 'Last7', 'FP']]
Then we are going to take a peek at just the Aaron Gordon data and see if it came through correctly.
datasetAG = dataset[dataset['Player'] == 'Aaron Gordon']
datasetAG.head()
| | Player | Last3 | Last5 | Last7 | FP |
---|---|---|---|---|---|
0 | Aaron Gordon | 23.900000 | 23.900000 | 23.900000 | 23.9 |
1 | Aaron Gordon | 25.900000 | 25.900000 | 25.900000 | 27.9 |
2 | Aaron Gordon | 23.433333 | 23.433333 | 23.433333 | 18.5 |
3 | Aaron Gordon | 27.366667 | 26.500000 | 26.500000 | 35.7 |
4 | Aaron Gordon | 24.500000 | 25.060000 | 25.060000 | 19.3 |
datasetAG.tail()
| | Player | Last3 | Last5 | Last7 | FP |
---|---|---|---|---|---|
44 | Aaron Gordon | 41.800000 | 36.26 | 32.742857 | 45.0 |
45 | Aaron Gordon | 39.400000 | 36.02 | 32.628571 | 30.3 |
46 | Aaron Gordon | 34.533333 | 36.80 | 34.271429 | 28.3 |
47 | Aaron Gordon | 34.133333 | 38.06 | 36.028571 | 43.8 |
48 | Aaron Gordon | 42.133333 | 40.34 | 40.300000 | 54.3 |
Now a few terminology points here that you will want to become familiar with if you want to continue to learn any machine learning methods going forward:
Features
Features generally refer to the data points you want your algorithm to learn from. These can be stats, location, or anything repeatable that you can also provide for an event that has yet to happen, so that a prediction can be made for it.
Labels
Labels will generally refer to what you are trying to predict. You will need labels for any supervised learning algorithm, as the entire point is to have fully fleshed out examples for the algorithm to learn how to predict. This can be fantasy points, fantasy points per minute, or whatever you want to actually predict.
Now we are going to go ahead and define our feature columns and label column, and just look at Aaron Gordon’s dataset for each.
featureNames = ['Last3', 'Last5', 'Last7']
labelName = ['FP']
dfFeatures = datasetAG[featureNames]
dfFeatures.head()
| | Last3 | Last5 | Last7 |
---|---|---|---|
0 | 23.900000 | 23.900000 | 23.900000 |
1 | 25.900000 | 25.900000 | 25.900000 |
2 | 23.433333 | 23.433333 | 23.433333 |
3 | 27.366667 | 26.500000 | 26.500000 |
4 | 24.500000 | 25.060000 | 25.060000 |
dfLabels = datasetAG[labelName]
dfLabels.head()
| | FP |
---|---|
0 | 23.9 |
1 | 27.9 |
2 | 18.5 |
3 | 35.7 |
4 | 19.3 |
Now we will need to convert these pandas dataframes into numpy arrays.
Array
An array is essentially a numerical dataframe without the row and column labels. If you are familiar with mathematical terminology, you can also think of a 2-D array as a matrix. The models we use here need numeric input, so if you have text-based data you want to incorporate, you will need to one-hot encode it to numerical values first.
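As a rough illustration of one-hot encoding, here is a minimal sketch using pandas `get_dummies`. The `Home` column and its values are made up for this example and are not part of the tutorial dataset.

```python
import pandas as pd

# Hypothetical sample: 'Home' is an invented text column, not in the tutorial data.
sample = pd.DataFrame({
    "Last3": [23.9, 25.9, 23.4],
    "Home": ["yes", "no", "no"],
})

# One-hot encode the text column into 0/1 indicator columns.
encoded = pd.get_dummies(sample, columns=["Home"], dtype=float)

# Every column is now numeric, so the frame converts cleanly to a float array.
encoded_array = encoded.to_numpy()

print(encoded.columns.tolist())
print(encoded_array)
```

Each distinct text value becomes its own column, so the array stays purely numeric.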
labels = np.array(dfLabels)
features = np.array(dfFeatures)
labels
array([[23.9],
[27.9],
[18.5],
[35.7],
[19.3],
[36.3],
[25.9],
[31.3],
[25.8],
[21.9],
[48.1],
[30.6],
[26.8],
[ 4.4],
[29.6],
[19.4],
[34.2],
[47.5],
[27. ],
[31.6],
[39.8],
[33.2],
[17.8],
[14. ],
[25.9],
[33.2],
[45.2],
[35.7],
[12.4],
[29.1],
[31.4],
[29.3],
[24.4],
[35.8],
[34.7],
[27.5],
[19.1],
[25.7],
[31.1],
[16.8],
[31.5],
[24.4],
[37.5],
[42.9],
[45. ],
[30.3],
[28.3],
[43.8],
[54.3]])
Once we have our data converted to numpy arrays, we are ready to start prepping for our machine learning algorithm. There are two core concepts here when it comes to supervised learning: the training dataset and the testing dataset.
Training dataset
The training dataset is a subset of your total dataset; usually at least 50% of the data goes into it. This is how your algorithm ‘learns’. If the training dataset is too small, or not representative of the whole, you risk overfitting your model so that it is ONLY accurate on this training dataset.
Testing dataset
The testing dataset is the dataset you test your model against to check for accuracy. It is composed of the remainder of the dataset not used in the training data. It is imperative not to mix these two datasets, or you will be stuck in an echo chamber, thinking your model is great because it is being tested against the same data it trained on.
I like to use at MINIMUM a 60/40 split for train/test datasets. If your data points are likely to be very similar to the points around them, you want them shuffled before splitting. The following function, train_test_split, shuffles by default (shuffle=True), and setting random_state makes that shuffle reproducible; only with shuffle=False would it simply take the first 60% of datapoints to train and the last 40% to test.
train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=(0.4), random_state = 0)
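To see how train_test_split orders rows, here is a small sketch on a toy array (ten numbered rows, an assumption made purely so the ordering is easy to read; it is not the tutorial data). By default the rows are shuffled, while shuffle=False gives a strictly ordered split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the features: ten rows numbered 0..9 so order is easy to see.
toy = np.arange(10).reshape(-1, 1)

# Default behaviour: rows are shuffled before splitting; random_state fixes the shuffle.
shuffled_train, shuffled_test = train_test_split(toy, test_size=0.4, random_state=0)

# shuffle=False keeps the original order: first 60% to train, last 40% to test.
ordered_train, ordered_test = train_test_split(toy, test_size=0.4, shuffle=False)

print("shuffled:", shuffled_train.ravel(), shuffled_test.ravel())
print("ordered: ", ordered_train.ravel(), ordered_test.ravel())
```

For game logs that follow a time order, which of the two behaviours you want is a judgment call; the tutorial code uses the shuffled default.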
Now, just for clarity I will display the train and test datasets so you can see the relative sizes and that they are not repeated.
train
array([[30.26666667, 29.88 , 33.01428571],
[27.73333333, 22.88 , 27.58571429],
[24.23333333, 25.9 , 25.15714286],
[32.8 , 36.02 , 32.72857143],
[31.13333333, 28.26 , 26.58571429],
[27.66666667, 27.72 , 27.54285714],
[20.6 , 26.36 , 26.98571429],
[24.36666667, 24.82 , 27.92857143],
[30.43333333, 27.54 , 26.93333333],
[33.7 , 27.02 , 27.5 ],
[32.66666667, 30.34 , 30.31428571],
[20.26666667, 27.9 , 26.74285714],
[25.3 , 27.62 , 28.32857143],
[25.9 , 25.9 , 25.9 ],
[35.16666667, 30.64 , 30.05714286],
[34.93333333, 30.62 , 29.98571429],
[19.23333333, 26.14 , 27.04285714],
[27.16666667, 27.14 , 26.78571429],
[21.66666667, 27.28 , 30.12857143],
[27.1 , 28.3 , 28.88571429],
[34.86666667, 35.82 , 33.24285714],
[35.36666667, 31.94 , 27.67142857],
[26.33333333, 28.24 , 28.02857143],
[24.53333333, 24.04 , 27.24285714],
[39.4 , 36.02 , 32.62857143],
[27.36666667, 26.5 , 26.5 ],
[23.9 , 23.9 , 23.9 ],
[34.13333333, 38.06 , 36.02857143],
[41.8 , 36.26 , 32.74285714]])
test
array([[25.73333333, 31.12 , 27.92857143],
[24.5 , 25.06 , 25.06 ],
[34.76666667, 27.22 , 29.87142857],
[24.3 , 30.76 , 30.41428571],
[28.36666667, 25.32 , 29.64285714],
[24.1 , 28.56 , 28.07142857],
[31.63333333, 31.12 , 28.15714286],
[26.46666667, 24.84 , 26.62857143],
[31.16666667, 29.7 , 27.84285714],
[31.93333333, 30.6 , 29.8 ],
[33.53333333, 31.54 , 31.41428571],
[29.93333333, 27.58 , 30.9 ],
[29.83333333, 30. , 28.3 ],
[38.03333333, 30.8 , 29.28571429],
[42.13333333, 40.34 , 40.3 ],
[23.43333333, 23.43333333, 23.43333333],
[34.53333333, 36.8 , 34.27142857],
[36.23333333, 31.54 , 26.98571429],
[17.8 , 22.16 , 25.82857143],
[31.1 , 30.48 , 26.31428571]])
trainLabels
array([[17.8],
[34.2],
[24.4],
[39.8],
[37.5],
[25.8],
[ 4.4],
[33.2],
[36.3],
[47.5],
[27.5],
[29.6],
[31.1],
[27.9],
[26.8],
[42.9],
[25.9],
[25.9],
[14. ],
[19.1],
[33.2],
[31.6],
[21.9],
[16.8],
[30.3],
[35.7],
[23.9],
[43.8],
[45. ]])
testLabels
array([[29.1],
[19.3],
[45.2],
[31.4],
[24.4],
[25.7],
[34.7],
[31.5],
[31.3],
[48.1],
[30.6],
[29.3],
[35.8],
[35.7],
[54.3],
[18.5],
[28.3],
[27. ],
[19.4],
[12.4]])
Okay, now that we have segregated our datasets, we are ready to establish our decision tree.
tree = DecisionTreeRegressor(random_state=0)
Now that our decision tree is defined, we can go ahead and train it on our train data and take a look at the depth and number of nodes.
tree.fit(train, trainLabels)
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')
Decision tree has 57 nodes with maximum depth 8.
Now, I think it is important to note that this is not the ideal decision tree, but for the sake of this tutorial we will continue with it, while getting further into what makes a ‘good’ or ‘bad’ tree in future tutorials.
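A binary tree with 57 nodes has 29 leaves, which is one leaf per training row, so this tree has essentially memorized the training games. One common way to rein that in is to cap the depth or require a minimum number of samples per leaf. The sketch below uses synthetic stand-in data, and the max_depth and min_samples_leaf values are illustrative assumptions rather than tuned settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 29 rows, 3 features, noisy target.
rng = np.random.default_rng(0)
X = rng.normal(30, 5, size=(29, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=29)

# Unconstrained tree: keeps splitting until the leaves cannot be split further.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constrained tree: illustrative limits, not tuned values.
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=3, random_state=0
).fit(X, y)

for name, model in [("full", full_tree), ("small", small_tree)]:
    print(f"{name}: {model.tree_.node_count} nodes, depth {model.get_depth()}")
```

A smaller tree will fit the training data less closely, which is often the point: it has less room to memorize noise.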
Moving forward, now that we have our decision tree, we are ready to take a look at the predictions it puts out and analyze the results.
train_predictions = tree.predict(train)
predictions = tree.predict(test)
df3 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df3.describe()
| | Last3 | Last5 | Last7 |
---|---|---|---|
count | 20.000000 | 20.000000 | 20.000000 |
mean | 29.976667 | 29.448667 | 29.022524 |
std | 5.795775 | 4.288313 | 3.604194 |
min | 17.800000 | 22.160000 | 23.433333 |
25% | 25.425000 | 26.745000 | 26.896429 |
50% | 30.516667 | 30.240000 | 28.228571 |
75% | 33.783333 | 31.120000 | 30.007143 |
max | 42.133333 | 40.340000 | 40.300000 |
df3['actual'] = testLabels
df3['predicted'] = predictions
df3['error'] = abs(df3['actual'] - df3['predicted'])
df3.head()
| | Last3 | Last5 | Last7 | actual | predicted | error |
---|---|---|---|---|---|---|
0 | 25.733333 | 31.12 | 27.928571 | 29.1 | 31.1 | 2.0 |
1 | 24.500000 | 25.06 | 25.060000 | 19.3 | 24.4 | 5.1 |
2 | 34.766667 | 27.22 | 29.871429 | 45.2 | 47.5 | 2.3 |
3 | 24.300000 | 30.76 | 30.414286 | 31.4 | 14.0 | 17.4 |
4 | 28.366667 | 25.32 | 29.642857 | 24.4 | 19.1 | 5.3 |
As we can see in the ‘error’ column above, there are some pretty big discrepancies between the actual score and the predicted score for some games! And that’s OKAY! No model is perfect, and none ever will be; if everything could be predicted perfectly, what would be the point in living, right? Anyway, all that means is that we have some work to do, which we already knew before we saw those numbers.
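Rather than eyeballing the ‘error’ column, it can help to boil it down to a single number. The sketch below computes the mean absolute error (and, for comparison, the root mean squared error) for the five games shown in the table above; the values are copied from that table, so this covers only those five rows, not the full test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Actual and predicted fantasy points for the five rows displayed above.
actual = np.array([29.1, 19.3, 45.2, 31.4, 24.4])
predicted = np.array([31.1, 24.4, 47.5, 14.0, 19.1])

# Mean absolute error: the average size of the misses, in fantasy points.
mae = mean_absolute_error(actual, predicted)

# Root mean squared error: penalizes the big misses more heavily.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

The single 17.4-point miss pulls the RMSE well above the MAE, which is a quick way to spot that a few games are driving most of the error.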
One thing that can increase the accuracy of some models is standardizing the input data prior to training. Standardizing rescales each feature to a mean of 0 and a standard deviation of 1, which gives a smaller range of data to work with while still maintaining the relative differences in scores. This tends to help models that are sensitive to feature scale; a decision tree only splits on thresholds, so it is usually much less affected. Let’s give it a shot and see if it makes a difference.
x = features
x= StandardScaler().fit_transform(x)
x
array([[-0.96346636, -1.28665376, -1.59744282],
[-0.61176645, -0.79589584, -0.95444901],
[-1.04552968, -1.40116394, -1.74747471],
[-0.35385318, -0.64866846, -0.76155087],
[-0.85795639, -1.00201416, -1.22450641],
[ 0.18542002, -0.39347434, -0.62223554],
[-0.38902317, -0.49162593, -0.66969461],
[ 0.31437666, 0.13654421, -0.32982645],
[-0.30109819, -0.34930613, -0.42627553],
[-0.5355648 , -0.22170907, -0.27011989],
[ 0.44919496, 0.35738527, 0.29938891],
[ 0.73055489, 0.5880415 , 0.81837677],
[ 1.01777649, 0.36720043, 0.38205955],
[-1.54377122, -0.68302152, -0.60539523],
[-1.60238787, -0.30513792, -0.68347305],
[-2.0361511 , -1.71361315, -0.97741307],
[-0.28937486, -1.5369403 , -0.41249709],
[ 0.75986322, -0.5210714 , -0.44005396],
[ 1.20534977, 0.5880415 , -0.60539523],
[ 1.05294648, 0.68619308, -0.38494021],
[ 0.60159825, 1.68733924, 1.24091556],
[ 0.9650215 , 1.63826344, 1.40625683],
[ 0.1561117 , 0.18071242, 1.33277182],
[-1.35619793, -0.45727287, 0.40502361],
[-1.7840995 , -0.73700489, -0.58702398],
[-0.88140305, -1.06090511, -0.30226958],
[ 0.9474365 , -0.47199561, 0.32235298],
[ 1.5218797 , 0.40646107, 0.13404765],
[ 0.30265333, 0.3279398 , -0.82125744],
[-0.64107477, 0.48498233, -0.30226958],
[-0.89312638, 0.39664591, 0.49687987],
[ 0.09749504, -0.38365918, 0.65303551],
[-0.17800322, -0.93821563, 0.24886797],
[ 0.07991005, 0.2101579 , -0.18285644],
[ 0.39643997, 0.48498233, -0.22878457],
[ 0.57815159, 0.29358674, 0.46473018],
[-0.4007465 , -0.20698633, 0.00544889],
[-0.92829637, -0.1431878 , -0.25634145],
[-0.71727642, -0.37384403, -0.17367082],
[-0.85209472, -1.2523007 , -0.5227246 ],
[-0.51211814, -1.05599753, -0.72021555],
[-0.90484971, -0.79589584, -1.19327528],
[ 0.30851499, -0.21680149, -0.73399399],
[ 0.97674483, 0.36229285, 0.35909548],
[ 2.18424787, 1.74623019, 1.24550837],
[ 1.76220797, 1.68733924, 1.20876587],
[ 0.90640485, 1.87873482, 1.73693936],
[ 0.83606486, 2.18791231, 2.30185534],
[ 2.24286452, 2.74737634, 3.6751064 ]])
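For reference, the scaled values are z-scores computed per column: each value minus the column mean, divided by the column standard deviation. The sketch below checks that by hand on a tiny sample (the first three Aaron Gordon rows shown earlier), so the numbers will not match the full 49-row output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three rows of the features shown earlier, used as a tiny sample.
sample = np.array([
    [23.9, 23.9, 23.9],
    [25.9, 25.9, 25.9],
    [23.433333, 23.433333, 23.433333],
])

scaled = StandardScaler().fit_transform(sample)

# StandardScaler uses the population standard deviation (ddof=0) per column,
# which is also NumPy's default for std().
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```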
train, test, trainLabels, testLabels = train_test_split(x, labels, test_size=(0.4), random_state = 0)
reg = DecisionTreeRegressor(random_state=50)
reg.fit(train, trainLabels)
DecisionTreeRegressor(random_state=50)
train_predictions = reg.predict(train)
predictions = reg.predict(test)
df4 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df4['actual'] = testLabels
df4['predicted'] = predictions
df4['error'] = abs(df4['actual'] - df4['predicted'])
There’s a lot to digest in that code block, but it’s pretty much the exact same process as before, just with the standardization occurring at the beginning. So now we can compare the two outcomes and see how different methods may produce different results.
df3.describe()
| | Last3 | Last5 | Last7 | actual | predicted | error |
---|---|---|---|---|---|---|
count | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.00000 |
mean | 29.976667 | 29.448667 | 29.022524 | 30.600000 | 29.150000 | 7.96000 |
std | 5.795775 | 4.288313 | 3.604194 | 10.178202 | 8.367953 | 5.58432 |
min | 17.800000 | 22.160000 | 23.433333 | 12.400000 | 14.000000 | 1.90000 |
25% | 25.425000 | 26.745000 | 26.896429 | 25.375000 | 25.450000 | 4.02500 |
50% | 30.516667 | 30.240000 | 28.228571 | 29.950000 | 27.500000 | 5.40000 |
75% | 33.783333 | 31.120000 | 30.007143 | 34.950000 | 31.225000 | 10.27500 |
max | 42.133333 | 40.340000 | 40.300000 | 54.300000 | 47.500000 | 20.60000 |
df4.describe()
| | Last3 | Last5 | Last7 | actual | predicted | error |
---|---|---|---|---|---|---|
count | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 |
mean | 0.105115 | 0.074872 | 0.049433 | 30.600000 | 29.105000 | 8.095000 |
std | 1.019187 | 1.052262 | 1.158737 | 10.178202 | 8.498203 | 5.387166 |
min | -2.036151 | -1.713613 | -1.747475 | 12.400000 | 14.000000 | 2.000000 |
25% | -0.695295 | -0.588551 | -0.634100 | 25.375000 | 25.450000 | 4.025000 |
50% | 0.200074 | 0.269049 | -0.205821 | 29.950000 | 27.500000 | 6.300000 |
75% | 0.774517 | 0.484982 | 0.365985 | 34.950000 | 31.600000 | 10.375000 |
max | 2.242865 | 2.747376 | 3.675106 | 54.300000 | 47.500000 | 20.600000 |
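As we can see here, in this specific instance, standardizing the data did not have a very large impact on our end result: the mean error went from 7.96 to about 8.10. Note that the two trees also used different random_state values (0 and 50), which may account for some of that small difference. Something you may want to try in the future is altering the random_state parameter to see how it affects the accuracy of the model.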
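If you want to experiment with random_state, a loop like the following is one way to do it. This is only a sketch on synthetic stand-in data (the data, the seed values, and the use of mean absolute error are all assumptions for illustration), and it keeps the train/test split fixed so that only the tree's own random_state changes.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 49 games, 3 rolling-average features.
rng = np.random.default_rng(0)
X = rng.normal(30, 5, size=(49, 3))
y = X.mean(axis=1) + rng.normal(0, 8, size=49)

# Keep the split fixed so only the tree's random_state varies.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit one tree per seed and record its test-set mean absolute error.
results = {}
for seed in [0, 1, 50]:
    model = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)
    results[seed] = mean_absolute_error(y_test, model.predict(X_test))

for seed, mae in results.items():
    print(f"random_state={seed}: test MAE {mae:.2f}")
```

Keep in mind that picking whichever seed scores best on the test set is a way of tuning to that test set, so any improvement found this way may not hold up on new games.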