Pickling our Models for Future Use

Introduction

To start with, we will revisit the code from our previous lesson, where we created a predictive model for each player cluster. I have cleaned up the code a bit and removed the unnecessary bits for clarity's sake, but the following code is pulled straight from the previous video.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset['AveDiff'] = dataset['FP'] - dataset['SeasonAve']
clusterdf = pd.read_excel(r"C:\Users\nfwya\OneDrive\Youtube\playerClusterNew2020.xlsx")
# Map each player name to their cluster number
clusterDict = pd.Series(clusterdf['Cluster'].values, index=clusterdf['Player']).to_dict()
dataset['Cluster'] = dataset['Player'].replace(clusterDict)
dataset.head()
Player Match_Up Game_Date FP Last3 Last5 Last7 SeasonAve AveDiff Cluster
0 Abdel Nader OKC @ NOP 2020-02-13 19.3 11.433333 10.46 9.728571 11.038235 8.261765 12
1 Brad Wanamaker BOS vs. LAC 2020-02-13 16.7 18.066667 17.70 21.757143 15.162745 1.537255 0
2 Chris Paul OKC @ NOP 2020-02-13 46.6 44.333333 40.16 40.485714 36.553704 10.046296 7
3 Daniel Theis BOS vs. LAC 2020-02-13 25.0 28.333333 22.06 23.728571 23.464583 1.535417 11
4 Danilo Gallinari OKC @ NOP 2020-02-13 36.9 32.500000 31.56 30.914286 30.508511 6.391489 9
dataset.describe()
FP Last3 Last5 Last7 SeasonAve AveDiff Cluster
count 14397.000000 14397.000000 14397.000000 14397.000000 14397.000000 1.439700e+04 14397.000000
mean 23.971800 23.933227 23.890260 23.857880 23.971800 1.263450e-16 6.440231
std 13.789482 11.611917 11.107322 10.883677 9.991757 9.503400e+00 4.018088
min -2.000000 -1.000000 -1.000000 -1.000000 7.640625 -3.663000e+01 0.000000
25% 13.500000 15.333333 15.720000 15.866667 16.668519 -6.486538e+00 3.000000
50% 22.100000 22.166667 22.040000 21.985714 22.050000 -6.037037e-01 6.000000
75% 32.500000 30.766667 30.260000 30.028571 29.914286 5.965116e+00 9.000000
max 88.400000 81.600000 81.600000 81.600000 57.629167 4.685357e+01 12.000000
clusterList = clusterdf['Cluster'].tolist()
uniqueClusterList = list(set(clusterList))
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
for cluster in uniqueClusterList:
    clusterData = dataset[dataset['Cluster'] == cluster]
    dfFeatures = clusterData[['Last3', 'Last5', 'Last7']]
    dfLabels = clusterData[['FP']]
    labels = np.array(dfLabels)
    features = np.array(dfFeatures)
    train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=10)
    reg = DecisionTreeRegressor(random_state=10)
    reg.fit(train, trainLabels)
    train_predictions = reg.predict(train)
    predictions = reg.predict(test)
    df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
    df0['actual'] = testLabels
    df0['predicted'] = predictions
    df0['error'] = abs(df0['actual'] - df0['predicted'])
    print(f"Cluster {cluster} average error is {df0['error'].mean()}")
Cluster 0 average error is 8.155613850996852
Cluster 1 average error is 6.072222222222222
Cluster 2 average error is 6.120833333333333
Cluster 3 average error is 9.728318584070795
Cluster 4 average error is 10.060000000000002
Cluster 5 average error is 10.058653846153845
Cluster 6 average error is 7.592279655400928
Cluster 7 average error is 9.32463556851312
Cluster 9 average error is 8.865070422535211
Cluster 11 average error is 8.125809352517987
Cluster 12 average error is 6.854464285714286

Moving Forward

Now that we have reviewed how to create a model, and how to create models rapidly in a batched process (a for loop), we are going to go over how to save a model for later use.

This is a very beneficial process for all modeling purposes. If you have a model that performs well and you need to re-run part of your pipeline to incorporate updated information (think re-running the cluster analysis every month to make sure your clusters stay up to date with playstyle changes, etc.), you don't want to risk losing any performance by retraining your model from scratch. Additionally, if you are going to run a predictive model for fantasy point projections, you won't want to train and test your model on your full dataset daily just to get an output.

Pickling

The pythonic process we will be utilizing for this is called pickling. The process is very similar to pickling things in real life, which is likely where the name came from. Pickling can be done for almost anything you want to save for later in python; while we will be utilizing it for our model here, you could just as easily pickle a dictionary to be called at a later date. I like to use this to save data dictionaries that reconcile player name differences between where stats get pulled from and the DK and FD player lists. This way I don't have to continually check for them and manually replace the differences.
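As a quick sketch of that dictionary use case (the file name and the name mappings below are made up for illustration):

```python
import pickle

# Hypothetical mapping from a stat site's spelling to the DK/FD spelling
nameFixes = {
    "Luka Doncic": "Luka Dončić",
    "PJ Washington": "P.J. Washington",
}

# Save the dictionary once...
with open("nameFixes.pkl", "wb") as f:
    pickle.dump(nameFixes, f)

# ...then load it back in any later session
with open("nameFixes.pkl", "rb") as f:
    loadedFixes = pickle.load(f)
```

From there, loadedFixes can be passed straight to something like dataset['Player'].replace(loadedFixes), the same way clusterDict is used above.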

First, let's go over how to pickle something in python. Step 1 is going to be importing the pickle module. No pip install is necessary for this one, as pickle ships with the standard library in both the python2 and python3 distributions.

import pickle

Now, to pickle an item, we need to first specify and open a file where we want to save the information, and give python write privileges in binary mode ("wb") so it can actually write to and save the file:

(You will need to have run the previous bits of code from the last session at least until uniqueClusterList is defined for this to work. Alternatively, you can define your own list to save and check with.)

with open(file="mytextfile.txt", mode="wb") as myfile:
    pickle.dump(uniqueClusterList, myfile)

Now that we have opened our file and pickled our unique cluster list into it, we can do some testing to see how it works. Right now, our cluster list is saved in memory since we have defined it in this kernel session, as shown below:

uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

Now, we will redefine uniqueClusterList to be empty, and then check it to confirm the original contents are gone:

uniqueClusterList = []
uniqueClusterList
[]

Now that we have established that our list is empty, we are going to un-pickle our saved file and re-assign the list value from it. To do this, we will first open our previously saved file with read-only privileges, then load our pickled data into a variable of our choosing and verify it assigns properly.

with open("mytextfile.txt", mode="rb") as myfile:
    uniqueClusterList = pickle.load(myfile)

Since we previously verified that the uniqueClusterList variable is an empty list, when we call it after loading our pickled list into that variable, it should now be the original list. Let's check and see!

uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

Now that we have established how to store something by pickling and how to access data that has been pickled, this exact same process can be utilized for almost anything in python you wish to store for later. For completeness, we will go ahead and step through saving and re-loading a model to show how it works in that capacity as well.
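One handy variation, sketched below with made-up values: since any picklable object works, you can bundle several related objects into a single container and save them all with one dump call.

```python
import pickle

# Made-up objects to save together
clusterList = [0, 1, 2, 7, 12]
settings = {"test_size": 0.4, "random_state": 10}

# One dump call stores the whole tuple
with open("bundle.pkl", "wb") as f:
    pickle.dump((clusterList, settings), f)

# Loading unpacks both objects in one step
with open("bundle.pkl", "rb") as f:
    loadedList, loadedSettings = pickle.load(f)
```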

First we will go ahead and copy some code from above to establish our model:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset['AveDiff'] = dataset['FP'] - dataset['SeasonAve']
clusterdf = pd.read_excel(r"C:\Users\nfwya\OneDrive\Youtube\playerClusterNew2020.xlsx")
clusterDict = pd.Series(clusterdf['Cluster'].values, index=clusterdf['Player']).to_dict()
dataset['Cluster'] = dataset['Player'].replace(clusterDict)

Now, we will simply isolate one cluster for use in this example:

clusterData = dataset[dataset['Cluster'] == 7]

Next up we will go through the process of creating, training, and testing our model:

dfFeatures = clusterData[['Last3', 'Last5', 'Last7']]
dfLabels = clusterData[['FP']]
labels = np.array(dfLabels)
features = np.array(dfFeatures)
train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=10)
reg = DecisionTreeRegressor(random_state=10)
reg.fit(train, trainLabels)
train_predictions = reg.predict(train)
predictions = reg.predict(test)
df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df0['actual'] = testLabels
df0['predicted'] = predictions
df0['error'] = abs(df0['actual'] - df0['predicted'])
print(f"Cluster 7 average error is {df0['error'].mean()}")
Cluster 7 average error is 9.32463556851312

Next, we will pickle our model using the same method as before:

with open(file="mymodelfile.txt", mode="wb") as myfile:
    pickle.dump(reg, myfile)
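If you wanted to fold this back into the batched for loop from earlier, one option is to collect every cluster's fitted model in a dictionary and pickle that dictionary once. The sketch below uses a tiny made-up DataFrame in place of the real dataset so it runs on its own:

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Tiny synthetic stand-in for the real dataset (values made up for this sketch)
dataset = pd.DataFrame({
    'Cluster': [7, 7, 7, 0, 0, 0],
    'Last3':   [20.0, 25.0, 30.0, 10.0, 12.0, 14.0],
    'Last5':   [21.0, 24.0, 29.0, 11.0, 13.0, 15.0],
    'Last7':   [22.0, 23.0, 28.0, 12.0, 14.0, 16.0],
    'FP':      [23.0, 26.0, 31.0, 13.0, 15.0, 17.0],
})
uniqueClusterList = [0, 7]

# Fit one model per cluster and collect them in a dictionary
clusterModels = {}
for cluster in uniqueClusterList:
    clusterData = dataset[dataset['Cluster'] == cluster]
    reg = DecisionTreeRegressor(random_state=10)
    reg.fit(np.array(clusterData[['Last3', 'Last5', 'Last7']]),
            np.array(clusterData[['FP']]))
    clusterModels[cluster] = reg

# One dump call saves every model at once
with open("allClusterModels.pkl", "wb") as f:
    pickle.dump(clusterModels, f)

# Later: load them all back and pull out the cluster you need
with open("allClusterModels.pkl", "rb") as f:
    loadedModels = pickle.load(f)
cluster7Model = loadedModels[7]
```

This keeps all of your cluster models in a single file instead of one file per cluster, which is easier to manage for a daily projection script.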

After this, we will skip clearing the reg variable as we did previously, and simply load the pickled model into a new variable name for testing purposes.

with open(file='mymodelfile.txt', mode='rb') as myfile:
    newModel = pickle.load(myfile)

Finally, we will run this new model over our same cluster 7 data, and we should get the same result if we performed everything correctly. Notice we will not need to train, test, or fit this model to our data, as it is already fitted and ready to go in the same condition we pickled it in. We will, however, reuse the same train/test datasets, since that is what we used previously and they are still split in the same manner; do not be confused by seeing the words train and test below, as they merely represent the 60/40 data splits from earlier.

train_predictions = newModel.predict(train)
predictions = newModel.predict(test)
df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df0['actual'] = testLabels
df0['predicted'] = predictions
df0['error'] = abs(df0['actual'] - df0['predicted'])
print(f"Cluster 7 average error is {df0['error'].mean()}")
Cluster 7 average error is 9.32463556851312

And that's all there is to it, folks. You can use pickling across a wide variety of applications, and it can really save you a lot of time.
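One closing side note: for scikit-learn models specifically, the joblib library offers the same dump/load pattern and can be more efficient for models that carry large numpy arrays internally. A minimal sketch (the file name is arbitrary, and the toy model here is made up for the example):

```python
import joblib
from sklearn.tree import DecisionTreeRegressor

# Fit a toy model on made-up data
reg = DecisionTreeRegressor(random_state=10)
reg.fit([[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0])

# Same idea as pickle.dump / pickle.load, minus the manual file handling
joblib.dump(reg, "mymodel.joblib")
newModel = joblib.load("mymodel.joblib")
```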

Previous: Monte Carlo-esque Simulation Introduction

Next: Intro to Predicting Fantasy Sports Scores