Pickling our Models for Future Use
Introduction
To start with, we will revisit the code from our previous lesson, where we created a predictive model for each player cluster. I have cleaned up the code a bit and removed the unnecessary bits for clarity's sake here, but the following code is pulled straight from the previous video.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset['AveDiff'] = dataset['FP'] - dataset['SeasonAve']
clusterdf = pd.read_excel(r"C:\Users\nfwya\OneDrive\Youtube\playerClusterNew2020.xlsx")
clusterDict = pd.Series(clusterdf['Cluster'].values, index=clusterdf['Player']).to_dict()
dataset['Cluster'] = dataset['Player']
dataset['Cluster'] = dataset['Cluster'].replace(clusterDict)
dataset.head()
| | Player | Match_Up | Game_Date | FP | Last3 | Last5 | Last7 | SeasonAve | AveDiff | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abdel Nader | OKC @ NOP | 2020-02-13 | 19.3 | 11.433333 | 10.46 | 9.728571 | 11.038235 | 8.261765 | 12 |
| 1 | Brad Wanamaker | BOS vs. LAC | 2020-02-13 | 16.7 | 18.066667 | 17.70 | 21.757143 | 15.162745 | 1.537255 | 0 |
| 2 | Chris Paul | OKC @ NOP | 2020-02-13 | 46.6 | 44.333333 | 40.16 | 40.485714 | 36.553704 | 10.046296 | 7 |
| 3 | Daniel Theis | BOS vs. LAC | 2020-02-13 | 25.0 | 28.333333 | 22.06 | 23.728571 | 23.464583 | 1.535417 | 11 |
| 4 | Danilo Gallinari | OKC @ NOP | 2020-02-13 | 36.9 | 32.500000 | 31.56 | 30.914286 | 30.508511 | 6.391489 | 9 |
dataset.describe()
| | FP | Last3 | Last5 | Last7 | SeasonAve | AveDiff | Cluster |
|---|---|---|---|---|---|---|---|
| count | 14397.000000 | 14397.000000 | 14397.000000 | 14397.000000 | 14397.000000 | 1.439700e+04 | 14397.000000 |
| mean | 23.971800 | 23.933227 | 23.890260 | 23.857880 | 23.971800 | 1.263450e-16 | 6.440231 |
| std | 13.789482 | 11.611917 | 11.107322 | 10.883677 | 9.991757 | 9.503400e+00 | 4.018088 |
| min | -2.000000 | -1.000000 | -1.000000 | -1.000000 | 7.640625 | -3.663000e+01 | 0.000000 |
| 25% | 13.500000 | 15.333333 | 15.720000 | 15.866667 | 16.668519 | -6.486538e+00 | 3.000000 |
| 50% | 22.100000 | 22.166667 | 22.040000 | 21.985714 | 22.050000 | -6.037037e-01 | 6.000000 |
| 75% | 32.500000 | 30.766667 | 30.260000 | 30.028571 | 29.914286 | 5.965116e+00 | 9.000000 |
| max | 88.400000 | 81.600000 | 81.600000 | 81.600000 | 57.629167 | 4.685357e+01 | 12.000000 |
clusterList = clusterdf['Cluster'].tolist()
uniqueClusterList = list(set(clusterList))
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
for cluster in uniqueClusterList:
    clusterData = dataset[dataset['Cluster'] == cluster]
    dfFeatures = clusterData[['Last3', 'Last5', 'Last7']]
    dfLabels = clusterData[['FP']]
    # flatten the (n, 1) label array to 1-D so sklearn and pandas handle it cleanly
    labels = np.array(dfLabels).ravel()
    features = np.array(dfFeatures)
    train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=10)
    reg = DecisionTreeRegressor(random_state=10)
    reg.fit(train, trainLabels)
    train_predictions = reg.predict(train)
    predictions = reg.predict(test)
    df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
    df0['actual'] = testLabels
    df0['predicted'] = predictions
    df0['error'] = abs(df0['actual'] - df0['predicted'])
    print(f"Cluster {cluster} average error is {df0['error'].mean()}")
Cluster 0 average error is 8.155613850996852
Cluster 1 average error is 6.072222222222222
Cluster 2 average error is 6.120833333333333
Cluster 3 average error is 9.728318584070795
Cluster 4 average error is 10.060000000000002
Cluster 5 average error is 10.058653846153845
Cluster 6 average error is 7.592279655400928
Cluster 7 average error is 9.32463556851312
Cluster 9 average error is 8.865070422535211
Cluster 11 average error is 8.125809352517987
Cluster 12 average error is 6.854464285714286
Moving Forward
Now that we have reviewed how to create a model, and how to create many of them rapidly in a batched process (a for loop), we are going to go over how to save a model for later use.

Saving models is beneficial for almost any modeling workflow. If you have a model that performs well and you need to re-run your pipeline to incorporate updated information (think re-running the cluster analysis every month to make sure your clusters stay up to date with playstyle changes, etc.), you don't want to risk losing performance by retraining the model from scratch. Additionally, if you are going to run a predictive model for fantasy point projections every day, you aren't going to want to train and test your model on your full dataset daily just to get an output.
Pickling
The Python tool we will be utilizing for this is called pickling, via the built-in pickle module. The process is very similar to pickling things in real life, which is likely where the name came from: it preserves a Python object so it can be pulled back out later. While we will be utilizing it for our model here, you can pickle almost any Python object; for example, you could pickle a dictionary to be loaded at a later date. I like to use this to save data dictionaries that map player names between the sites stats get pulled from and the DraftKings and FanDuel player lists. This way I don't have to be continually checking for name differences and manually replacing them.

First, let's go over how to pickle something in Python. Step 1 is importing the pickle module. No pip install is necessary for this one, as pickle ships with the standard library in both Python 2 and Python 3.
import pickle
Now, to pickle an object, we need to first specify and open a file where we want to save the information, with binary write privileges ("wb") so Python can actually write to and save the file. (Despite the .txt extension used here, the contents pickle writes are binary, not text.)
(You will need to have run the code above from the last session at least until uniqueClusterList is defined for this to work. Alternatively, you can define your own list to save and check with.)
with open(file="mytextfile.txt", mode="wb") as myfile:
    pickle.dump(uniqueClusterList, myfile)
Now that we have opened our file and pickled our unique cluster list into it, we can do some testing to see how it works. Right now, our cluster list is still in memory since we have defined it in this kernel session, as shown below:
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
Now, we will redefine uniqueClusterList to be an empty list, and then check it to confirm the original contents are gone:
uniqueClusterList = []
uniqueClusterList
[]
Now that we have established that our list is empty, we are going to un-pickle our saved file and re-assign the list value from it. To do this, we will first open the previously saved file with binary read privileges ("rb"), then load the pickled data into a variable of our choosing and verify it assigns properly.
with open("mytextfile.txt", mode="rb") as myfile:
    uniqueClusterList = pickle.load(myfile)
Since we verified above that the uniqueClusterList variable is an empty list, calling it after loading our pickled list into that variable should now show the original list. Let's check and see!
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
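As a side note, pickling does not strictly require a file: pickle.dumps serializes an object to a bytes value in memory, and pickle.loads restores it. A minimal sketch of the same round trip we just performed, without touching disk:

```python
import pickle

uniqueClusterList = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

# Serialize the list to a bytes object instead of writing a file.
blob = pickle.dumps(uniqueClusterList)

# Restore the object directly from those bytes.
restored = pickle.loads(blob)

print(restored == uniqueClusterList)  # True: the round trip preserves the list
```

This variant is handy when the pickled bytes are going somewhere other than a local file, such as a database blob column or a network message.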
Now that we have established how to store something by pickling, and how to access data that has been pickled, this exact same process can be utilized for almost anything in Python you wish to store for later. For completeness, we will go ahead and step through saving and re-loading a model to show how it works in that capacity as well.
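For instance, the name-mapping dictionaries mentioned earlier can be saved the exact same way. A small sketch, where the dictionary contents and the filename nameMap.pkl are made-up examples:

```python
import pickle

# Hypothetical mapping from names as a stats source spells them to the
# spellings used on the DraftKings/FanDuel player lists.
nameMap = {
    "Marcus Morris Sr.": "Marcus Morris",
    "Guillermo Hernangomez": "Willy Hernangomez",
}

# Save the dictionary once...
with open("nameMap.pkl", mode="wb") as f:
    pickle.dump(nameMap, f)

# ...and in any later session, load it instead of rebuilding it by hand.
with open("nameMap.pkl", mode="rb") as f:
    loadedMap = pickle.load(f)

print(loadedMap == nameMap)  # True
```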
First we will go ahead and copy some code from above to establish our model:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset['AveDiff'] = dataset['FP'] - dataset['SeasonAve']
clusterdf = pd.read_excel(r"C:\Users\nfwya\OneDrive\Youtube\playerClusterNew2020.xlsx")
clusterDict = pd.Series(clusterdf['Cluster'].values, index=clusterdf['Player']).to_dict()
dataset['Cluster'] = dataset['Player']
dataset['Cluster'] = dataset['Cluster'].replace(clusterDict)
Now, we will simply isolate one cluster for use in this example:
clusterData = dataset[dataset['Cluster'] == 7]
Next up we will go through the process of creating, training, and testing our model:
dfFeatures = clusterData[['Last3', 'Last5', 'Last7']]
dfLabels = clusterData[['FP']]
# flatten the (n, 1) label array to 1-D so sklearn and pandas handle it cleanly
labels = np.array(dfLabels).ravel()
features = np.array(dfFeatures)
train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=10)
reg = DecisionTreeRegressor(random_state=10)
reg.fit(train, trainLabels)
train_predictions = reg.predict(train)
predictions = reg.predict(test)
df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df0['actual'] = testLabels
df0['predicted'] = predictions
df0['error'] = abs(df0['actual'] - df0['predicted'])
print(f"Cluster 7 average error is {df0['error'].mean()}")
Cluster 7 average error is 9.32463556851312
Next, we will pickle our model using the same method as before:
with open(file="mymodelfile.txt", mode="wb") as myfile:
    pickle.dump(reg, myfile)
After this, we will skip the clearing of the reg variable as we did previously, and simply load the model into a new variable name for testing purposes.
with open(file='mymodelfile.txt', mode='rb') as myfile:
    newModel = pickle.load(myfile)
Finally, we will run this new model over our same cluster 7 data; if we performed everything correctly, we should get the same result. Notice that we will not need to train, test, or fit this model, as it comes back prepped and ready to go in the same condition we pickled it in. We will, however, reuse the same train/test arrays from earlier, since they are still split in the same manner, so do not be confused by seeing the words train and test below: they merely represent the 60/40 data splits from before.
train_predictions = newModel.predict(train)
predictions = newModel.predict(test)
df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df0['actual'] = testLabels
df0['predicted'] = predictions
df0['error'] = abs(df0['actual'] - df0['predicted'])
print(f"Cluster 7 average error is {df0['error'].mean()}")
Cluster 7 average error is 9.32463556851312
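A natural extension is to pickle every cluster's model in one pass rather than just cluster 7. The sketch below uses placeholder strings in place of the fitted DecisionTreeRegressor objects so it runs standalone; in the real training loop you would store reg itself (e.g. models[cluster] = reg right after reg.fit(...)) and then dump the whole dictionary once:

```python
import pickle

uniqueClusterList = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

# Placeholder "models": in practice each value would be the fitted
# DecisionTreeRegressor for that cluster (models[cluster] = reg).
models = {cluster: f"model-for-cluster-{cluster}" for cluster in uniqueClusterList}

# One file holds every cluster's model.
with open("clusterModels.pkl", mode="wb") as f:
    pickle.dump(models, f)

# Later, load them all back and grab the model for whichever cluster you need.
with open("clusterModels.pkl", mode="rb") as f:
    loadedModels = pickle.load(f)

print(loadedModels[7])  # model-for-cluster-7
```

Keeping the models in a single keyed dictionary means your daily projection script only opens one file, then looks up the right model per player by cluster.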
And that's all there is to it, folks. You can use pickling across a wide variety of applications, and it can really save you a lot of time.