Pickling our Models for Future Use
Introduction
To start with, we will revisit the code from our previous lesson, where we created a predictive model for each player cluster. I have cleaned up the code a bit and removed the unnecessary bits for clarity's sake here, but the following code is pulled straight from the previous video.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset['AveDiff'] = dataset['FP'] - dataset['SeasonAve']
clusterdf = pd.read_excel(r"C:\Users\nfwya\OneDrive\Youtube\playerClusterNew2020.xlsx")
clusterDict = pd.Series(clusterdf['Cluster'].values, index=clusterdf['Player']).to_dict()
dataset['Cluster'] = dataset['Player']
dataset['Cluster'] = dataset['Cluster'].replace(clusterDict)
dataset.head()
| | Player | Match_Up | Game_Date | FP | Last3 | Last5 | Last7 | SeasonAve | AveDiff | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abdel Nader | OKC @ NOP | 2020-02-13 | 19.3 | 11.433333 | 10.46 | 9.728571 | 11.038235 | 8.261765 | 12 |
| 1 | Brad Wanamaker | BOS vs. LAC | 2020-02-13 | 16.7 | 18.066667 | 17.70 | 21.757143 | 15.162745 | 1.537255 | 0 |
| 2 | Chris Paul | OKC @ NOP | 2020-02-13 | 46.6 | 44.333333 | 40.16 | 40.485714 | 36.553704 | 10.046296 | 7 |
| 3 | Daniel Theis | BOS vs. LAC | 2020-02-13 | 25.0 | 28.333333 | 22.06 | 23.728571 | 23.464583 | 1.535417 | 11 |
| 4 | Danilo Gallinari | OKC @ NOP | 2020-02-13 | 36.9 | 32.500000 | 31.56 | 30.914286 | 30.508511 | 6.391489 | 9 |
dataset.describe()
| | FP | Last3 | Last5 | Last7 | SeasonAve | AveDiff | Cluster |
|---|---|---|---|---|---|---|---|
| count | 14397.000000 | 14397.000000 | 14397.000000 | 14397.000000 | 14397.000000 | 1.439700e+04 | 14397.000000 |
| mean | 23.971800 | 23.933227 | 23.890260 | 23.857880 | 23.971800 | 1.263450e-16 | 6.440231 |
| std | 13.789482 | 11.611917 | 11.107322 | 10.883677 | 9.991757 | 9.503400e+00 | 4.018088 |
| min | -2.000000 | -1.000000 | -1.000000 | -1.000000 | 7.640625 | -3.663000e+01 | 0.000000 |
| 25% | 13.500000 | 15.333333 | 15.720000 | 15.866667 | 16.668519 | -6.486538e+00 | 3.000000 |
| 50% | 22.100000 | 22.166667 | 22.040000 | 21.985714 | 22.050000 | -6.037037e-01 | 6.000000 |
| 75% | 32.500000 | 30.766667 | 30.260000 | 30.028571 | 29.914286 | 5.965116e+00 | 9.000000 |
| max | 88.400000 | 81.600000 | 81.600000 | 81.600000 | 57.629167 | 4.685357e+01 | 12.000000 |
clusterList = clusterdf['Cluster'].tolist()
uniqueClusterList = list(set(clusterList))
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
for cluster in uniqueClusterList:
    clusterData = dataset[dataset['Cluster'] == cluster]
    dfFeatures = clusterData[['Last3', 'Last5', 'Last7']]
    dfLabels = clusterData[['FP']]
    # flatten the (n, 1) label array to 1-D so sklearn and pandas handle it cleanly
    labels = np.array(dfLabels).ravel()
    features = np.array(dfFeatures)
    train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=10)
    reg = DecisionTreeRegressor(random_state=10)
    reg.fit(train, trainLabels)
    train_predictions = reg.predict(train)
    predictions = reg.predict(test)
    df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
    df0['actual'] = testLabels
    df0['predicted'] = predictions
    df0['error'] = abs(df0['actual'] - df0['predicted'])
    print(f"Cluster {cluster} average error is {df0['error'].mean()}")
Cluster 0 average error is 8.155613850996852
Cluster 1 average error is 6.072222222222222
Cluster 2 average error is 6.120833333333333
Cluster 3 average error is 9.728318584070795
Cluster 4 average error is 10.060000000000002
Cluster 5 average error is 10.058653846153845
Cluster 6 average error is 7.592279655400928
Cluster 7 average error is 9.32463556851312
Cluster 9 average error is 8.865070422535211
Cluster 11 average error is 8.125809352517987
Cluster 12 average error is 6.854464285714286
Moving Forward
Now that we have reviewed how to create a model, and how to create many of them rapidly in a batched process (a for loop), we are going to go over how to save a model for later use.

Saving models is beneficial for almost any modeling workflow. If you have a model that performs well and you need to re-run your pipeline to incorporate updated information (think re-running the cluster analysis every month to make sure your clusters stay up to date with playstyle changes, etc.), you don't want to risk losing performance by retraining the model from scratch. Additionally, if you are going to run a predictive model for fantasy point projections every day, you aren't going to want to train and test your model on your full dataset daily just to get an output.
Pickling
The Python tool we will be utilizing for this is called pickling, via the built-in pickle module. The process is very similar to pickling things in real life, which is likely where the name came from: it preserves a Python object so it can be pulled back out later. While we will be utilizing it for our model here, you can pickle almost any Python object; for example, you could pickle a dictionary to be loaded at a later date. I like to use this to save data dictionaries that map player names between the sites stats get pulled from and the DraftKings and FanDuel player lists. This way I don't have to be continually checking for name differences and manually replacing them.

First, let's go over how to pickle something in Python. Step 1 is importing the pickle module. No pip install is necessary for this one, as pickle ships with the standard library in both Python 2 and Python 3.
import pickle
Now, to pickle an object, we need to first specify and open a file where we want to save the information, with binary write privileges ("wb") so Python can actually write to and save the file. (Despite the .txt extension used here, the contents pickle writes are binary, not text.)
(You will need to have run the code above from the last session at least until uniqueClusterList is defined for this to work. Alternatively, you can define your own list to save and check with.)
with open(file="mytextfile.txt", mode="wb") as myfile:
    pickle.dump(uniqueClusterList, myfile)
Now that we have opened our file and pickled our unique cluster list into it, we can do some testing to see how it works. Right now, our cluster list is still in memory since we have defined it in this kernel session, as shown below:
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
Now, we will redefine uniqueClusterList to be an empty list, and then check it to confirm the original contents are gone:
uniqueClusterList = []
uniqueClusterList
[]
Now that we have established that our list is empty, we are going to un-pickle our saved file and re-assign the list value from it. To do this, we will first open the previously saved file with binary read privileges ("rb"), then load the pickled data into a variable of our choosing and verify it assigns properly.
with open("mytextfile.txt", mode="rb") as myfile:
    uniqueClusterList = pickle.load(myfile)
Since we verified above that the uniqueClusterList variable is an empty list, calling it after loading our pickled list into that variable should now show the original list. Let's check and see!
uniqueClusterList
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]
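As a side note, pickling does not strictly require a file: pickle.dumps serializes an object to a bytes value in memory, and pickle.loads restores it. A minimal sketch of the same round trip we just performed, without touching disk:

```python
import pickle

uniqueClusterList = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

# Serialize the list to a bytes object instead of writing a file.
blob = pickle.dumps(uniqueClusterList)

# Restore the object directly from those bytes.
restored = pickle.loads(blob)

print(restored == uniqueClusterList)  # True: the round trip preserves the list
```

This variant is handy when the pickled bytes are going somewhere other than a local file, such as a database blob column or a network message.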
Now that we have established how to store something by pickling, and how to access data that has been pickled, this exact same process can be utilized for almost anything in Python you wish to store for later. For completeness, we will go ahead and step through saving and re-loading a model to show how it works in that capacity as well.
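For instance, the name-mapping dictionaries mentioned earlier can be saved the exact same way. A small sketch, where the dictionary contents and the filename nameMap.pkl are made-up examples:

```python
import pickle

# Hypothetical mapping from names as a stats source spells them to the
# spellings used on the DraftKings/FanDuel player lists.
nameMap = {
    "Marcus Morris Sr.": "Marcus Morris",
    "Guillermo Hernangomez": "Willy Hernangomez",
}

# Save the dictionary once...
with open("nameMap.pkl", mode="wb") as f:
    pickle.dump(nameMap, f)

# ...and in any later session, load it instead of rebuilding it by hand.
with open("nameMap.pkl", mode="rb") as f:
    loadedMap = pickle.load(f)

print(loadedMap == nameMap)  # True
```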
First we will go ahead and copy some code from above to establish our model:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset['AveDiff'] = dataset['FP'] - dataset['SeasonAve']
clusterdf = pd.read_excel(r"C:\Users\nfwya\OneDrive\Youtube\playerClusterNew2020.xlsx")
clusterDict = pd.Series(clusterdf['Cluster'].values, index=clusterdf['Player']).to_dict()
dataset['Cluster'] = dataset['Player']
dataset['Cluster'] = dataset['Cluster'].replace(clusterDict)
Now, we will simply isolate one cluster for use in this example:
clusterData = dataset[dataset['Cluster'] == 7]
Next up we will go through the process of creating, training, and testing our model:
dfFeatures = clusterData[['Last3', 'Last5', 'Last7']]
dfLabels = clusterData[['FP']]
# flatten the (n, 1) label array to 1-D so sklearn and pandas handle it cleanly
labels = np.array(dfLabels).ravel()
features = np.array(dfFeatures)
train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=10)
reg = DecisionTreeRegressor(random_state=10)
reg.fit(train, trainLabels)
train_predictions = reg.predict(train)
predictions = reg.predict(test)
df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df0['actual'] = testLabels
df0['predicted'] = predictions
df0['error'] = abs(df0['actual'] - df0['predicted'])
print(f"Cluster 7 average error is {df0['error'].mean()}")
Cluster 7 average error is 9.32463556851312
Next, we will pickle our model using the same method as before:
with open(file="mymodelfile.txt", mode="wb") as myfile:
    pickle.dump(reg, myfile)
After this, we will skip the clearing of the reg variable as we did previously, and simply load the model into a new variable name for testing purposes.
with open(file='mymodelfile.txt', mode='rb') as myfile:
    newModel = pickle.load(myfile)
Finally, we will run this new model over our same cluster 7 data; if we performed everything correctly, we should get the same result. Notice that we will not need to train, test, or fit this model, as it comes back prepped and ready to go in the same condition we pickled it in. We will, however, reuse the same train/test arrays from earlier, since they are still split in the same manner, so do not be confused by seeing the words train and test below: they merely represent the 60/40 data splits from before.
train_predictions = newModel.predict(train)
predictions = newModel.predict(test)
df0 = pd.DataFrame(test, columns=['Last3', 'Last5', 'Last7'])
df0['actual'] = testLabels
df0['predicted'] = predictions
df0['error'] = abs(df0['actual'] - df0['predicted'])
print(f"Cluster 7 average error is {df0['error'].mean()}")
Cluster 7 average error is 9.32463556851312
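A natural extension is to pickle every cluster's model in one pass rather than just cluster 7. The sketch below uses placeholder strings in place of the fitted DecisionTreeRegressor objects so it runs standalone; in the real training loop you would store reg itself (e.g. models[cluster] = reg right after reg.fit(...)) and then dump the whole dictionary once:

```python
import pickle

uniqueClusterList = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

# Placeholder "models": in practice each value would be the fitted
# DecisionTreeRegressor for that cluster (models[cluster] = reg).
models = {cluster: f"model-for-cluster-{cluster}" for cluster in uniqueClusterList}

# One file holds every cluster's model.
with open("clusterModels.pkl", mode="wb") as f:
    pickle.dump(models, f)

# Later, load them all back and grab the model for whichever cluster you need.
with open("clusterModels.pkl", mode="rb") as f:
    loadedModels = pickle.load(f)

print(loadedModels[7])  # model-for-cluster-7
```

Keeping the models in a single keyed dictionary means your daily projection script only opens one file, then looks up the right model per player by cluster.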
And that's all there is to it, folks. You can use pickling across a wide variety of applications, and it can really save you a lot of time.