Monte Carlo-esque Simulation Introduction
Introduction
I know I've had several folks reach out to me asking about running Monte Carlo simulations for their fantasy prediction algorithms. Today, we are finally going to start down that conceptual path.
What is a Monte Carlo simulation?
As a little background for those of you who are unfamiliar with Monte Carlo simulations, think of one as repeating an experiment a large number of times with a certain degree of randomness involved. If anyone reading this is an expert in the subject matter, I sincerely apologize for any erroneous comparisons I'm about to make, but I think this is the easiest way to understand the core concepts.
For our purposes, we are interested in predicting fantasy points. If we run the exact same data through the exact same machine learning algorithm with the exact same defined randomness, we are going to get essentially the exact same predictions. Repeating that 1,000 times would add astonishingly little value to our overall data flow. However, if we introduce variably defined randomness into the process, we will begin to see differing predictions at each iteration.
This can be beneficial if we are running a large number of experiments, as it allows our predictions to become more balanced over time. For instance, think of flipping a coin. Probabilistically, each side of the coin has a 50% chance of landing face up. If we only flip 3 times and it's tails all 3 times, then our results will be very skewed from what we expected. If we flip the coin 100 times, we will probably end up with a distribution much closer to 50/50. And while these flips are mutually independent from one another, meaning that simply because I flip tails on flips 1, 2, and 3, the odds of landing heads up on flip 4 are no higher than before, over time the laws of probability will generally work themselves out.
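If you want to see this in action, here's a quick sketch you can run yourself (it uses numpy, which we'll be importing anyway for the main walkthrough; the fewFlips and manyFlips names are just for illustration). Every run will give slightly different numbers, which is exactly the point.
import numpy as np
# 0 = tails, 1 = heads; every flip is independent with a 50% chance of heads
fewFlips = np.random.randint(0, 2, size=3)
manyFlips = np.random.randint(0, 2, size=100)
print(f"Share of heads in 3 flips: {fewFlips.mean():.2f}")
print(f"Share of heads in 100 flips: {manyFlips.mean():.2f}")
# the 3-flip share swings wildly between runs, while the 100-flip share usually hovers near 0.50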
Relating back to DFS, if we run one prediction set and it scores a few players significantly lower than expected, that could very well be an event similar to tossing 3 tails in a row to start, and we won't get a balanced prediction until we run the simulation several times over with varying degrees of randomness.
In Practice
We are going to be utilizing the random_state parameter in our random forest algorithm to imitate a Monte Carlo simulation. By letting that parameter vary between runs, we give the algorithm more freedom in how it predicts scores. This lets us run the algorithm multiple times and get a better feel for what a 'high', 'average', and 'low' prediction is for any given event.
Now, if this explanation doesn't help sort out what a Monte Carlo simulation is or how it can be beneficial, feel free to reach out and drop me a comment on the YouTube video; I would be more than happy to discuss it in further detail.
Let's Get Started
Starting off, this is going to be very similar to our previous random forest videos. If you haven't seen those, I strongly suggest checking them out on YouTube, as it will make this go much more easily for you.
We are first going to import our packages, pull our sample data in, and get it ready to feed into the random forest algorithm. The only difference from before is that we also import the random package, which is built into Python, so there is no pip install needed.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import random
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset.head()
dataset.info()
dataset = dataset[['Player', 'Last3', 'Last5', 'Last7','SeasonAve', 'FP']].set_index('Player')
dataset.head()
featureNames = ['Last3', 'Last5', 'Last7', 'SeasonAve']
labelName = ['FP']
dfFeatures = dataset[featureNames]
dfFeatures.head()
dfLabels = dataset[labelName]
dfLabels.head()
labels = np.array(dfLabels).ravel()  # flatten to 1D so scikit-learn doesn't warn about column-vector labels
features = np.array(dfFeatures)
train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=0)
rf = RandomForestRegressor(random_state=0)
rf.fit(train, trainLabels)
Okay, so here is where we will make our first real change. Instead of running our algorithm over just the test dataset to evaluate whether or not it is a good model, we are going to run it over the features dataset as a whole, including both the test and train data.
This will likely make the model seem unrealistically accurate, because we will be running it over the very data it trained itself on. However, when we run a model in real life, we won't be predicting only on held-out test data, so we are just getting used to that future workflow.
rfPred = rf.predict(features)
df = dataset[['FP']].copy()  # copy so we can add columns without a SettingWithCopyWarning
df['predicted'] = rfPred
df['error'] = abs(df['FP']-df['predicted']).astype('int')
df.describe()
Now that we've run the data through our initial random forest algorithm, we are going to take a look at the results.
df.head()
At this point, the model does indeed seem to be exceptionally accurate. However, let's take a look at the highest-scoring instances in the dataset, because when we run the entire dataset through one model, it will likely be best at predicting the most 'average' results.
As we can see below, the error gets significantly higher with these 'non-normal' results. I would wager that the very accurate predictions came from rows included in our training dataset, so the model has already seen them!
df.sort_values('FP', ascending=False).head(15)
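If you want to sanity check that wager, here is one hedged way to do it. Because train_test_split reproduces the same row selection for a given random_state, we can re-run the split on the indexed dataframes with random_state=0 and compare the error on the rows the model trained on against the held-out rows (the trainF/testF/trainL/testL names here are just illustrative).
# same random_state as before, so this reproduces the exact same train/test row selection
trainF, testF, trainL, testL = train_test_split(dfFeatures, dfLabels, test_size=0.4, random_state=0)
# predict each slice separately (.values because the model was fit on plain numpy arrays)
trainError = abs(rf.predict(trainF.values) - trainL['FP']).mean()
testError = abs(rf.predict(testF.values) - testL['FP']).mean()
print(f"Mean absolute error on rows the model trained on: {trainError:.2f}")
print(f"Mean absolute error on held-out rows: {testError:.2f}")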
Now on to the new material. Conceptually, this is not a lot to take in. We are basically just throwing the train/test split and prediction lines of code into a for loop and running 10 predictions. For each iteration, we create a random integer between 0 and 100 to serve as the random_state parameter, which forces a degree of randomness into each prediction. Finally, we create a dataframe outside of the for loop to append our predictions to.
compareDF = dataset.copy()
for i in range(0, 10):
    # pick a new random seed for this iteration
    n = random.randint(0, 100)
    train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=n)
    reg = RandomForestRegressor(random_state=n)
    reg.fit(train, trainLabels)
    # predict over the full feature set and store this run's predictions as a new column
    predictions = reg.predict(features)
    compareDF[f'predict_{n}'] = predictions
Once that has finished running, we will go ahead and take a look at our results.
compareDF.tail()
Now we can see what our predicted fantasy score was for each iteration, and how it compares to the actual score, the last 3, 5, and 7 game averages, and the season average.
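Since the whole point of this exercise is getting a feel for 'high', 'average', and 'low' predictions, here is a quick sketch of one way you could summarize the iterations (the predCols and summaryDF names and column labels are just illustrative, not part of the walkthrough above).
# grab only the per-iteration prediction columns and summarize them row by row
predCols = compareDF.filter(like='predict_')
summaryDF = dataset[['FP']].copy()
summaryDF['lowPred'] = predCols.min(axis=1)
summaryDF['avgPred'] = predCols.mean(axis=1)
summaryDF['highPred'] = predCols.max(axis=1)
summaryDF.head()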
To continue reinforcing this knowledge, we are going to repeat the same process, but instead of appending the predicted scores to our parent dataframe, we are going to calculate and append the difference between predicted and actual for each iteration.
Do take note that we will get a new set of random integers for this round of iterations.
errorDF = dataset.copy()
for i in range(0, 10):
    # fresh random seed for this iteration
    n = random.randint(0, 100)
    train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=n)
    reg = RandomForestRegressor(random_state=n)
    reg.fit(train, trainLabels)
    predictions = reg.predict(features)
    # store the absolute error between this run's predictions and the actual fantasy points
    errorDF[f'error_{n}'] = abs(predictions - errorDF['FP']).astype('int')
errorDF.head()
Now we can see how far off we were on each prediction without having to do any mental math. Do keep in mind, again, that because the training data is included in the overall dataset we are predicting on, our general error will be artificially low.
errorDF.describe()
Using the describe function above, we can get a high-level look at which iterations of the model were 'most accurate'.
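And if you just want a single number per iteration instead of the full describe output, one option is to take the mean of each error column and sort it; consider this a sketch rather than part of the core walkthrough.
# average absolute error for each iteration's column, sorted from most to least accurate
errorDF.filter(like='error_').mean().sort_values()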
Conclusion
I know this was a very basic introduction in which it felt like barely anything new, Python-wise, was introduced, but we will be reviewing different methods for incorporating this in a more streamlined manner in the next two lessons. I hope this has laid a decent foundation for the what, how, and why of Monte Carlo-esque simulations and their relevance to our daily fantasy sports prediction algorithms.