Monte Carlo Part 2
Applying Monte Carlo(esque) Simulation to a Real-Life Scenario
Good day everyone! Today we are going to build upon our last lesson and actually apply what we've learned to predicting a real-life scenario for daily fantasy sports.
We will be using our tried-and-true SampleDataForYT dataset, but we will be slicing the data down to all games up until a specific date, then using a real player list from FanDuel for the next day to simulate actual projections.
There are a few disclaimers, though. As usual, this dataset is completely useless for actual projections; we can all do much better than simply pulling in a few rolling averages for each player. Additionally, this is not entirely a real-life scenario, as it is all for a previous season, and the season average is still computed over the whole dataset, not just the games up until that point.
With that said, let's jump in!
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import random
dataset = pd.read_excel("SampleDataForYT.xlsx")
dataset.head()
Player | Match_Up | Game_Date | FP | Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|---|---|---|---|
0 | Abdel Nader | OKC @ NOP | 2020-02-13 | 19.3 | 11.433333 | 10.46 | 9.728571 | 11.038235 |
1 | Brad Wanamaker | BOS vs. LAC | 2020-02-13 | 16.7 | 18.066667 | 17.70 | 21.757143 | 15.162745 |
2 | Chris Paul | OKC @ NOP | 2020-02-13 | 46.6 | 44.333333 | 40.16 | 40.485714 | 36.553704 |
3 | Daniel Theis | BOS vs. LAC | 2020-02-13 | 25.0 | 28.333333 | 22.06 | 23.728571 | 23.464583 |
4 | Danilo Gallinari | OKC @ NOP | 2020-02-13 | 36.9 | 32.500000 | 31.56 | 30.914286 | 30.508511 |
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14397 entries, 0 to 14396
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Player 14397 non-null object
1 Match_Up 14397 non-null object
2 Game_Date 14397 non-null datetime64[ns]
3 FP 14397 non-null float64
4 Last3 14397 non-null float64
5 Last5 14397 non-null float64
6 Last7 14397 non-null float64
7 SeasonAve 14397 non-null float64
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 899.9+ KB
For the first deviation, we will be pulling in a player list from FanDuel for January 4th, 2020. This is a completely arbitrary date, used only because I have the list handy and it falls within the date range that we have already been working with.
playerList = pd.read_csv("20200104.csv")
playerList.head()
Id | Position | First Name | Nickname | Last Name | FPPG | Played | Salary | Game | Team | Opponent | Injury Indicator | Injury Details | Tier | Unnamed: 14 | Unnamed: 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42349-40199 | SF | Giannis | Giannis Antetokounmpo | Antetokounmpo | 57.957575 | 33 | 12000 | SA@MIL | MIL | SA | NaN | NaN | NaN | NaN | NaN |
1 | 42349-84669 | PG | Luka | Luka Doncic | Doncic | 53.653333 | 30 | 10900 | CHA@DAL | DAL | CHA | NaN | NaN | NaN | NaN | NaN |
2 | 42349-15557 | C | Andre | Andre Drummond | Drummond | 48.348485 | 33 | 9700 | DET@GS | DET | GS | NaN | NaN | NaN | NaN | NaN |
3 | 42349-84671 | PG | Trae | Trae Young | Young | 45.159374 | 32 | 9000 | IND@ATL | ATL | IND | NaN | NaN | NaN | NaN | NaN |
4 | 42349-55062 | C | Nikola | Nikola Jokic | Jokic | 41.567648 | 34 | 8900 | DEN@WAS | DEN | WAS | NaN | NaN | NaN | NaN | NaN |
As we can see, this is a standard player list from FanDuel, with two extra empty columns at the end that pandas brought in and that we don't need. To prepare for what we need to accomplish, let's think about what has to happen here.
First, we have a full season's worth of data here, but we only want to train on games up until the day before this player list. So we will need to slice our data by date.
Next, we want only one data point per player on the player list, so that once we have our trained model, we can feed in a single dataset with the most recent rolling windows for each player.
We will be using the groupby method in pandas and returning the first value for each player after it has been sorted from most recent to least recent.
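The sort-then-keep-first step described above can be sketched on toy data first (player names and values here are hypothetical):

```python
import pandas as pd

# Toy box scores: two players, several game dates (values hypothetical)
toy = pd.DataFrame({
    "Player": ["A", "A", "B", "B"],
    "Game_Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-01", "2020-01-03"]),
    "Last3": [10.0, 12.0, 20.0, 22.0],
})

# Sort most recent first, then keep the first (i.e. latest) row per player
toy = toy.sort_values("Game_Date", ascending=False)
latest = toy.groupby("Player").first()

print(latest["Last3"].to_dict())  # {'A': 12.0, 'B': 22.0}
```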
Let's start by sorting our box scores by game date.
dataset = dataset.sort_values('Game_Date', ascending=False)
Next up, we will be dropping all players with an injury indicator of 'O' (Out), as we know they will not be playing, and then setting the index of the playerList dataframe to the player's name.
playerList = playerList[playerList['Injury Indicator'] != 'O'].set_index('Nickname')
After this, if we think about what we want to pull through from the player list, it's just the player and the team. Since the player name is now the index, we redefine the playerList dataframe to include only the Team column.
playerList = playerList[['Team']]
Then we are going to start removing box scores that happened on January 4th and after. To accomplish this we are going to create a copy of our dataset dataframe, call it dataset2, and set its index to the Game_Date field.
dataset2 = dataset.copy()
dataset2.set_index('Game_Date', inplace=True)
dataset2.head()
Player | Match_Up | FP | Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|---|---|---|
Game_Date | |||||||
2020-02-13 | Abdel Nader | OKC @ NOP | 19.3 | 11.433333 | 10.46 | 9.728571 | 11.038235 |
2020-02-13 | Brad Wanamaker | BOS vs. LAC | 16.7 | 18.066667 | 17.70 | 21.757143 | 15.162745 |
2020-02-13 | Chris Paul | OKC @ NOP | 46.6 | 44.333333 | 40.16 | 40.485714 | 36.553704 |
2020-02-13 | Daniel Theis | BOS vs. LAC | 25.0 | 28.333333 | 22.06 | 23.728571 | 23.464583 |
2020-02-13 | Danilo Gallinari | OKC @ NOP | 36.9 | 32.500000 | 31.56 | 30.914286 | 30.508511 |
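A quick aside on why we used .copy() above: assigning a dataframe to a new name does not copy it, so edits through either name hit the same underlying object. A minimal illustration (toy values):

```python
import pandas as pd

df = pd.DataFrame({"FP": [19.3, 16.7]})

alias = df            # just another name for the same object
snapshot = df.copy()  # an independent copy

df.loc[0, "FP"] = 0.0

print(alias.loc[0, "FP"])     # 0.0  -- the alias sees the change
print(snapshot.loc[0, "FP"])  # 19.3 -- the copy is unaffected
```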
Now we are going to define a new dataframe (good practice when learning, so you don't accidentally overwrite anything and have to start over) and slice out all rows so that we only keep values where the game date, now the index, is less than or equal to January 3, 2020.
dataset3 = dataset2[dataset2.index <= '2020-1-3']
dataset3.head()
Player | Match_Up | FP | Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|---|---|---|
Game_Date | |||||||
2020-01-03 | Aaron Gordon | ORL vs. MIA | 29.1 | 25.733333 | 31.12 | 27.928571 | 29.914286 |
2020-01-03 | Al Horford | PHI @ HOU | 19.6 | 15.833333 | 21.86 | 23.828571 | 30.150000 |
2020-01-03 | Alex Len | ATL @ BOS | 26.6 | 26.666667 | 26.14 | 22.142857 | 20.597436 |
2020-01-03 | Allen Crabbe | ATL @ BOS | 13.2 | 10.600000 | 13.70 | 11.114286 | 10.134286 |
2020-01-03 | Anfernee Simons | POR @ WAS | 4.9 | 10.466667 | 16.32 | 15.100000 | 14.944444 |
Now if we sort this dataframe by player name, we can see that we have every box score for each player prior to our cutoff date. However, we only want the most recent row to use to form our predictions.
dataset3.sort_values('Player').head(15)
Player | Match_Up | FP | Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|---|---|---|
Game_Date | |||||||
2020-01-03 | Aaron Gordon | ORL vs. MIA | 29.1 | 25.733333 | 31.120000 | 27.928571 | 29.914286 |
2019-11-13 | Aaron Gordon | ORL vs. PHI | 48.1 | 31.933333 | 30.600000 | 29.800000 | 29.914286 |
2019-12-15 | Aaron Gordon | ORL @ NOP | 17.8 | 30.266667 | 29.880000 | 33.014286 | 29.914286 |
2019-12-17 | Aaron Gordon | ORL @ UTA | 14.0 | 21.666667 | 27.280000 | 30.128571 | 29.914286 |
2019-11-08 | Aaron Gordon | ORL vs. MEM | 25.8 | 27.666667 | 27.720000 | 27.542857 | 29.914286 |
2019-12-18 | Aaron Gordon | ORL @ DEN | 25.9 | 19.233333 | 26.140000 | 27.042857 | 29.914286 |
2019-11-06 | Aaron Gordon | ORL @ DAL | 31.3 | 31.166667 | 29.700000 | 27.842857 | 29.914286 |
2019-12-20 | Aaron Gordon | ORL @ POR | 33.2 | 24.366667 | 24.820000 | 27.928571 | 29.914286 |
2019-11-05 | Aaron Gordon | ORL @ OKC | 25.9 | 27.166667 | 27.140000 | 26.785714 | 29.914286 |
2019-11-02 | Aaron Gordon | ORL vs. DEN | 36.3 | 30.433333 | 27.540000 | 26.933333 | 29.914286 |
2019-12-23 | Aaron Gordon | ORL vs. CHI | 45.2 | 34.766667 | 27.220000 | 29.871429 | 29.914286 |
2019-11-01 | Aaron Gordon | ORL vs. MIL | 19.3 | 24.500000 | 25.060000 | 25.060000 | 29.914286 |
2019-10-30 | Aaron Gordon | ORL vs. NYK | 35.7 | 27.366667 | 26.500000 | 26.500000 | 29.914286 |
2019-12-27 | Aaron Gordon | ORL vs. PHI | 35.7 | 38.033333 | 30.800000 | 29.285714 | 29.914286 |
2019-10-28 | Aaron Gordon | ORL @ TOR | 18.5 | 23.433333 | 23.433333 | 23.433333 | 29.914286 |
Fortunately, the season average is the same for every record of a given player, since we didn't recalculate it to be the season average up to that point (in real life this would not be the case, but then we also would not have to slice off data like this, so it's fine for this example). That means we can group by player and season average, keeping only the first record, since the data is already sorted from most recent game date to oldest.
dataset3 = dataset3.groupby(["Player","SeasonAve"]).first()
dataset3.sort_values('Player').head(25)
Match_Up | FP | Last3 | Last5 | Last7 | ||
---|---|---|---|---|---|---|
Player | SeasonAve | |||||
Aaron Gordon | 29.914286 | ORL vs. MIA | 29.1 | 25.733333 | 31.12 | 27.928571 |
Aaron Holiday | 19.195918 | IND vs. DEN | 19.0 | 24.966667 | 29.64 | 27.785714 |
Abdel Nader | 11.038235 | OKC vs. DAL | 5.0 | 6.000000 | 8.58 | 9.000000 |
Al Horford | 30.150000 | PHI @ HOU | 19.6 | 15.833333 | 21.86 | 23.828571 |
Al-Farouq Aminu | 16.170588 | ORL vs. TOR | 12.9 | 11.866667 | 13.08 | 16.385714 |
Alec Burks | 28.585714 | GSW @ MIN | 11.1 | 26.633333 | 25.08 | 31.271429 |
Alex Caruso | 14.219149 | LAL vs. DAL | 12.2 | 14.666667 | 12.38 | 13.871429 |
Alex Len | 20.597436 | ATL @ BOS | 26.6 | 26.666667 | 26.14 | 22.142857 |
Allen Crabbe | 10.134286 | ATL @ BOS | 13.2 | 10.600000 | 13.70 | 11.114286 |
Allonzo Trier | 9.850000 | NYK @ WAS | 0.2 | 7.366667 | 9.30 | 9.514286 |
Andre Drummond | 48.035294 | DET @ LAC | 32.4 | 48.866667 | 43.38 | 45.028571 |
Andrew Wiggins | 36.664444 | MIN @ SAC | 37.5 | 38.200000 | 37.12 | 38.557143 |
Anfernee Simons | 14.944444 | POR @ WAS | 4.9 | 10.466667 | 16.32 | 15.100000 |
Anthony Davis | 52.032609 | LAL vs. NOP | 72.1 | 53.033333 | 49.22 | 52.328571 |
Anthony Tolliver | 10.405714 | POR @ WAS | 3.4 | 9.200000 | 9.32 | 9.557143 |
Aron Baynes | 22.363636 | PHX vs. NYK | 39.4 | 28.066667 | 25.54 | 24.200000 |
Austin Rivers | 15.638000 | HOU vs. PHI | 6.0 | 9.533333 | 12.48 | 12.400000 |
Avery Bradley | 14.587179 | LAL vs. NOP | 12.9 | 18.133333 | 13.54 | 11.357143 |
Bam Adebayo | 40.303704 | MIA @ ORL | 34.5 | 33.533333 | 33.64 | 37.128571 |
Ben McLemore | 16.598077 | HOU vs. PHI | 3.2 | 12.500000 | 11.40 | 12.371429 |
Ben Simmons | 43.598113 | PHI @ HOU | 79.1 | 53.833333 | 52.44 | 51.628571 |
Bismack Biyombo | 18.823256 | CHA @ CLE | 15.2 | 17.833333 | 22.72 | 19.685714 |
Blake Griffin | 26.127778 | DET @ SAS | 17.4 | 25.033333 | 21.58 | 19.900000 |
Bobby Portis | 19.109091 | NYK @ PHX | 26.8 | 24.033333 | 20.10 | 24.342857 |
Bogdan Bogdanovic | 24.655814 | SAC vs. MEM | 19.4 | 18.666667 | 24.94 | 25.457143 |
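A note on the double grouping key: grouping on Player alone would also work here, since first() carries non-grouped columns (including SeasonAve) through as regular columns; adding SeasonAve to the key simply moves it into the index. A sketch on toy values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "SeasonAve": [29.9, 29.9, 19.2],  # constant per player, as in our data
    "FP": [30.0, 25.0, 18.0],
})

# Grouping on Player alone keeps SeasonAve as a regular column
by_player = toy.groupby("Player").first()
print(by_player.loc["A", "SeasonAve"])  # 29.9
print(by_player.loc["A", "FP"])         # 30.0
```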
Now we can see that we no longer have any duplicates for each player, so we will go ahead and reset the index, and then set the index back to the player's name.
dataset3 = dataset3.reset_index().set_index('Player')
dataset3.head(25)
SeasonAve | Match_Up | FP | Last3 | Last5 | Last7 | |
---|---|---|---|---|---|---|
Player | ||||||
Aaron Gordon | 29.914286 | ORL vs. MIA | 29.1 | 25.733333 | 31.12 | 27.928571 |
Aaron Holiday | 19.195918 | IND vs. DEN | 19.0 | 24.966667 | 29.64 | 27.785714 |
Abdel Nader | 11.038235 | OKC vs. DAL | 5.0 | 6.000000 | 8.58 | 9.000000 |
Al Horford | 30.150000 | PHI @ HOU | 19.6 | 15.833333 | 21.86 | 23.828571 |
Al-Farouq Aminu | 16.170588 | ORL vs. TOR | 12.9 | 11.866667 | 13.08 | 16.385714 |
Alec Burks | 28.585714 | GSW @ MIN | 11.1 | 26.633333 | 25.08 | 31.271429 |
Alex Caruso | 14.219149 | LAL vs. DAL | 12.2 | 14.666667 | 12.38 | 13.871429 |
Alex Len | 20.597436 | ATL @ BOS | 26.6 | 26.666667 | 26.14 | 22.142857 |
Allen Crabbe | 10.134286 | ATL @ BOS | 13.2 | 10.600000 | 13.70 | 11.114286 |
Allonzo Trier | 9.850000 | NYK @ WAS | 0.2 | 7.366667 | 9.30 | 9.514286 |
Andre Drummond | 48.035294 | DET @ LAC | 32.4 | 48.866667 | 43.38 | 45.028571 |
Andrew Wiggins | 36.664444 | MIN @ SAC | 37.5 | 38.200000 | 37.12 | 38.557143 |
Anfernee Simons | 14.944444 | POR @ WAS | 4.9 | 10.466667 | 16.32 | 15.100000 |
Anthony Davis | 52.032609 | LAL vs. NOP | 72.1 | 53.033333 | 49.22 | 52.328571 |
Anthony Tolliver | 10.405714 | POR @ WAS | 3.4 | 9.200000 | 9.32 | 9.557143 |
Aron Baynes | 22.363636 | PHX vs. NYK | 39.4 | 28.066667 | 25.54 | 24.200000 |
Austin Rivers | 15.638000 | HOU vs. PHI | 6.0 | 9.533333 | 12.48 | 12.400000 |
Avery Bradley | 14.587179 | LAL vs. NOP | 12.9 | 18.133333 | 13.54 | 11.357143 |
Bam Adebayo | 40.303704 | MIA @ ORL | 34.5 | 33.533333 | 33.64 | 37.128571 |
Ben McLemore | 16.598077 | HOU vs. PHI | 3.2 | 12.500000 | 11.40 | 12.371429 |
Ben Simmons | 43.598113 | PHI @ HOU | 79.1 | 53.833333 | 52.44 | 51.628571 |
Bismack Biyombo | 18.823256 | CHA @ CLE | 15.2 | 17.833333 | 22.72 | 19.685714 |
Blake Griffin | 26.127778 | DET @ SAS | 17.4 | 25.033333 | 21.58 | 19.900000 |
Bobby Portis | 19.109091 | NYK @ PHX | 26.8 | 24.033333 | 20.10 | 24.342857 |
Bogdan Bogdanovic | 24.655814 | SAC vs. MEM | 19.4 | 18.666667 | 24.94 | 25.457143 |
Now we are going to join our two dataframes together, attaching the most recent predictive data to our player list data. We will be using a left join because our player list is the left dataframe in the way we have it set up. You could use a right join instead by writing it as dataset3.join(playerList, how='right').
We don't need any further parameters because the player name is the index of both dataframes; if that were not the case, we would need to specify which columns to join on.
This moves the dataset3 columns into the playerList dataframe for every record with a matching player name. If a player in the playerList dataframe has no match in dataset3, those columns are simply left blank (NaN). If you only want to include records where the player names match, use an inner join rather than a left/right join.
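The left-versus-inner distinction can be seen on toy data (the frames and values below are hypothetical stand-ins for playerList and dataset3):

```python
import pandas as pd

# Toy stand-ins: left has two players, right has data for only one of them
left = pd.DataFrame({"Team": ["MIL", "DAL"]},
                    index=["Giannis Antetokounmpo", "Luka Doncic"])
right = pd.DataFrame({"Last3": [50.2]},
                     index=["Giannis Antetokounmpo"])

joined_left = left.join(right, how="left")    # keeps both players, NaN where no match
joined_inner = left.join(right, how="inner")  # keeps only the matching player

print(len(joined_left), len(joined_inner))  # 2 1
```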
playerList2 = playerList.join(dataset3, how='left')
playerList2.head(30)
Team | SeasonAve | Match_Up | FP | Last3 | Last5 | Last7 | |
---|---|---|---|---|---|---|---|
Nickname | |||||||
Giannis Antetokounmpo | MIL | 57.629167 | MIL vs. MIN | 60.4 | 50.233333 | 53.24 | 57.100000 |
Luka Doncic | DAL | 53.890698 | DAL vs. BKN | 56.1 | 49.633333 | 53.56 | 54.014286 |
Andre Drummond | DET | 48.035294 | DET @ LAC | 32.4 | 48.866667 | 43.38 | 45.028571 |
Trae Young | ATL | 47.630000 | ATL @ BOS | 43.0 | 38.233333 | 45.46 | 47.257143 |
Nikola Jokic | DEN | 45.792727 | DEN @ IND | 28.4 | 30.200000 | 38.52 | 40.385714 |
Rudy Gobert | UTA | 41.232692 | UTA @ CHI | 38.4 | 36.766667 | 41.62 | 43.214286 |
Bradley Beal | WAS | 44.806522 | WAS vs. ORL | 37.3 | 36.000000 | 44.60 | 46.071429 |
Brandon Ingram | NO | 41.010638 | NOP @ LAL | 33.1 | 41.266667 | 43.38 | 44.171429 |
LaMarcus Aldridge | SA | 37.226000 | SAS vs. OKC | 44.4 | 46.233333 | 47.98 | 44.814286 |
Jrue Holiday | NO | 39.428261 | NOP @ LAL | 27.6 | 36.200000 | 39.22 | 37.685714 |
Domantas Sabonis | IND | 41.486538 | IND vs. DEN | 49.3 | 44.666667 | 39.86 | 42.314286 |
Nikola Vucevic | ORL | 41.504545 | ORL vs. MIA | 44.7 | 41.933333 | 43.46 | 45.371429 |
Jayson Tatum | BOS | 39.728000 | BOS vs. ATL | 27.3 | 29.600000 | 33.32 | 39.357143 |
Zach LaVine | CHI | 39.712727 | CHI vs. UTA | 40.3 | 35.666667 | 38.22 | 40.357143 |
Shai Gilgeous-Alexander | OKC | 35.169091 | OKC @ SAS | 48.9 | 47.600000 | 43.04 | 43.800000 |
DeMar DeRozan | SA | 38.976923 | SAS vs. OKC | 36.1 | 41.700000 | 42.14 | 37.942857 |
John Collins | ATL | 39.990000 | ATL @ BOS | 26.1 | 36.466667 | 39.72 | 39.985714 |
Donovan Mitchell | UTA | 37.175472 | UTA @ CHI | 27.3 | 36.766667 | 37.90 | 39.042857 |
Devonte' Graham | CHA | 34.744444 | CHA @ CLE | 33.9 | 34.666667 | 38.02 | 35.485714 |
Jaylen Brown | BOS | 32.925000 | BOS vs. ATL | 39.0 | 38.100000 | 37.36 | 39.300000 |
De'Aaron Fox | SAC | 38.942857 | SAC vs. MEM | 65.3 | 43.000000 | 43.48 | 40.542857 |
Khris Middleton | MIL | 35.570213 | MIL vs. MIN | 31.6 | 37.100000 | 40.00 | 39.000000 |
Chris Paul | OKC | 36.553704 | OKC @ SAS | 38.1 | 39.800000 | 41.50 | 37.685714 |
Gordon Hayward | BOS | 33.305405 | BOS vs. ATL | 29.7 | 34.466667 | 33.28 | 30.771429 |
Kevin Love | CLE | 34.191304 | CLE vs. CHA | 35.6 | 36.400000 | 36.68 | 35.728571 |
Richaun Holmes | SAC | 30.900000 | SAC vs. MEM | 31.4 | 35.200000 | 36.66 | 37.114286 |
Will Barton | DEN | 31.827083 | DEN @ IND | 35.5 | 35.633333 | 35.24 | 35.571429 |
Draymond Green | GS | 29.602439 | GSW @ MIN | 16.6 | 27.233333 | 30.56 | 31.128571 |
Buddy Hield | SAC | 32.250000 | SAC vs. MEM | 39.9 | 38.533333 | 34.18 | 29.314286 |
Jamal Murray | DEN | 33.845455 | DEN @ IND | 41.5 | 32.600000 | 28.72 | 29.400000 |
Once our join has succeeded, we will keep only the columns we will feed into our random forest, and then take a peek at the data, sorted, to see if we have any blank rows from mismatched names.
playerList2 = playerList2[['Last3', 'Last5', 'Last7', 'SeasonAve']]
playerList2.sort_values('Last7')
Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|
Nickname | ||||
Frank Jackson | 2.566667 | 4.32 | 3.428571 | 10.888571 |
Jordan Poole | 4.200000 | 6.52 | 5.014286 | 14.668750 |
Jacob Evans | 8.600000 | 8.08 | 6.542857 | 10.250000 |
Kenrich Williams | 4.766667 | 5.36 | 6.900000 | 15.900000 |
Nicolo Melli | 3.800000 | 4.66 | 7.414286 | 14.125641 |
... | ... | ... | ... | ... |
Frank Mason | NaN | NaN | NaN | NaN |
Wenyen Gabriel | NaN | NaN | NaN | NaN |
Mike Muscala | NaN | NaN | NaN | NaN |
Drew Eubanks | NaN | NaN | NaN | NaN |
Cristiano Felicio | NaN | NaN | NaN | NaN |
264 rows × 4 columns
We do indeed have some blank records. These cannot be pushed through the random forest algorithm as they stand. We have a couple of options: we can replace all the NaNs with a numerical value, such as 0 or the average of the column, or simply drop the records containing a NaN. For some datasets it may make sense to fill in a numerical value; in this instance, however, we are just going to drop any row with a NaN to prepare for the random forest.
playerList2.dropna(inplace=True)
playerList2.sort_values('Last7')
Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|
Nickname | ||||
Frank Jackson | 2.566667 | 4.32 | 3.428571 | 10.888571 |
Jordan Poole | 4.200000 | 6.52 | 5.014286 | 14.668750 |
Jacob Evans | 8.600000 | 8.08 | 6.542857 | 10.250000 |
Kenrich Williams | 4.766667 | 5.36 | 6.900000 | 15.900000 |
Nicolo Melli | 3.800000 | 4.66 | 7.414286 | 14.125641 |
... | ... | ... | ... | ... |
Nikola Vucevic | 41.933333 | 43.46 | 45.371429 | 41.504545 |
Bradley Beal | 36.000000 | 44.60 | 46.071429 | 44.806522 |
Trae Young | 38.233333 | 45.46 | 47.257143 | 47.630000 |
Luka Doncic | 49.633333 | 53.56 | 54.014286 | 53.890698 |
Giannis Antetokounmpo | 50.233333 | 53.24 | 57.100000 | 57.629167 |
174 rows × 4 columns
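For reference, the fill-instead-of-drop options mentioned above would look like this on toy data (values hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Last3": [10.0, None, 30.0]})

filled_zero = toy["Last3"].fillna(0)                    # treat missing as 0
filled_mean = toy["Last3"].fillna(toy["Last3"].mean())  # column average (20.0)
dropped = toy.dropna()                                  # what we actually did above

print(filled_mean.tolist())  # [10.0, 20.0, 30.0]
print(len(dropped))          # 2
```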
Alright, it may start to get a little confusing now, so let's quickly recap the process we are about to begin.
We need the full dataset (minus games on or after January 4th) to train our model, but we will be calculating predictions on the dataset we just finished manipulating. So we will create our features and labels NumPy arrays from our initial dataset and break them into training and testing sets, but once we get to the prediction stage we will call the dataframe we just created.
So, on to creating our dataframe to feed our train and test datasets.
dataset4 = dataset[['Player', 'Last3', 'Last5', 'Last7','SeasonAve', 'FP']].set_index('Player')
#dataset.head()
featureNames = ['Last3', 'Last5', 'Last7', 'SeasonAve']
labelName = ['FP']
dfFeatures = dataset4[featureNames]
#dfFeatures.head()
dfLabels = dataset4[labelName]
#dfLabels.head()
labels = np.array(dfLabels)
features = np.array(dfFeatures)
Next up, we will be running a similar for loop to our last lesson, but we will be using our new playerList dataset to drive the predictions rather than our features dataset. The following is how you would set up your for loop if you wanted to train and test your model every. single. time. you ran your code. This is not the most efficient approach, but it is certainly an option. Generally I like to retrain my models every couple of weeks, or when a significant change in the data has occurred, like at the beginning of the season when every new data point significantly alters the overall data spread.
## WHEN YOU WANT TO RETRAIN YOUR MODELS EVERY SINGLE CONTEST DAY
compareDF = playerList2.copy()
for i in range(0, 10):
    n = random.randint(0, 100)
    train, test, trainLabels, testLabels = train_test_split(features, labels, test_size=0.4, random_state=n)
    reg = RandomForestRegressor(random_state=n)
    # ravel() flattens the (n, 1) labels array, silencing sklearn's DataConversionWarning
    reg.fit(train, trainLabels.ravel())
    predictions = reg.predict(playerList2)
    compareDF[f'predict_{n}'] = predictions
compareDF.head(15)
Last3 | Last5 | Last7 | SeasonAve | predict_68 | predict_96 | predict_89 | predict_29 | predict_98 | predict_92 | predict_18 | predict_23 | predict_90 | predict_86 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Nickname | ||||||||||||||
Giannis Antetokounmpo | 50.233333 | 53.24 | 57.100000 | 57.629167 | 48.144 | 51.306 | 57.146 | 52.281 | 57.86900 | 57.228 | 56.066 | 51.892 | 57.142 | 55.500 |
Luka Doncic | 49.633333 | 53.56 | 54.014286 | 53.890698 | 49.734 | 51.617 | 45.863 | 53.768 | 47.28100 | 55.586 | 51.688 | 53.818 | 53.765 | 54.411 |
Andre Drummond | 48.866667 | 43.38 | 45.028571 | 48.035294 | 37.932 | 41.594 | 39.394 | 38.562 | 47.09275 | 44.612 | 42.832 | 37.646 | 37.518 | 45.959 |
Trae Young | 38.233333 | 45.46 | 47.257143 | 47.630000 | 41.868 | 40.024 | 41.513 | 35.700 | 39.10100 | 42.033 | 42.509 | 42.135 | 36.573 | 41.767 |
Nikola Jokic | 30.200000 | 38.52 | 40.385714 | 45.792727 | 29.098 | 28.041 | 30.592 | 31.797 | 29.65100 | 29.863 | 28.514 | 29.532 | 28.612 | 28.342 |
Rudy Gobert | 36.766667 | 41.62 | 43.214286 | 41.232692 | 39.017 | 37.541 | 36.724 | 35.832 | 37.04000 | 37.445 | 37.841 | 36.168 | 35.652 | 35.756 |
Bradley Beal | 36.000000 | 44.60 | 46.071429 | 44.806522 | 38.560 | 39.879 | 35.365 | 38.174 | 40.55200 | 37.772 | 40.619 | 38.923 | 42.808 | 39.374 |
Brandon Ingram | 41.266667 | 43.38 | 44.171429 | 41.010638 | 33.768 | 34.803 | 34.729 | 36.160 | 34.60700 | 35.315 | 36.080 | 34.915 | 38.133 | 33.720 |
LaMarcus Aldridge | 46.233333 | 47.98 | 44.814286 | 37.226000 | 41.126 | 44.250 | 42.822 | 43.620 | 45.51000 | 45.242 | 45.381 | 43.653 | 46.194 | 44.115 |
Jrue Holiday | 36.200000 | 39.22 | 37.685714 | 39.428261 | 35.170 | 32.030 | 34.808 | 31.468 | 29.57300 | 30.936 | 35.881 | 28.728 | 36.331 | 37.532 |
Domantas Sabonis | 44.666667 | 39.86 | 42.314286 | 41.486538 | 45.164 | 48.549 | 46.507 | 48.401 | 48.42900 | 48.558 | 48.339 | 47.454 | 48.641 | 47.586 |
Nikola Vucevic | 41.933333 | 43.46 | 45.371429 | 41.504545 | 42.217 | 36.073 | 37.675 | 41.312 | 41.48200 | 42.468 | 42.038 | 38.837 | 41.434 | 44.360 |
Jayson Tatum | 29.600000 | 33.32 | 39.357143 | 39.728000 | 27.192 | 28.217 | 31.819 | 32.099 | 27.76600 | 27.866 | 27.914 | 30.589 | 28.662 | 28.145 |
Zach LaVine | 35.666667 | 38.22 | 40.357143 | 39.712727 | 40.118 | 41.449 | 42.567 | 42.143 | 38.98800 | 39.311 | 41.457 | 42.520 | 45.305 | 38.907 |
Shai Gilgeous-Alexander | 47.600000 | 43.04 | 43.800000 | 35.169091 | 49.146 | 46.772 | 48.614 | 47.751 | 45.73300 | 50.535 | 49.171 | 49.988 | 49.694 | 47.597 |
Once our model iterations have finished running, we can take a look at the results. You'll likely notice that these predictions are not nearly as similar to each other as our models' were last time. This is because none of these records were included in the training dataset, and as you can see, that makes a pretty big difference. When the records you predict on were not part of the training set, there will be quite a bit more variability in the predictions, and that's even with only this small sample of data.
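One quick way to quantify that variability is the per-player spread across the prediction columns. A sketch on toy values (the real compareDF has ten predict_* columns; the two below are hypothetical):

```python
import pandas as pd

# Toy stand-in for compareDF with two prediction columns
toy = pd.DataFrame({
    "predict_68": [48.1, 49.7],
    "predict_96": [51.3, 51.6],
}, index=["Giannis Antetokounmpo", "Luka Doncic"])

pred_cols = [c for c in toy.columns if c.startswith("predict_")]
spread = toy[pred_cols].max(axis=1) - toy[pred_cols].min(axis=1)
print(spread.round(1).tolist())  # [3.2, 1.9]
```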
dataset.sort_values('Game_Date').head()
Player | Match_Up | Game_Date | FP | Last3 | Last5 | Last7 | SeasonAve | |
---|---|---|---|---|---|---|---|---|
14396 | Troy Daniels | LAL @ LAC | 2019-10-22 | 6.5 | 6.5 | 6.5 | 6.5 | 7.640625 |
14374 | Jrue Holiday | NOP @ TOR | 2019-10-22 | 27.8 | 27.8 | 27.8 | 27.8 | 39.428261 |
14373 | Josh Hart | NOP @ TOR | 2019-10-22 | 30.5 | 30.5 | 30.5 | 30.5 | 23.997917 |
14372 | Jahlil Okafor | NOP @ TOR | 2019-10-22 | 12.4 | 12.4 | 12.4 | 12.4 | 17.424000 |
14371 | JaVale McGee | LAL @ LAC | 2019-10-22 | 11.4 | 11.4 | 11.4 | 11.4 | 20.578431 |
Now, just for comparison, to see how accurate it really is or isn't, we are going to create a new dataframe copy of our original dataset and slice out only the games these models predicted for: the games occurring on January 4th, 2020.
datasetNew = dataset.copy()
datasetNew.set_index('Game_Date', inplace=True)
datasetNew = datasetNew[datasetNew.index == '2020-1-4']
datasetNew = datasetNew[['FP', 'Player']].set_index('Player')
datasetNew.head()
FP | |
---|---|
Player | |
Aaron Gordon | 31.4 |
Aaron Holiday | 32.8 |
Alec Burks | 43.9 |
Alex Len | 39.3 |
Allen Crabbe | 6.1 |
Now that we have the actual fantasy points scored on that day, let's join our dataset with all of our predictions. Notice that we are using a right join here, because we only want to look at records we actually have a prediction for in our compareDF.
datasetNew = datasetNew.join(compareDF, how='right').drop(['Last3', 'Last5', 'Last7', 'SeasonAve'], axis=1)
datasetNew.head(25)
FP | predict_68 | predict_96 | predict_89 | predict_29 | predict_98 | predict_92 | predict_18 | predict_23 | predict_90 | predict_86 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Nickname | |||||||||||
Giannis Antetokounmpo | 46.1 | 48.144 | 51.306 | 57.146 | 52.281 | 57.86900 | 57.228 | 56.066 | 51.892 | 57.142 | 55.500 |
Luka Doncic | 69.4 | 49.734 | 51.617 | 45.863 | 53.768 | 47.28100 | 55.586 | 51.688 | 53.818 | 53.765 | 54.411 |
Andre Drummond | 51.1 | 37.932 | 41.594 | 39.394 | 38.562 | 47.09275 | 44.612 | 42.832 | 37.646 | 37.518 | 45.959 |
Trae Young | 59.8 | 41.868 | 40.024 | 41.513 | 35.700 | 39.10100 | 42.033 | 42.509 | 42.135 | 36.573 | 41.767 |
Nikola Jokic | 33.0 | 29.098 | 28.041 | 30.592 | 31.797 | 29.65100 | 29.863 | 28.514 | 29.532 | 28.612 | 28.342 |
Rudy Gobert | 37.4 | 39.017 | 37.541 | 36.724 | 35.832 | 37.04000 | 37.445 | 37.841 | 36.168 | 35.652 | 35.756 |
Bradley Beal | NaN | 38.560 | 39.879 | 35.365 | 38.174 | 40.55200 | 37.772 | 40.619 | 38.923 | 42.808 | 39.374 |
Brandon Ingram | 31.5 | 33.768 | 34.803 | 34.729 | 36.160 | 34.60700 | 35.315 | 36.080 | 34.915 | 38.133 | 33.720 |
LaMarcus Aldridge | 31.0 | 41.126 | 44.250 | 42.822 | 43.620 | 45.51000 | 45.242 | 45.381 | 43.653 | 46.194 | 44.115 |
Jrue Holiday | 35.5 | 35.170 | 32.030 | 34.808 | 31.468 | 29.57300 | 30.936 | 35.881 | 28.728 | 36.331 | 37.532 |
Domantas Sabonis | 45.2 | 45.164 | 48.549 | 46.507 | 48.401 | 48.42900 | 48.558 | 48.339 | 47.454 | 48.641 | 47.586 |
Nikola Vucevic | 45.6 | 42.217 | 36.073 | 37.675 | 41.312 | 41.48200 | 42.468 | 42.038 | 38.837 | 41.434 | 44.360 |
Jayson Tatum | 48.4 | 27.192 | 28.217 | 31.819 | 32.099 | 27.76600 | 27.866 | 27.914 | 30.589 | 28.662 | 28.145 |
Zach LaVine | 53.9 | 40.118 | 41.449 | 42.567 | 42.143 | 38.98800 | 39.311 | 41.457 | 42.520 | 45.305 | 38.907 |
Shai Gilgeous-Alexander | 32.4 | 49.146 | 46.772 | 48.614 | 47.751 | 45.73300 | 50.535 | 49.171 | 49.988 | 49.694 | 47.597 |
DeMar DeRozan | 40.3 | 43.944 | 44.242 | 41.317 | 40.031 | 37.95900 | 39.348 | 38.037 | 39.341 | 39.574 | 38.543 |
John Collins | NaN | 39.035 | 38.781 | 41.282 | 39.138 | 30.26600 | 30.218 | 30.925 | 38.464 | 38.677 | 40.370 |
Donovan Mitchell | 45.6 | 31.106 | 30.279 | 31.191 | 30.979 | 34.50400 | 37.221 | 34.993 | 30.174 | 34.680 | 30.527 |
Devonte' Graham | 51.9 | 35.197 | 34.223 | 36.450 | 34.733 | 34.64500 | 34.877 | 35.350 | 34.472 | 33.593 | 37.336 |
Jaylen Brown | 27.9 | 37.393 | 35.563 | 35.130 | 35.298 | 38.14000 | 36.627 | 33.855 | 34.418 | 34.936 | 35.874 |
De'Aaron Fox | 28.1 | 57.422 | 51.091 | 56.294 | 57.122 | 42.39900 | 57.492 | 56.329 | 57.131 | 47.274 | 41.196 |
Khris Middleton | 29.5 | 37.125 | 33.836 | 37.816 | 33.685 | 33.33900 | 33.149 | 33.111 | 33.251 | 32.625 | 35.212 |
Chris Paul | 22.8 | 38.434 | 41.518 | 37.202 | 38.693 | 38.52500 | 38.426 | 38.709 | 37.536 | 37.914 | 35.999 |
Gordon Hayward | 34.5 | 31.466 | 30.770 | 34.276 | 36.297 | 31.07300 | 32.245 | 30.507 | 35.598 | 35.456 | 36.008 |
Kevin Love | 14.1 | 35.106 | 34.257 | 35.201 | 32.794 | 35.41200 | 34.797 | 35.469 | 36.260 | 35.687 | 34.653 |
Now, I won't bother re-running the entire thing to calculate the error for the models instead of the scores, because in this realistic-ish scenario you won't have anything to compare against. However, if you look closely, there are a few records without an actual FP score; these are players who did not have an Out injury designation when the player list was pulled, but ended up not playing.
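That said, since we did join the January 4th scores in above, a per-model mean absolute error could be sketched like this (toy values, hypothetical column name, skipping the rows with no actual FP):

```python
import pandas as pd

# Toy stand-in for datasetNew: actual FP plus one prediction column
toy = pd.DataFrame({
    "FP": [46.1, 69.4, None],       # None = player didn't end up playing
    "predict_68": [48.1, 49.7, 38.6],
})

scored = toy.dropna(subset=["FP"])  # only players who actually played
mae = (scored["FP"] - scored["predict_68"]).abs().mean()
print(round(mae, 2))  # 10.85
```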
From here, you can run any type of analysis you would like, but in the next lesson we will go over how to run the same setup without having to retrain your models every single time.