#HangItUpMike?

About a month ago, professional golfer Dawie Van Der Walt took to Twitter to rip 2003 Masters champion Mike Weir for taking up valuable (and limited) space in PGA Tour fields. Because of his victory in 2003, as well as his injuries over the past 10 years, Mike is still exempt, or sponsored, into many tournaments over the course of the season. Van Der Walt believes Mike is wasting these spots, taking them away from younger, more talented players who could really use the opportunity to play on golf's biggest stage.

Well, I am here to take a look at Mike's game, particularly his driving (or lack thereof), an area of the game in which many think he may have developed the yips, or, as Ernie Els calls them… the “heebie-jeebies.”

Below I made a nice plot of all the drives hit on tour this year (including the wrap-around season). The x-axis is relative distance from the center of the fairway. For example, a value of negative 5 means you were 5 yards left of the middle of the fairway. I separated Mike’s drives from the rest of the tour, as well as Rory McIlroy’s, for a reference.

Data provided by PGA Tour – ShotLink.
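If you want to build a plot like this yourself, here is a rough sketch of how I would go about it with pandas and matplotlib. The file name and column names ("drives_2016.csv", "player", "dist_from_center", "drive_distance") are made up for illustration; a real ShotLink export will look different.

[python]
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical extract: one row per drive;
# 'dist_from_center' is yards from the middle of the fairway (negative = left)
drives = pd.read_csv("drives_2016.csv")

tour = drives[~drives["player"].isin(["Mike Weir", "Rory McIlroy"])]
weir = drives[drives["player"] == "Mike Weir"]
rory = drives[drives["player"] == "Rory McIlroy"]

plt.scatter(tour["dist_from_center"], tour["drive_distance"], color="grey", alpha=0.2, label="Tour")
plt.scatter(rory["dist_from_center"], rory["drive_distance"], color="blue", alpha=0.6, label="Rory McIlroy")
plt.scatter(weir["dist_from_center"], weir["drive_distance"], color="red", alpha=0.6, label="Mike Weir")
plt.xlabel("Yards from center of fairway (negative = left)")
plt.ylabel("Driving distance (yards)")
plt.legend()
plt.show()
[/python]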

In all seriousness, I was surprised at how good this looked for Weir. However, when your eyes do make it over to the vertical axis, things get a little frightening. And considering how short Mike is hitting it, he does not seem to be making up for it in the accuracy department.

Because the above graph is so dense, it can be misleading as to where the averages actually lie, so I put together some summary statistics before we give Mike the final verdict.

(Summary statistics: average driving distance and accuracy for Weir, McIlroy, and the Tour as a whole.)
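Roughly speaking, those numbers come from a groupby on the same hypothetical DataFrame used for the plot above:

[python]
# average drive distance and average miss (absolute yards from center) per player
summary = drives.groupby("player").agg(
    {"drive_distance": "mean", "dist_from_center": lambda x: x.abs().mean()}
)
print(summary.loc[["Mike Weir", "Rory McIlroy"]])
# tour-wide averages for comparison
print(drives["drive_distance"].mean(), drives["dist_from_center"].abs().mean())
[/python]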

I think this says it all. He's hitting it significantly more crooked than both Rory and the rest of the Tour, and he's hitting it well back of even the Tour average.

…Being Canadian, this is tough for me to say but…#HangItUpMike

Random Forests Using Python – Predicting Titanic Survivors

The following is a simple tutorial for using random forests in Python to predict whether or not a person survived the sinking of the Titanic. The data for this tutorial is taken from Kaggle, which hosts various data science competitions.


RANDOM FORESTS:
For a good description of what Random Forests are, I suggest going to the Wikipedia page. Basically, from my understanding, a Random Forest constructs many decision trees at training time and outputs the class (in this case 0 or 1, corresponding to whether or not the person survived) that the individual trees predict most frequently. So clearly, in order to understand Random Forests, we need to go a level deeper and look at decision trees.

Decision Trees:
Like most data mining techniques, the goal is to predict the value of a target variable (Survival) based on several input variables. As shown below, each interior node of the tree corresponds to one of the input variables, and the edges coming off a node represent the possible values of that variable. To make a prediction, you simply take an observation and move down the tree until you reach a leaf, which gives you that tree's prediction (a toy example follows the figure below).

source: wikipedia
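To make that "walk down the tree" concrete, here is a toy, hand-written tree for the Titanic problem. The splits and values are completely made up for illustration, not learned from the data:

[python]
# Toy hand-written decision tree: the splits and thresholds here are invented,
# purely to show how a prediction walks from the root down to a leaf.
def toy_tree_predict(passenger):
    if passenger["Sex"] == "female":      # first split: sex
        return 1                          # leaf: predict survived
    else:
        if passenger["Age"] < 10:         # second split: age
            return 1                      # leaf: predict survived
        else:
            return 0                      # leaf: predict did not survive

print(toy_tree_predict({"Sex": "male", "Age": 8}))   # 1
print(toy_tree_predict({"Sex": "male", "Age": 40}))  # 0
[/python]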

As you might imagine, a single decision tree grown very deep may learn very irregular patterns and end up overfitting the training data, making it a poor predictor when faced with new data. So what Random Forests do is use a bagging technique: they build multiple decision trees by repeatedly resampling the training data with replacement, ending up with many different trees built from the same data set. To predict, we run each observation through every tree and see what the overall ‘consensus’ is, as sketched below. In the next section, we are going to implement this algorithm.
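Here is a minimal sketch of that bagging-plus-voting idea using scikit-learn's DecisionTreeClassifier; X and y are assumed to be NumPy arrays of features and 0/1 labels. (A full Random Forest also considers only a random subset of the features at each split, which RandomForestClassifier handles for us later.)

[python]
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X, y, X_new, n_trees=100):
    # X, y: NumPy arrays of training features and 0/1 labels
    n = len(y)
    votes = np.zeros((n_trees, len(X_new)))
    for i in range(n_trees):
        # resample the training data with replacement
        idx = np.random.randint(0, n, n)
        tree = DecisionTreeClassifier()
        tree.fit(X[idx], y[idx])
        votes[i] = tree.predict(X_new)
    # majority vote across all the trees
    return (votes.mean(axis=0) > 0.5).astype(int)
[/python]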


BACK TO THE TITANIC:
Below are the descriptions for all the variables included in the data set:

VARIABLE DESCRIPTIONS
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation

First things first, let’s load in the data and take a look at the first few rows using pandas.

[python]
import pandas as pd
training = pd.read_csv("train.csv")
# the following will print out the first 5 observations
print(training.head())
[/python]

Looking at the head of the data, it is clear that we are going to need to clean it up a little bit. We also have a number of missing values, which I will simply replace with the median of the respective feature (there are, of course, better and more in-depth ways of handling this). And since each variable is going to end up being a ‘node’ in a decision tree, our data set must consist solely of numerical values, so I convert Sex and Embarked to numbers. The code is displayed below; it will take either the training or test data set and clean it so it is ready for use by the Random Forest algorithm:

[python]
def clean_titanic(titanic, train):
    # fill in missing age and fare values with their medians
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    titanic["Fare"] = titanic["Fare"].fillna(titanic["Fare"].median())
    # make male = 0 and female = 1
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    # turn embarked into numerical classes
    titanic["Embarked"] = titanic["Embarked"].fillna("S")
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    # the training set keeps the Survived column; the test set does not have it
    if train:
        clean_data = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    else:
        clean_data = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    return titanic[clean_data]
[/python]

Now with our data cleaned, we need to import the package that contains the Random Forest algorithm, clean our data, and fit our data:

[python]
from sklearn.ensemble import RandomForestClassifier
data = clean_titanic(training, True)
# create a Random Forest model that builds 100 decision trees
forest = RandomForestClassifier(n_estimators=100)
# first column is the target (Survived); the remaining columns are the features
X = data.iloc[:, 1:]
y = data.iloc[:, 0]
forest = forest.fit(X, y)
[/python]

So ‘forest’ now holds our model fitted on the training data. Before we move on and test this model on new data, we should first see how it performs in sample, to make sure everything went reasonably smoothly.

[python]
output = forest.predict(X)
[/python]

When we compare these predictions to the actual observed outcomes, we are correct about 98% of the time.
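That 98% figure comes from a comparison along these lines (an in-sample sanity check only, since the model has already seen this data):

[python]
from sklearn.metrics import accuracy_score
# fraction of training passengers predicted correctly
print(accuracy_score(y, output))
# equivalently: print((output == y).mean())
[/python]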

When we use this model on the test data set that Kaggle provides, it scores around 75%, which is about mid-pack on the leaderboard. By the way, a logistic regression on this exact data scores about 2 percentage points higher.
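For completeness, here is a rough sketch of that test-set step; it assumes the Kaggle file is called "test.csv" and that the submission wants PassengerId and Survived columns.

[python]
test = pd.read_csv("test.csv")
X_test = clean_titanic(test, False)   # no Survived column in the test set
test_preds = forest.predict(X_test)
submission = pd.DataFrame({"PassengerId": test["PassengerId"],
                           "Survived": test_preds})
submission.to_csv("submission.csv", index=False)
[/python]

The logistic regression comparison is just a matter of swapping RandomForestClassifier for scikit-learn's LogisticRegression in the fitting step.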

We can also see which variables are most important to the model. The idea is pretty simple: the importance of the j-th feature is estimated by permuting that feature's values, recomputing the Out-of-Bag (OOB) prediction error, and comparing it to the error with the feature left intact. The bigger the increase in error, the more important the j-th feature is for accurate prediction. Let's take a look at which variables are most important in our data set:
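To give a feel for the permutation idea, here is a quick sketch that shuffles one column at a time and checks how much the accuracy drops. I am doing it in-sample for simplicity; the proper version uses each tree's out-of-bag samples.

[python]
import numpy as np

base_acc = (forest.predict(X) == y).mean()
for col in X.columns:
    X_perm = X.copy()
    # shuffle this one feature, leave the rest intact
    X_perm[col] = np.random.permutation(X_perm[col].values)
    drop = base_acc - (forest.predict(X_perm) == y).mean()
    print(col, round(drop, 3))
[/python]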

(Bar chart of feature importances for the fitted forest.)

Not surprisingly, we see that Fare, Age, and Gender are the most important features we have in our data that help determine whether or not somebody survived.

Here is the code for the graph:

[python]
import matplotlib.pyplot as plt
import numpy as np

# the fitted forest's built-in importance scores (one per feature)
importance = forest.feature_importances_

# first convert the column names from the pandas DataFrame to a python list
vars = []
for column in data:
    vars.append(column)
vars = vars[1:]  # drop 'Survived' -- it is the target, not a feature
print(vars)

imps = []
for imp in importance:
    imps.append(imp)
print(imps)

y_pos = np.arange(len(vars))

plt.barh(y_pos, imps, align='center', alpha=0.5)
plt.yticks(y_pos, vars)
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
[/python]

Okay, so that was our introduction to Random Forests: we are able to predict whether or not somebody survived the sinking of the Titanic with about 75% accuracy. Note that there is still a lot that can be improved and tinkered with here; I basically threw away half of the data set because the columns were strings.
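As one example of squeezing more out of those string columns, a title could be pulled out of the Name field and added as an extra numeric feature (a sketch, untested on the leaderboard):

[python]
# extract the title (Mr, Mrs, Miss, ...) from names like "Braund, Mr. Owen Harris"
training["Title"] = training["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
title_codes = {title: i for i, title in enumerate(training["Title"].unique())}
training["Title"] = training["Title"].map(title_codes)
# 'Title' would then be appended to the clean_data list inside clean_titanic
[/python]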