Match Play – Data Exploration Exercise

It has never been easier to access and analyze PGA Tour golf stats. BUT, despite the abundance of stroke play data, there is no data readily available for match play analysis. Considering all the analysis that goes into the Ryder Cup, Presidents Cup, and WGC Match Play, there should at least be SOMEBODY out there analyzing some real data, instead of Brandel just saying that Jimmy Walker will win a match because he rotates through the ball well (even though he does). My discontent was so strong that I actually took the time to develop a web scraping script that extracts useful Match Play information going back to 2008. The match play results are scraped from one source, and the relevant player data (strokes gained, world rank) from another.

In the end I was able to assemble my own dataset, using a variety of data structures along the way. If you're interested, I used the Beautiful Soup, pandas, and NumPy libraries in Python. I have also attached all the code right here in a zip file:

Python Scripts: Match_Play_Code

Using the above script, I was able to scrape the following information for every match played at the WGC Match Play from 2008 to 2016:

  • year match took place
  • name of each player involved in match
  • nationality of each player involved
  • world ranking of each player
  • strokes gained putting for each player at time of match
  • strokes gained driving for each player at time of match
  • and of course, who won the match

Okay, so finally we have some data that may give us some new insights into the mysterious world of match play.
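For the curious, the core of the parsing step looks roughly like this. It is a minimal sketch only: the HTML snippet and table layout are invented for illustration, and the real pages' markup (and the full script linked above) will differ.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical bracket HTML, standing in for a scraped results page.
html = """
<table id="bracket">
  <tr><td>2016</td><td>Jason Day</td><td>Rory McIlroy</td><td>Jason Day</td></tr>
  <tr><td>2016</td><td>Jordan Spieth</td><td>Louis Oosthuizen</td><td>Louis Oosthuizen</td></tr>
</table>
"""

def parse_matches(page):
    """Turn each table row into a match record (year, players, winner)."""
    soup = BeautifulSoup(page, "html.parser")
    matches = []
    for row in soup.find_all("tr"):
        year, p1, p2, winner = [td.get_text(strip=True) for td in row.find_all("td")]
        matches.append({"year": int(year), "player1": p1, "player2": p2, "winner": winner})
    return matches

matches = parse_matches(html)
```

From here, per-match records like these get joined with player stats (strokes gained, world rank) into the final dataset.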


The rest of this post is going to introduce us visually to the never-viewed Match Play data.

So let's get into it. First off, I am going to take a deeper look at strokes gained and see if it plays a role in predicting who will win a match. More interestingly, what do you think will be the stronger predictor: strokes gained putting, or strokes gained driving? I think it will be driving, just because I don't know how you can play against a guy like DJ hitting it 350 down the middle every hole and not get demoralized, even if you're Jordan Spieth.

Below, I have created a figure that shows the difference in strokes gained putting on the x-axis and the difference in strokes gained driving on the y-axis, with each point corresponding to a specific match. A red x indicates a match that was lost by the reference player (the one whose stats the opponent's are subtracted from). A green circle, as you probably guessed, means that the player whose perspective the axes are from won the match. I also quickly ran a logistic regression and plotted the linear decision boundary on the plot. A point that lies on the decision boundary is classified as a "toss-up" by the algorithm, meaning that there is a 50% chance either player wins. Points above the line are predicted 'wins,' while points below are predicted 'losses.'

[Figure: difference in strokes gained putting (x-axis) vs. difference in strokes gained driving (y-axis) for each match, with the logistic regression decision boundary]
We do see that strokes gained (putting or driving) is associated with winning. There are a lot more green dots than red in the top right of the figure, and more red crosses than green in the bottom left. The decision boundary matches our intuition nicely: our algorithm predicts that if a player has higher strokes gained putting and strokes gained driving (or any combination that lands above the blue line), he is most likely to win the match. But which one is more associated with winning, driving or putting? Well, the slope of the decision boundary is about -1.2, suggesting that strokes gained putting is slightly more important in deciding the outcome of the match. The idea is that a completely vertical line would indicate that only putting decides the outcome. Of course this is a back-of-the-envelope analysis, but it will have to suffice.
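To make the slope-reading concrete, here is a sketch of how a decision-boundary slope falls out of logistic regression coefficients. The data is synthetic with made-up weights (putting deliberately weighted more heavily than driving), not the real match dataset, and the fit is a bare-bones gradient ascent rather than the library routine I actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 2))   # columns: [sg_putt_diff, sg_drive_diff]
w_true = np.array([1.5, 1.0])            # assumed weights: putting matters more
p = 1 / (1 + np.exp(-(X @ w_true)))
y = rng.binomial(1, p)                   # simulated match outcomes

# Fit logistic regression by plain gradient ascent (no intercept, for brevity).
w = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (y - 1 / (1 + np.exp(-(X @ w)))) / len(y)
    w += 0.5 * grad

# Boundary: w[0]*putt + w[1]*drive = 0  =>  drive = -(w[0]/w[1]) * putt.
# A slope steeper than -1 means the putting coefficient outweighs driving.
slope = -w[0] / w[1]
```

With these assumed weights the recovered slope comes out near -1.5, i.e. steeper than -1, which is the same reading applied to the -1.2 boundary in the figure.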

Okay, so strokes gained looks like a decent candidate for a predictor; however, world rank is the one I would ultimately trust if I had to bet on a match. So below, I tried to visually assess whether win probability is an increasing function of how far ahead you are in the OWGR (Official World Golf Ranking). I divided the difference in world rankings in each match into various bins, shown on the x-axis. The height of each bar tells you what percentage of the time the higher-ranked player won the match for that bin. For example, in matches where the world ranking separation was between 0 and 4 (first bin), the higher-ranked player won 49% of the time. When the ranking difference was between 50 and 54, the higher-ranked player won 71% of the time.
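The binning itself can be sketched like this. The helper and its toy inputs are hypothetical, not the actual analysis code, but the calculation is the same: group matches by ranking gap, then take the favorite's win rate within each bin.

```python
import numpy as np

def win_rate_by_gap(rank_gaps, higher_won, bin_width=5, max_gap=60):
    """rank_gaps: world-ranking difference per match; higher_won: 1 if the
    higher-ranked player won. Returns {(lo, hi): win rate} per bin."""
    gaps = np.asarray(rank_gaps)
    wins = np.asarray(higher_won)
    rates = {}
    for lo in range(0, max_gap, bin_width):
        mask = (gaps >= lo) & (gaps < lo + bin_width)
        if mask.any():                       # skip empty bins
            rates[(lo, lo + bin_width - 1)] = wins[mask].mean()
    return rates

# Toy usage: two close matches split 1-1, three lopsided matches go 2-1.
rates = win_rate_by_gap([2, 3, 52, 53, 51], [0, 1, 1, 1, 0])
```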

[Figure: win percentage of the higher-ranked player, binned by world-ranking difference]

Another way of thinking of this is as the percentage of the time that basing a prediction solely on world ranking would give you the correct prediction.
There is an upward trend in the data. As the difference in world rankings becomes larger, the percentage of the time the higher-ranked player wins also increases (usually). The exception is clearly the second-to-last bin (please contact me if you can give me a reason). In case you are wondering, all of these bins have a sufficiently large and consistent amount of data. Despite the increasing trend, being over 60 world ranking positions ahead of someone only gives you a 70% chance of winning. Given that a 1 seed has never lost a first-round game in the NCAA Tournament, golf is looking pretty random. A more interactive version of this figure is located at this link.

Moving on…
Since this is a Match Play analysis, we need to at least address the Ryder Cup. We all know that the Europeans are "great" match play players, or so they say. To shoot this idea down, I quickly looked at the track record of USA vs. Europe over the past 8 years to see if this perception holds true in the WGC too.

[Figure: USA vs. Europe head-to-head record at the WGC Match Play, 2008–2016]
Well, there you have it. This only leaves two possibilities:

  1. The Ryder Cup is random,
  2. or the Euros are only good at ‘team’ match play and/or the US especially suck at team match play. (Tiger and Phil…)


Okay, so the last thing we are going to look at is simply who is the BEST match play player of the past 8 years. The figure below simply plots wins against losses. Only players who have completed at least 5 matches were included. Some of the points are labelled; to get the interactive version, click on this link!
[Figure: WGC Match Play wins vs. losses per player, 2008–2016, shaded by win percentage]
As expected, we see the top guys fairly low and to the right, indicating lots of wins relative to losses (the darker the dot, the higher the win percentage). Some notables:

  • Rory McIlroy –> 22 – 8
  • Jason Day –> 21 – 6
  • Adam Scott –> 3 – 10   WOW!!
  • Jordan Spieth –> 8 – 3
  • Dustin Johnson –> 8 – 10
  • Rickie Fowler –> 10 – 6
  • Bubba Watson –> 12 – 7  *I thought he was “too nice” for match play*
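Converting those records to win percentages (using the exact W-L numbers listed above) makes the comparison a bit easier:

```python
# W-L records copied from the list above.
records = {
    "Rory McIlroy": (22, 8),
    "Jason Day": (21, 6),
    "Adam Scott": (3, 10),
    "Jordan Spieth": (8, 3),
    "Dustin Johnson": (8, 10),
    "Rickie Fowler": (10, 6),
    "Bubba Watson": (12, 7),
}

# Win percentage = wins / matches played, per player.
win_pct = {name: w / (w + l) for name, (w, l) in records.items()}

# Print the notables from best to worst win percentage.
for name, pct in sorted(win_pct.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.0%}")
```

By this measure, Jason Day (21-6) edges out Rory despite having fewer total wins.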

Again, hit this link if you want to explore the data yourself. I really recommend it.


I did actually split the data in half, trained a full (using all features) logistic regression on one half, and tried to predict the other half out of sample. The model managed to predict the outcomes of 60% of the matches correctly. Over 18 holes of golf, I'm not sure it is possible to do much better without closely monitoring how well a player is playing coming into the event… hm.
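For reference, the out-of-sample accuracy calculation can be sketched like this, on made-up data. The "model" here is the simplest possible baseline (always pick the higher-ranked player), which needs no fitting, so the held-out half is scored directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
rank_gap = rng.integers(1, 60, size=n)          # world-ranking gap per match
# Hypothetical truth: the favorite's win probability rises with the gap.
higher_won = rng.random(n) < (0.5 + rank_gap / 200)

holdout = higher_won[n // 2 :]                  # second half held out
accuracy = holdout.mean()                       # baseline's out-of-sample accuracy
```

On this synthetic data the rank-only baseline lands in the mid-60s percent, which is a useful yardstick for whether a fitted model's 60% is actually adding anything.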

Simple Linear Regression – What the heck are we estimating?

It is likely that you've heard a coefficient from a linear regression described in several ways: "marginal effect," "average marginal effect," or "marginal effect on the average," to name a few. Well, which is it? Or are they all valid interpretations? The answer, of course, is: it depends.

Consider the simple linear regression model y_{i} = \beta_{0} + x_{i}\beta_{1} + u_{i}, where \beta_{0}, \beta_{1} are the population regression coefficients. That is, \beta = E(X_{i}X_{i}')^{-1} E(X_{i}y_{i}), where X_{i}'=[1 \   x_{i}]. Then, u_{i} is defined as y_{i} - \beta_{0} - x_{i}\beta_{1}. Note that u_{i} is by construction uncorrelated with x_{i} – this is a property of regression (and is easily verifiable by looking at the first-order conditions of the least-squares minimization problem).
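To spell out that verification: the population coefficients minimize E[(y_{i} - b_{0} - x_{i}b_{1})^2], and differentiating with respect to each coefficient gives the two first-order conditions

```latex
\frac{\partial}{\partial b_{0}}:\quad E\big[y_{i} - \beta_{0} - x_{i}\beta_{1}\big] = E[u_{i}] = 0,
\qquad
\frac{\partial}{\partial b_{1}}:\quad E\big[x_{i}(y_{i} - \beta_{0} - x_{i}\beta_{1})\big] = E[x_{i}u_{i}] = 0,
```

so Cov(x_{i}, u_{i}) = E[x_{i}u_{i}] - E[x_{i}]E[u_{i}] = 0, purely by construction.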

This may set off alarm bells. Isn't the main concern when performing regression analysis that there may be a correlation between x_{i} and u_{i}? Yes, this is a concern. However, you are thinking of x_{i} and the error term from a structural equation, that is, one that represents a causal relationship. Suppose the structural equation is y_{i} = \gamma_{0} + \gamma_{1}x_{i} + \epsilon_{i}, where \gamma_{1} is the causal effect of x on y. Because this is not a regression (i.e. \gamma is not a regression coefficient), there is no restriction on the relationship between x_{i} and \epsilon_{i}.

In most econometrics texts, regression is introduced in the context of a causal model, and thus E[\epsilon_{i}x_{i}]=0 is stated as an assumption. Then, when this assumption does not hold, we say that the regression estimates are "not consistent" estimators of the true parameters. Really, regression estimates are always consistent – they are consistent for the population regression coefficients. When the error term of the causal model is uncorrelated with x, the structural equation and the population regression are identical (i.e. \beta=\gamma). When the assumption fails, it is not that regression gives you inconsistent estimates; it's simply that the population regression you are estimating is not the same as the causal relationship you are interested in.
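A quick simulation makes this distinction tangible. All the numbers here are invented: a structural slope \gamma_{1}=2, and an error that is either clean or deliberately built to correlate with x. OLS converges to the population regression coefficient in both cases; that coefficient matches the causal one only in the clean case.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
x = rng.normal(size=n)
gamma0, gamma1 = 1.0, 2.0            # assumed structural (causal) parameters

def ols_slope(x, y):
    """Population-regression slope estimate: Cov(x, y) / Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x)

# Case 1: structural error uncorrelated with x -> beta1 = gamma1 = 2.
eps = rng.normal(size=n)
y = gamma0 + gamma1 * x + eps
b_clean = ols_slope(x, y)

# Case 2: error correlated with x (eps = 0.5x + v). The population
# regression slope is gamma1 + Cov(x, eps)/Var(x) = 2 + 0.5 = 2.5;
# OLS consistently estimates 2.5, not the causal 2.
eps = 0.5 * x + rng.normal(size=n)
y = gamma0 + gamma1 * x + eps
b_confounded = ols_slope(x, y)
```

In case 2 nothing about the estimator "broke"; it is simply estimating a different object than \gamma_{1}.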

Okay, that detour gives a lot of insight into how to interpret a regression coefficient. It's worth repeating that regression is purely a statistical relationship; it does not (necessarily) represent any structural or causal relationship between y and x. Regression is a mechanical exercise and can be applied to any set of variables. So how do we interpret \beta_{1}? First, let's not assume anything about what the structural model actually is. If E[y|x] is in fact a linear function, then X'\beta will be this conditional expectation function, and thus \beta_{1} is interpreted as the marginal effect of x on E[y|x]. Or, loosely speaking, the "marginal effect on the average." This is how you should interpret regression coefficients when you make no assumption concerning the underlying causal model. (The only assumption I made was that E[y|x] is linear – this does not concern the causal model – and if it is in fact nonlinear, then X'\beta provides the best linear approximation to E[y|x].)

Now, suppose the true structural equation is the one I specified earlier (and that E[x_{i}\epsilon_{i}]=0, so that \beta=\gamma). Then our OLS estimate of \beta_{1} can be interpreted as the causal effect of x on y (a "marginal effect"). So, there are times when a regression estimate can be thought of as a marginal effect of x on y; namely, when the structural relationship is linear, the causal effect is constant (the same for all individuals), and x is uncorrelated with all other variables that affect y.

I think introducing regression in a causal framework does a great disservice to students. It gives students the impression that causality and regression are intrinsically linked to each other, and clouds the fact that regression is a purely statistical exercise.

I guess I never made it to when regression estimates can be thought of as "average marginal effects" – that requires a structural model where the causal effect differs across individuals – but that can be saved for a later post!

Source: Mostly Harmless Econometrics (Angrist and Pischke 2009)