LAST UPDATED

Nov 22nd, 2018

Data Golf predictive model: methodology

[Contents:
Introduction,
Adjusting scores,
Predicting scores using total strokes-gained,
Incorporating detailed strokes-gained categories,
Course history/fit,
Model selection,
Adapting model for live predictions]

Introduction

In this document we describe the current methodology behind our predictive model and
discuss some interesting ideas and problems with prediction in golf more generally.
We have previously written about our first attempt at modelling golf
here,
which I would recommend reading, although it is not necessary for following this article. This document
is a little more technical than the previous one, so if you are struggling to follow along here it is probably
worth reading the first methodology blog.

The goal of this prediction exercise is to estimate probabilities of certain finish positions in golf tournaments (e.g. winning, finishing in the top 10). We are going to obtain these estimates by specifying a probability distribution for each golfer's scores. With those distributions in hand, the probability of any tournament result can be estimated through simulation. Let's dig in to the details.

We model each golfer's performance as normally distributed with some unknown mean and variance. These means can be thought of as the current "ability" of each golfer. Performance in golf is only meaningful in relation to other golfers: a 72 on one golf course could indicate a very different performance than a 72 on a different course. Therefore, throughout this analysis we focus on an adjusted strokes-gained measure (i.e. how many strokes better you were than some benchmark) that allows for direct comparisons of performance on any course. To return to our simple probability model of a golfer's performance, we can now more specifically say that we are modelling each golfer's *adjusted strokes-gained in a given round* as normally distributed with some mean and some variance.

An obvious, but critical, point is that our measure of performance is in units
of strokes per round. Strokes relative to the field are the currency of the game of golf: they
decide who wins golf tournaments. If we can accurately
specify each golfer's probability distribution of strokes-gained relative to some
benchmark, then we can accurately estimate probabilities of certain
events occurring in golf tournaments [1].

Adjusting Scores

The overall approach we take can be broken down as follows: first, we adjust raw scores from all professional golf tournaments
to obtain a measure of performance that is not confounded by the difficulty of the course it was played on. Second,
we use various statistical methods to estimate the player-specific means and variances (mentioned above) using all available data before
a round is played. Third and finally, we use these estimates to simulate golf tournaments and obtain the probabilities of interest.

Let's talk first about how to convert a set of raw scores into the more interpretable adjusted strokes-gained measure.
The approach we take roughly follows
Connolly and Rendleman (2008). We estimate the following regression:

$$ \normalsize (1) \>\>\>\>\>\>\>\> S_{ij} = \mu_{i}(t) + \delta_{j} + \epsilon_{ij} $$

where *i* indexes the golfer and *j* indexes a tournament-round (or a round played
on a specific course for multi-course tournaments), \( \normalsize S_{ij} \) is the raw score in a given tournament-round,
\( \normalsize \mu_{i}(t) \) is some player-specific function of "golf time" (i.e. the sequence of rounds for a golfer), and
\( \normalsize \delta_{j} \) is the coefficient from a dummy variable for tournament-round *j*. This regression
produces estimates of each golfer's ability at each point in time (\( \normalsize \mu_{i}(t) \)) and of
the difficulty of each course in each tournament-round (\( \normalsize \delta_{j} \)). \( \normalsize \mu_{i}(t) \) could be any function:
one simple functional form could just be a constant, which would mean we force each player's ability
to be constant throughout our sample time period (a strong assumption). In practice, we fit second or third
order polynomials, depending how many data points the player has in our sample, which allows each golfer's ability
to vary flexibly over time.
All that we actually care about from this regression are the estimates
of course difficulty, as we define our adjusted strokes-gained measure as \( \normalsize S_{ij} - \delta_{j} \).
The interpretation of a single \( \normalsize \delta_{j} \) is the expected score for some reference
player at the course tournament-round *j* was played on. (More intuitively, \( \normalsize \delta_{j} \) can be
thought of as the field average score in round *j* after accounting for the skill of each golfer in
that field.) Therefore, our adjusted strokes-gained
measure is interpreted as the performance relative to that reference point. The choice of a
reference player is arbitrary and not of great importance, so we typically make everything
relative to the average PGA Tour player in a given year. A final point about this specification:
there are no course-player effects (i.e. players are not allowed to "match" better with certain courses
than others). With respect to obtaining consistent estimates of the \( \normalsize \delta_{j} \) (which is our only goal here), this is likely not too
important [2].

With our adjusted strokes-gained measure in hand, the next step is to estimate the golfer-specific parameters: the mean and the variance
of their scoring distributions (at each point in time).
It would seem that the function \( \normalsize \mu_{i}(t) \) would be a good candidate for an estimate
of the mean of player *i*'s scoring distribution
[3]. It may be useful for some settings, but when your goal is
predicting out-of-sample, I don't think it is. Rather, we are going to estimate our player-specific means
using regression and backtesting.
It's worth noting that this method is not quite internally consistent. We require estimates of player ability
at each point in time to estimate the course difficulty parameters (\( \normalsize \delta_{j} \)), but we
do not actually use the player ability estimates from (1) to make predictions
[4].
You can think of the purpose of estimating (1) as only to recover the course difficulty parameters (\( \normalsize \delta_{j} \)),
from which we can calculate an adjusted strokes-gained measure for each round played in our sample.
The remainder of this document is concerned with how best
to predict these adjusted strokes-gained values with the available data at the time each round is played.
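
To make regression (1) concrete, here is a minimal sketch in Python. It assumes the simplest functional form mentioned above, a constant \( \mu_{i} \) per player (in practice we fit polynomials in golf time), and the function name, the toy data, and the choice of player 0 as the reference player are all hypothetical.

```python
import numpy as np

def estimate_course_difficulty(player_idx, round_idx, scores, n_players, n_rounds):
    """Least-squares fit of S_ij = mu_i + delta_j + eps_ij with constant mu_i.

    Player 0 is the (arbitrary) reference player: mu_0 is fixed at 0, so each
    delta_j is the expected score of player 0 in tournament-round j.
    """
    n_obs = len(scores)
    X = np.zeros((n_obs, (n_players - 1) + n_rounds))
    rows = np.arange(n_obs)
    # Player dummies, with the reference player's column omitted.
    has_dummy = player_idx > 0
    X[rows[has_dummy], player_idx[has_dummy] - 1] = 1.0
    # One dummy per tournament-round.
    X[rows, (n_players - 1) + round_idx] = 1.0
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    mu = np.concatenate([[0.0], coef[: n_players - 1]])
    delta = coef[n_players - 1:]
    return mu, delta

# Toy, noise-free data: 3 players who each play 2 tournament-rounds.
true_mu = np.array([0.0, 1.0, -1.5])
true_delta = np.array([70.0, 73.0])
player_idx = np.array([0, 1, 2, 0, 1, 2])
round_idx = np.array([0, 0, 0, 1, 1, 1])
scores = true_mu[player_idx] + true_delta[round_idx]

mu_hat, delta_hat = estimate_course_difficulty(player_idx, round_idx, scores, 3, 2)
# Adjusted measure as defined in the text: S_ij - delta_j.
adjusted = scores - delta_hat[round_idx]
```

Because the toy data contain no noise, the fit recovers the course difficulties exactly, and the adjusted values no longer depend on which round was played. With real data, the design matrix is only identified if players overlap across tournament-rounds.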

Predicting scores using historical total strokes-gained

In this section we give the overview of our predictive model and in the following two sections we
discuss the (potential) addition of a couple other features to the model.

The estimating sample includes data from 2010 onwards on the PGA Tour, Web.com Tour, and European Tour.
We use a regression framework to predict a golfer's adjusted score in a
tournament-round using only information available up to that date. This seems to be a good fit
for our goals with this model (i.e. predicting out-of-sample), while you could maybe argue
the model in (1) would be better at describing data in-sample.
In this first iteration of the model, the main input to predict strokes-gained is a golfer's
historical strokes-gained (seems logical enough, right?). We expect that recent strokes-gained performances are more
relevant than performances further into the past, but we will let the data decide
whether and to what degree that is the case. For now, suppose we have a weighting scheme:
that is, each round a golfer has played moving back in time has been assigned a weight.
From this we construct a weighted average
and use that to predict a golfer's adjusted strokes-gained in their
next tournament-round. Also used to form these predictions are the number of rounds that
the weighted average is calculated from, and the number of days since a golfer's last tournament-round.
More specifically, predictions are the fitted values from a regression of adjusted strokes-gained
in a given round on the set of predictors (weighted average SG up to that point in time,
rounds played up to that point in time, days since last tournament-round) and various interactions of these predictors.
The figure below summarizes the predictions the model makes: we plot fitted values as a function of how many rounds a golfer has played
for a few different values of the weighted strokes-gained average:

There are a couple of main takeaways here. First, even for golfers who have played a lot
(i.e. 150 rounds or more), there is some regression to the mean. That is,
if a golfer has a weighted average of +2 then our prediction for their next
tournament-round might be just +1.8.
Importantly, how much regression to the mean is present depends on
the weighting scheme. Longer-term weighting schemes (i.e. those that don't weight recent
rounds that much more than less recent ones) exhibit less regression to the mean, while shorter-term weighting schemes
exhibit more. This makes sense intuitively, as we would expect short-term
form to be less predictive than long-term form. However, what might be a little
less intuitive is the fact that these shorter-term weighting schemes can
outperform the longer-term ones. The reason is that although short-term
form is not as predictive as long-term form — in the sense that a 1 stroke increase in scoring average
over a shorter time horizon does not translate to an average increase in 1 stroke moving
forward, while a 1 stroke increase in long-term form more or less does — there is more
variance in short-term form across players
[5].

The second takeaway is the pattern of discounting as a function of the number of rounds played.
As you would expect, the smaller
the sample of rounds we have at our disposal, the more a golfer's past performance
is regressed to the mean. As the number of rounds goes to zero, our predictions converge
towards about -2 adjusted strokes-gained. It should also be pointed out that another input
to the model is which tour (PGA, Euro, or Web.com) the tournament is a part of: this has an impact
on very low-data predictions, as rookies / new players are generally of different quality on different tours.
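
As an illustration of what a weighting scheme looks like, here is a minimal sketch of a weighted average with exponentially decaying weights. Indexing the decay by round sequence (rather than calendar time) and the example values are assumptions made for the sketch, not a description of our exact implementation.

```python
def weighted_sg_average(sg_history, decay):
    """Weighted average of past adjusted strokes-gained.

    sg_history is ordered oldest to newest; the most recent round gets weight 1,
    the one before it gets `decay`, then decay**2, and so on (0 < decay <= 1).
    """
    n = len(sg_history)
    weights = [decay ** (n - 1 - k) for k in range(n)]
    total = sum(w * sg for w, sg in zip(weights, sg_history))
    return total / sum(weights)

# A golfer whose two most recent rounds were much better than the two before.
history = [0.0, 0.0, 2.0, 2.0]
long_term = weighted_sg_average(history, decay=1.0)   # equal weights
short_term = weighted_sg_average(history, decay=0.5)  # recent rounds dominate
```

With equal weights the average is 1.0, while the faster decay gives 1.6: a shorter-term scheme reacts more to recent form, which is why it needs to be regressed more heavily in the prediction step.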

The predicted values from this regression are our estimates for the player-specific means. What about player-specific variances? These are estimated by analyzing the residuals from the regression model above. The residuals are used because we want to estimate the part of the variance in a golfer's adjusted scores that is not due to their ability changing over time. We won't cover the details of estimating player-specific variances, but will make two general points. First, golfers for whom we have a lot of data have their variance parameter estimated just using their data, while golfers with less data available have their variance parameters estimated by looking at similar golfers. Second, estimates of variance are not that predictive (i.e. high-variance players in 2017 will tend to have lower variances in 2018). Therefore, we regress our variance estimates towards the tour average (e.g. a golfer who had a standard deviation of 3.0 in 2018 might be given an estimate of 2.88 moving forward).
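
The regression towards the tour average can be sketched as a simple linear shrinkage. The tour-average standard deviation (2.75) and the shrinkage factor (0.52) below are hypothetical values, chosen only so that the sketch reproduces the 3.0 to 2.88 example; the sketch also shrinks the standard deviation directly for simplicity.

```python
def shrink_towards_tour_average(player_sd, tour_sd, keep_fraction):
    """Move a golfer's estimated standard deviation part of the way to the tour average.

    keep_fraction = 1 keeps the golfer's own estimate; 0 uses the tour average.
    """
    return tour_sd + keep_fraction * (player_sd - tour_sd)

# Hypothetical inputs: tour-average SD of 2.75 and a keep fraction of 0.52.
forward_sd = shrink_towards_tour_average(player_sd=3.0, tour_sd=2.75, keep_fraction=0.52)
```

Under these assumed inputs the forward-looking estimate comes out at about 2.88.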

With our assumption of normality, along with estimates (or, predictions) of each golfer's mean adjusted strokes-gained and the variance in their adjusted strokes-gained, we can now easily simulate a golf tournament. Each iteration draws a score from each golfer's probability distribution, and through many iterations we can define the probability of some event (e.g. golfer A winning) as the number of times it occurred divided by the number of iterations.
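
A minimal sketch of this simulation is below. It assumes a four-round event with no cut, independent rounds, and a three-golfer field with made-up means and standard deviations; the function name and all inputs are hypothetical.

```python
import numpy as np

def simulate_win_probabilities(means, sds, n_rounds=4, n_sims=20000, seed=0):
    """Estimate each golfer's win probability by Monte Carlo simulation.

    means and sds are per-round adjusted strokes-gained parameters
    (positive strokes-gained is better, so the highest total wins).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    # One normal draw per simulation, round, and golfer.
    draws = rng.normal(means, sds, size=(n_sims, n_rounds, len(means)))
    totals = draws.sum(axis=1)        # shape (n_sims, n_golfers)
    winners = totals.argmax(axis=1)   # index of the winner in each simulation
    # Share of simulations won by each golfer.
    return np.bincount(winners, minlength=len(means)) / n_sims

probs = simulate_win_probabilities(means=[1.0, 0.0, -1.0], sds=[2.75, 2.75, 2.75])
```

Other finish probabilities (top 10, making the cut) follow the same pattern: rank the simulated totals in each iteration and count how often the event occurs.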

Incorporating detailed strokes-gained categories

The model above is fairly simple (which is a good thing).
But, given that total strokes-gained can be broken down into 4 categories, each of which
is (very conveniently) expressed in units of strokes per round, a logical next step
is to make use of this breakdown when attempting to predict total strokes-gained.
This will improve predictions if certain categories are more predictive
than others. For example, if strokes-gained off-the-tee (SG:OTT) is very predictive
of future SG:OTT, while strokes-gained putting (SG:PUTT) is not that predictive
of future SG:PUTT, then we should have different predictions for two players
who have both been averaging +2 total strokes-gained, but have achieved this
differently. More specifically, we would tend to predict that the golfer who
has gained the majority of those 2 strokes from their off-the-tee play will
stay near +2, while a golfer who gained the majority of their strokes through putting
would be predicted to move away from +2 (i.e. regress towards the mean).

Because total strokes-gained equals the sum of its parts
(off-the-tee (OTT), approach (APP), around-the-green (ARG), and putting (PUTT)),
we can do some nice regression exercises. Consider the following regression:

$$ \normalsize (2) \>\>\>\>\>\>\>\>TOTAL_{ij} = \beta_{1}\cdot OTT_{i,-j} + \beta_{2} \cdot APP_{i,-j} + \beta_{3} \cdot ARG_{i,-j} + \beta_{4} \cdot PUTT_{i,-j} + u_{ij} $$

where \( \normalsize TOTAL_{ij} \) is total adjusted strokes-gained for player *i* in tournament-round
*j*, and the 4 regressors are all defined similarly as some weighted average for
each category using all rounds up to but not including round *j* [6]. Therefore,
in this regression we are predicting total strokes-gained using a golfer's historical averages
in each category (all of which are adjusted [7]). We can also run 4 other regressions where we replace the dependent variable
here (\( \normalsize TOTAL_{ij} \)) with the golfer's performance in round *j* in each strokes-gained category (OTT, APP, etc.). This will have the nice property
that the 4 coefficients on \( \normalsize OTT_{i, -j} \) (for example) from the latter 4 regressions will add
up to \( \normalsize \beta_{1} \) in the regression above.

So what do we find? The coefficients are, roughly, \( \normalsize \beta_{1} = 1.2 \), \( \normalsize \beta_{2} = 1 \),
\( \normalsize \beta_{3} = 0.9 \), and \( \normalsize \beta_{4} = 0.6 \). Recall their interpretation: \( \normalsize \beta_{1} \) can be thought of
as the predicted increase in total strokes-gained from having a historical average SG:OTT that is 1 stroke higher,
*holding constant the golfer's historical performance in all other SG categories*. Therefore, the fact that
\( \normalsize \beta_{1} \) is greater than 1 is very interesting (or, worrisome?!). Why would a 1 stroke increase in
historical SG:OTT be associated with a greater than 1 stroke increase in future total strokes-gained? We can
get an answer by looking at our subregressions: using \( \normalsize OTT_{ij} \) as the dependent variable, the coefficient
is close to 1 (as we would perhaps expect), using \( \normalsize APP_{ij} \) the coefficient is 0.2, and for the other two categories
the coefficients are both roughly 0. So, if you take these estimates seriously (which we do; this is a robust result),
this means that historical SG:OTT performance has predictive power not only for future SG:OTT performance, but also
for future SG:APP performance. That is interesting. This means that for a golfer who is currently averaging
+1 SG:OTT and 0 SG:APP, we should predict their future SG:APP to be something like +0.2. A possible story here is that
a golfer's off-the-tee play provides some signal about a golfer's general ball-striking ability (which we would
define as being useful for both OTT and APP performance). The other coefficients
fall in line with our intuition: putting is the least predictive of future performance.
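
The adding-up property of the subregressions is purely algebraic, and can be checked on any data. The sketch below uses simulated data (not our sample) and regressions without an intercept, as in (2); the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Columns: historical averages in OTT, APP, ARG, PUTT (simulated).
hist = rng.normal(size=(n, 4))
# Round-level performance in each category (simulated), and its total.
future = 0.5 * hist + rng.normal(size=(n, 4))
total = future.sum(axis=1)

# Regression (2): total strokes-gained on the four historical averages.
beta_total, *_ = np.linalg.lstsq(hist, total, rcond=None)
# Four subregressions at once: column k holds the coefficients from
# regressing category k's performance on the same four regressors.
beta_parts, *_ = np.linalg.lstsq(hist, future, rcond=None)

# For each regressor, the four subregression coefficients sum to its
# coefficient in the total regression.
summed = beta_parts.sum(axis=1)
```

This works because least-squares coefficients are linear in the dependent variable, and the total is the sum of the four categories. It is what lets us decompose the 1.2 on historical SG:OTT into roughly 1 (future OTT), 0.2 (future APP), and 0 for the other two categories.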

How can we incorporate this knowledge into our predictive model to improve its performance? The main takeaway from the work above is that the strokes-gained categories differ in their predictive power for future strokes-gained performance (with *OTT > APP > ARG > PUTT*). However, a difficult practical issue is that we only have data on detailed
strokes-gained performance for a subset of our data: namely PGA Tour events that have ShotLink set up on-site.
We incorporate our findings above by using a reweighting method for each round that has detailed strokes-gained data available;
if the SG categories aren't available, we simply use total strokes-gained. In this reweighting method, if there were two rounds that both
were measured as +2 total strokes-gained, with one mainly due to off-the-tee play while the other was mainly due to putting, the former
would be increased while the latter would be decreased.
To determine which weighting works best, we just evaluate out-of-sample fit (discussed below). That's
why prediction is relatively easy, while causal inference is hard.
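
One way such a reweighting could look is sketched below. The category weights are hypothetical: they simply reuse the rough coefficients from (2) for illustration, whereas the weights we actually use are chosen by out-of-sample fit. The function and dictionary names are also made up for the example.

```python
# Hypothetical category weights, borrowed from the rough coefficients in (2).
CATEGORY_WEIGHTS = {"ott": 1.2, "app": 1.0, "arg": 0.9, "putt": 0.6}

def reweighted_round(total_sg, categories=None, weights=CATEGORY_WEIGHTS):
    """Return the value a round contributes to a golfer's historical average.

    categories maps each SG category to strokes-gained in that round, or is
    None when no detailed (ShotLink) breakdown is available.
    """
    if categories is None:
        return total_sg  # fall back to total strokes-gained
    return sum(weights[k] * categories[k] for k in weights)

# Two +2 rounds: one driven by off-the-tee play, one driven by putting.
driver_round = reweighted_round(2.0, {"ott": 1.5, "app": 0.5, "arg": 0.0, "putt": 0.0})
putter_round = reweighted_round(2.0, {"ott": 0.0, "app": 0.5, "arg": 0.0, "putt": 1.5})
no_detail_round = reweighted_round(2.0)
```

With these assumed weights the off-the-tee round is credited with more than +2 and the putting round with less, while a round without the breakdown is left at +2.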

Incorporating course fit (or not?)

We have argued in the
past that there is no statistically responsible way to incorporate
course history into a predictive model. But, after watching the Americans get slaughtered
at the 2018 Ryder Cup at Le Golf National in France, we came away thinking that course
fit was something we had to try to incorporate into our models. Spoiler alert: we tried
two approaches, and failed. In the first
approach, we tried to correlate a golfer's historical strokes-gained performance in the different categories (OTT, APP, etc.)
with that golfer's performance at a specific course. The logic here is that perhaps
certain courses favour players who are good drivers of the ball (i.e. good SG:OTT [8]), while other courses
favour players who are better around the greens (i.e. good SG:ARG). For some courses we have a reasonable
amount of data (e.g. 9 years' worth for events that have been hosted on the same course
since 2010). The problem is that, even for these relatively high data courses,
the results are still very noisy.
It is true that
if you run the regression as in (2) separately for each course in the data,
you will find results that are different (to a *statistically significant* degree) from
the baseline result in (2). For example, instead of SG:APP having a coefficient of 1,
for some courses we will find it has a coefficient of 0.5. Should we take these estimates
at face value? No, I don't think so. Statistical significance is not very meaningful
at the best of times, and especially not when you are running many regressions:
of course you will find some statistically significant
differences if you have 30 courses in your data and 4 variables per regression. The ultimate
proof is in whether this additional information improves your out-of-sample predictive
performance, and in our case it did not.
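
The multiple-comparisons point can be made with a quick back-of-the-envelope calculation. The 5% significance level and the independence of the tests are assumptions made for the arithmetic; the 30 courses and 4 variables come from the example above.

```python
n_tests = 30 * 4   # 30 courses, 4 coefficients per course regression
alpha = 0.05       # assumed significance level

# Expected number of "significant" coefficients if there is no course fit at all.
expected_false_positives = n_tests * alpha

# Probability of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests
```

Under these assumptions we would expect about 6 spuriously significant coefficients, and the chance of seeing at least one is above 99%, so finding a handful of "significant" course effects is not evidence of anything on its own.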

Our second attempt at incorporating course fit involved trying to group courses
together that have similar characteristics, and then essentially doing a course history
exercise except using a golfer's historical performance on the group of similar courses instead of just
a single course. We have done
this exercise before using course groupings based off course length.
This time we tried grouping courses using clustering algorithms,
where the main characteristics again involved the detailed strokes-gained categories (e.g. the % of variance
in total scores that was explained by each category).
Ultimately I do think this is the way to go if you want to
incorporate course fit: if you had detailed course data (perhaps about average fairway width,
length, etc.) you could potentially make more natural groupings than we did. Unfortunately in our
case, with the course variables we used, it was again mostly a noise mine. This has left us thinking that
there is not an effective way to systematically incorporate course fit into
our statistical models. The sample sizes are too small, and the measures of
course similarity too crude, to make much headway on this problem. That's not to say
that course history doesn't exist; it probably does. But to separate the signal
from the noise is very hard.

Model evaluation and selection

Given the analysis and discussion so far, we can now think of having a set of models to choose from
where differences between models are defined by a few parameters. These parameters are
the choice of weighting scheme on the historical strokes-gained averages (this involves just a single parameter that determines the rate of exponential decay
moving backwards in time), and also the weights that are used to incorporate the detailed strokes-gained
categories through a reweighting method.

The optimal set of parameters is selected through brute force: we loop through all possible combinations of parameters,
and for each set of parameters we evaluate the model's performance through a cross validation exercise.
This is done to avoid overfitting: that is, choosing a model that fits the estimating data very well but does not
generalize well to new data. The basic idea is to divide your data into a "training" set
and a "testing" set. The training set is used to estimate the parameters of your model (for our model,
this is basically just a set of regression coefficients [9]),
and then the testing set is used to evaluate the predictions
of the model. We evaluate the models using mean-squared prediction error, which in this context is defined as
the difference between our predicted strokes-gained and the observed strokes-gained, squared and then averaged.
Cross validation involves repeating this process several times (i.e. dividing your sample into training and
testing sets) and averaging the model's performance on the testing sets.
This repetitive process is again done to avoid overfitting. The model that performs the best in the cross validation
exercise should (hopefully) be the one that generalizes the best to new data. That is, after all, the goal of
our predictive model: to make predictions for tournament outcomes that have not occurred yet.
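
The brute-force search can be sketched as follows for a single parameter, the decay rate. The sketch uses simulated golfers, a plain regression of next-round strokes-gained on the weighted average, and random K-fold splits; the data, the grid, and the function names are all hypothetical, and the full model also searches over the category weights.

```python
import numpy as np

def kfold_mse(X, y, n_folds=5, seed=0):
    """Average out-of-sample mean-squared prediction error of OLS over n_folds splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Estimate coefficients on the training set only.
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Evaluate squared prediction error on the held-out testing set.
        errors.append(np.mean((y[test_idx] - X[test_idx] @ coef) ** 2))
    return float(np.mean(errors))

def select_decay(histories, next_sg, decay_grid):
    """Loop over candidate decay rates and score each by cross validation."""
    n_hist = histories.shape[1]
    results = {}
    for decay in decay_grid:
        # Columns are ordered oldest to newest; the newest round gets weight 1.
        w = decay ** np.arange(n_hist - 1, -1, -1)
        avg = histories @ w / w.sum()
        X = np.column_stack([np.ones(len(avg)), avg])
        results[decay] = kfold_mse(X, next_sg)
    best = min(results, key=results.get)
    return best, results

# Simulated data: 2000 golfers with 20 past rounds each and constant ability.
rng = np.random.default_rng(42)
ability = rng.normal(0.0, 1.0, size=2000)
histories = ability[:, None] + rng.normal(0.0, 2.75, size=(2000, 20))
next_sg = ability + rng.normal(0.0, 2.75, size=2000)

decay_grid = [0.5, 0.8, 0.95, 1.0]
best, results = select_decay(histories, next_sg, decay_grid)
```

In this simulation ability never changes, so slow decay should tend to do well; with real data, where form drifts over time, the preferred decay rate is an empirical question that the cross validation exercise answers.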

One thing that becomes clear when testing different parameterizations is how similarly they perform overall
despite disagreeing in their predictions quite often.
This is troubling if you plan to use your model
to bet on golf. For example, suppose you and I both have models that
perform pretty similarly overall (i.e. have similar mean-squared prediction error), but also
disagree a fair bit on specific predictions. This means that both of our models would find what
we perceive to be "value" in betting on some outcome against the other's model. However, in reality,
there is not as much value as you think: roughly half of those discrepancies will be cases
where your model is "incorrect" (because we know, overall, that the two models fit the data similarly).
This is not exactly a deep insight: it simply means that to assume your model's odds
as *truth* is an unrealistic best-case scenario for calculating expected profits.

The model that we select through the cross validation exercise has a weighting scheme that I would classify as "medium-term": rounds played 2-3 years ago do receive non-zero weight, but the rate of decay is fairly quick. Compared to our previous models this version responds more to a golfer's recent form. In terms of incorporating the detailed strokes-gained categories, past performance that has been driven more by ball-striking, rather than by short-game and putting, will tend to have less regression to the mean in the predictions of future performance.

Adapting model for live predictions

To use the output of this model — our pre-tournament estimates of the mean and variance
parameters that define
each golfer's scoring distribution — to make live predictions as a golf
tournament progresses, there are a few challenges to be addressed.

First, we need to convert our round-level scoring estimates to hole-level scoring estimates. This is accomplished using an approximation which takes as input our estimates of a golfer's round-level mean and variance and gives as output the probability of making each score type on a given hole (i.e. birdie, par, bogey, etc.).

Second, we need to take into account the course conditions for each golfer's
remaining holes. For this we track the field scoring averages
on each hole during the tournament, weighting recent scores
more heavily so that the model can adjust quickly to
changing course difficulty during the round. (Of course, there is a tradeoff here
between sample size and the model's speed of adjustment.) Another important detail in
a live model is
allowing for uncertainty in future course conditions. This matters mostly
for estimating cutline probabilities accurately, but does also matter for
estimating finish probabilities. If a golfer has 10 holes
remaining, we allow for the possibility that these remaining 10 holes
play harder or easier than they have played so far (due to wind picking up
or settling down, for example). We incorporate this uncertainty
by specifying a normal distribution for each hole's future scoring average, with
a mean equal to its scoring average so far, and a
variance that is calibrated from historical data [10].

The third challenge is updating
our estimates of player ability as the tournament progresses. This can be
important for the golfers that we had very little data on pre-tournament.
For example, if for a specific golfer we only have 3 rounds to make
the pre-tournament prediction, then by the fourth round of the tournament
we will have doubled our data on this golfer! Updating the estimate
of this golfer's ability seems necessary. To do this, we have a rough model
that takes 4 inputs: a player's pre-tournament prediction, the number of
rounds that this prediction was based off of, their performance
so far in the tournament (relative to the appropriate benchmark),
and the number of holes played so far in the tournament. The predictions
for golfers with
a large sample size of rounds pre-tournament will not be adjusted very
much: a 1 stroke per round increase in performance during the tournament translates
to a 0.02-0.03 stroke increase in their ability estimate (in units of
strokes per round). However, for a very low data player, the ability update could be
much more substantial (1 stroke per round improvement could translate to 0.2-0.3 stroke updated ability
increase).
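
To give a feel for the magnitudes involved, here is a minimal sketch of one possible update rule: a sample-size-weighted average of the pre-tournament prediction and in-tournament performance. This specific formula is an assumption made for illustration, not the rough model itself, although it happens to produce updates of a similar size to those quoted above.

```python
def updated_ability(pre_pred, pre_rounds, tourn_sg_per_round, holes_played):
    """Blend the pre-tournament prediction with in-tournament performance.

    The weight on in-tournament performance is its share of the total
    number of rounds observed (an assumed, simple weighting).
    """
    tourn_rounds = holes_played / 18
    weight = tourn_rounds / (pre_rounds + tourn_rounds)
    return pre_pred + weight * (tourn_sg_per_round - pre_pred)

# Both golfers were predicted at +1.0 and have played 1 stroke per round
# better than that through 54 holes.
veteran = updated_ability(pre_pred=1.0, pre_rounds=150, tourn_sg_per_round=2.0, holes_played=54)
newcomer = updated_ability(pre_pred=1.0, pre_rounds=10, tourn_sg_per_round=2.0, holes_played=54)
```

Under this assumed rule the golfer with 150 prior rounds moves by about 0.02 strokes per round, while the golfer with only 10 prior rounds moves by about 0.23.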

With these adjustments made, all of the live probabilities of interest can be estimated through simulation. For this simulation, in each iteration we first draw from the course difficulty distribution to obtain the difficulty of each remaining hole, and then we draw scores from each golfer's scoring distribution taking into account the hole difficulty.
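
The two-step draw can be sketched as below for the final round of a tournament. Several simplifications are assumptions made for the sketch: hole scores are treated as continuous and normal rather than as discrete score types, hole difficulties are drawn independently of each other, and the standard deviation of hole difficulty (0.1) and all golfer inputs are made-up values.

```python
import numpy as np

def live_win_probabilities(current, holes_done, sg_mean, sg_sd,
                           hole_avg, hole_sd, n_sims=10000, seed=0):
    """Estimate win probabilities with some holes of the final round still to play."""
    rng = np.random.default_rng(seed)
    current = np.asarray(current, dtype=float)
    holes_done = np.asarray(holes_done)
    sg_mean = np.asarray(sg_mean, dtype=float)
    sg_sd = np.asarray(sg_sd, dtype=float)
    hole_avg = np.asarray(hole_avg, dtype=float)
    n_golfers, n_holes = len(current), len(hole_avg)
    # remaining[g, h] is True if golfer g still has to play hole h.
    remaining = np.arange(n_holes)[None, :] >= holes_done[:, None]
    # Step 1: draw the difficulty of every hole, shared by all golfers.
    difficulty = rng.normal(hole_avg, hole_sd, size=(n_sims, n_holes))
    # Step 2: draw each golfer's score on each hole around that difficulty,
    # spreading round-level skill and variance evenly across the holes.
    noise = rng.normal(size=(n_sims, n_golfers, n_holes))
    noise *= sg_sd[None, :, None] / np.sqrt(n_holes)
    hole_scores = difficulty[:, None, :] - sg_mean[None, :, None] / n_holes + noise
    final = current + (hole_scores * remaining).sum(axis=2)
    winners = final.argmin(axis=1)  # lowest total score wins
    return np.bincount(winners, minlength=n_golfers) / n_sims

live_probs = live_win_probabilities(
    current=[36.0, 36.0, 71.0],   # strokes taken so far in the round
    holes_done=[9, 9, 18],        # the third golfer has finished
    sg_mean=[2.0, -2.0, 0.0],     # strokes-gained per round
    sg_sd=[2.75, 2.75, 2.75],
    hole_avg=[4.0] * 18,          # field scoring average on each hole so far
    hole_sd=0.1,
)
```

Because the difficulty draw is shared across golfers, it mostly matters when golfers have different holes left to play, or when the event is defined relative to a fixed number such as the cutline.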
