This data visualization explores the variation in scores on the PGA Tour in 2016. Scores vary because of differences between players, courses, and the day-to-day variation inherent to golf. There is a lot of information here, so take your time to understand it.
The Statistical Details
This visualization starts by asking you to choose whether you want to analyze things by player or course. I’ll go through the player decomposition first, as it is more straightforward.
The starting point is the Law of Total Variance:

$$\operatorname{Var}(Y) = \operatorname{Var}\big(E[Y|X]\big) + E\big[\operatorname{Var}(Y|X)\big]$$
where Y and X are random variables. The first term on the right hand side is called the “explained variance” (that is, explained by X) and the second term is the “unexplained variance”. In the viz, we refer to these terms as the “Between” and “Within” components, respectively.
In our context, Y is adjusted strokes-gained in a given round, and X will be a vector of player indicator variables. So, the “Between” component is the variance in players’ adjusted scoring averages (E[Y|X]) and the “Within” component is the average variance of individual players’ scores.
Estimating these two components is simple; linear regression can do it for us. To see this, consider the regression:

$$Y = X\beta + U$$
where Y and X are defined as above, and U is the regression residual (i.e. uncorrelated with X by construction). Because X is a vector of indicator variables, we know that $E[Y|X] = X\beta$; this is a property of regression that we can invoke because we know the conditional expectation is linear when X is a set of dummy variables. So, we have:

$$\begin{aligned}
\operatorname{Var}(Y) &= \operatorname{Var}\big(E[Y|X]\big) + E\big[\operatorname{Var}(Y|X)\big] \\
&= \operatorname{Var}(X\beta) + E\big[\operatorname{Var}(U|X)\big] \\
&= \operatorname{Var}(X\beta) + \operatorname{Var}(U)
\end{aligned}$$
where the last line follows because E[U|X] is zero by construction in regression, and Var(Y|X) = Var(U|X) by the definition of U. Long story short, to get the aforementioned “Between” and “Within” components, we simply regress adjusted scores on a set of player dummies, and then the variance of the fitted values is our “Between” component, and the variance of the residuals is our “Within” component. When we express them as percentages, the “Between” component percentage is just the R-squared.
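To make this concrete, here is a minimal sketch in Python using simulated scores and a hypothetical `between_within` helper (not the actual Tour data): because regressing on a full set of player dummies just fits each player's mean, the decomposition reduces to group means.

```python
import numpy as np

def between_within(scores, player_ids):
    """Decompose Var(Y) into between-player and within-player components.

    Regressing Y on player dummies yields fitted values equal to each
    player's mean, so the decomposition reduces to group means.
    """
    scores = np.asarray(scores, dtype=float)
    player_ids = np.asarray(player_ids)
    fitted = np.empty_like(scores)
    for p in np.unique(player_ids):
        mask = player_ids == p
        fitted[mask] = scores[mask].mean()
    residuals = scores - fitted
    between = fitted.var()               # Var(E[Y|X])
    within = residuals.var()             # E[Var(Y|X)]
    r_squared = between / scores.var()   # "Between" share of total variance
    return between, within, r_squared

# Simulated example: 3 players with different means and spreads
rng = np.random.default_rng(0)
ids = np.repeat(["A", "B", "C"], 50)
y = np.concatenate([rng.normal(m, s, 50) for m, s in [(0, 2), (1, 3), (-1, 2.5)]])
b, w, r2 = between_within(y, ids)
# The two components always sum exactly to the total variance
assert np.isclose(b + w, y.var())
```

Note that the between and within components sum exactly to the total variance; this is the Law of Total Variance in its sample form, and the between share is the regression R-squared.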
Okay, moving on. Next, for each player, we break down their “Within” variance. For each round we know that:

$$SG_{TOT} = SG_{OTT} + SG_{APP} + SG_{ARG} + SG_{PUTT}$$
where all of these terms are adjusted for course difficulty. So, we want to break down the total variance in a given player’s adjusted scoring into the components contributed by each part of the game. Consider this:

$$\operatorname{Var}(SG_{TOT}) = \operatorname{Cov}(SG_{TOT},\, SG_{TOT}) = \operatorname{Cov}(SG_{OTT} + SG_{APP} + SG_{ARG} + SG_{PUTT},\; SG_{TOT})$$
Then, we say the fraction of variance in $SG_{TOT}$ due to variation in $SG_{OTT}$ is equal to:

$$\frac{\operatorname{Cov}(SG_{OTT},\, SG_{TOT})}{\operatorname{Var}(SG_{TOT})}$$
This has been coined an “ensemble” decomposition. Notice that:

$$\operatorname{Cov}(SG_{OTT},\, SG_{TOT}) = \operatorname{Var}(SG_{OTT}) + \operatorname{Cov}(SG_{OTT},\, SG_{APP}) + \operatorname{Cov}(SG_{OTT},\, SG_{ARG}) + \operatorname{Cov}(SG_{OTT},\, SG_{PUTT})$$
So we attribute one of each of the covariance terms to $SG_{OTT}$ (recall that if you write out $\operatorname{Var}(SG_{TOT})$ you would have 2 of each of the covariance terms in the above expression). If the covariance terms are small, then the contribution of $SG_{OTT}$ is simply:

$$\frac{\operatorname{Var}(SG_{OTT})}{\operatorname{Var}(SG_{TOT})}$$
which is very intuitive. This decomposition is done for every player with at least 30 rounds played in 2016. In practice, the covariance terms do matter a bit for the within-player decompositions.
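As an illustration, the ensemble shares can be computed directly from sample covariances. This is a sketch with simulated strokes-gained data and a hypothetical `ensemble_shares` helper; by bilinearity of covariance the shares always sum to exactly 1, with or without small covariance terms.

```python
import numpy as np

def ensemble_shares(categories):
    """Share of Var(SG_TOT) attributed to each category, where SG_TOT
    is the sum of the category columns.

    Each share is Cov(SG_k, SG_TOT) / Var(SG_TOT); summing over k
    recovers Cov(SG_TOT, SG_TOT) / Var(SG_TOT) = 1.
    """
    cats = {k: np.asarray(v, dtype=float) for k, v in categories.items()}
    total = sum(cats.values())
    var_total = total.var()
    return {k: np.cov(v, total, ddof=0)[0, 1] / var_total
            for k, v in cats.items()}

# Hypothetical per-round strokes-gained data for one player
rng = np.random.default_rng(1)
sg = {k: rng.normal(0, s, 40) for k, s in
      [("OTT", 0.9), ("APP", 1.2), ("ARG", 0.6), ("PUTT", 1.0)]}
shares = ensemble_shares(sg)
assert np.isclose(sum(shares.values()), 1.0)
```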
To finish this section off, we need to briefly discuss the “Between” decomposition. It proceeds exactly as above, except each data point is a player’s year-long average. Here, the covariance terms are negligible. The % contribution of each SG category tells us how much of the variance in year-long SG averages is due to each SG category.
For the course decomposition we use raw scores. To break things into “Between” and “Within” course variation, we regress raw scores on a set of course dummy variables.
The decompositions proceed in the same manner as described for the Player section. However, I do want to explain how we calculate the course averages for each category and how to interpret them. With players, this is very simple – just the year-long averages in each SG category, and total SG (all adjusted for course difficulty). For courses, obtaining interpretable averages is a little more involved.
First, I start out with the baseline strokes-gained numbers (both total and for each category). Baseline SG is how much better you are playing than a “baseline function” which uses historical PGA Tour data to estimate the average number of strokes it takes to hole out from each distance and location (fairway, rough, sand, etc.). At easier courses, baseline SG for the field will have a positive average; this means that all players are on average gaining strokes relative to the baseline function (so, for example, a 400 yard hole at this course is easier than the typical 400 yard hole). An important point to account for is the fact that fields are not the same quality at all courses; therefore we correct for this by estimating fixed effects regressions for each SG category. This proceeds in an analogous manner to that discussed here in our predictive model setup. Take the SG:OTT category; a given course fixed effect from this regression will be the average strokes-gained off-the-tee relative to the baseline function for a typical field, at that course.
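As a rough sketch of the field-strength correction, here is a toy two-way regression with a hypothetical `course_fixed_effects` helper and made-up data (the actual model's specification may differ): including player dummies alongside course dummies purges the course effects of field quality.

```python
import numpy as np

def course_fixed_effects(sg, player_ids, course_ids):
    """Course fixed effects from regressing baseline SG on player and
    course dummies, so course difficulty is purged of field strength."""
    sg = np.asarray(sg, dtype=float)
    player_ids = np.asarray(player_ids)
    course_ids = np.asarray(course_ids)
    players = np.unique(player_ids)
    courses = np.unique(course_ids)
    # Intercept + player dummies (first omitted) + course dummies (first omitted)
    X = np.ones((len(sg), len(players) + len(courses) - 1))
    for j, p in enumerate(players[1:]):
        X[:, 1 + j] = (player_ids == p)
    for j, c in enumerate(courses[1:]):
        X[:, len(players) + j] = (course_ids == c)
    beta, *_ = np.linalg.lstsq(X, sg, rcond=None)
    # Effects are relative to the omitted course
    return dict(zip(courses[1:], beta[len(players):]))

# Toy, noise-free example: course "B" shifts every score by exactly 1 stroke
player_ids = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])
course_ids = np.array(["A", "B", "A", "B", "A", "B"])
ability = {"p1": 0.0, "p2": 1.0, "p3": -1.0}
sg = np.array([ability[p] for p in player_ids]) + (course_ids == "B") * 1.0
fe = course_fixed_effects(sg, player_ids, course_ids)
assert abs(fe["B"] - 1.0) < 1e-8
```

The recovered course effect for “B” is 1 stroke even though the players visiting each course differ in ability, which is exactly the point of the fixed-effects adjustment.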
So now, for each course, we have the raw scoring average adjusted for field strength, and also the baseline strokes-gained for each SG category, also adjusted for field strength. But, the sum of the adjusted baseline SG category averages does not equal the adjusted raw scoring average. Why is that? The discrepancy is due to the differing distances of courses. Remember that the baseline function takes only distance and type of shot as inputs. When considering the total baseline strokes-gained at a course, all that is relevant is distance (as every hole starts from the same spot, a tee box). Therefore, if 2 courses both have the same baseline total strokes-gained averages, then any difference in the raw scoring averages has to be due to distance. So, the contribution of distance to the adjusted raw scoring average is defined as a residual; the number of strokes not accounted for by total baseline SG average.
Let’s go through an example to make this a little more concrete.
The raw scoring average at TPC Sawgrass in 2016 was 72.05. This is 0.96 strokes higher than the annual raw scoring average on Tour in 2016 (71.09). So what makes up this 0.96 stroke difference in raw scoring average?
First, the adjustment for field strength shows that the PLAYERS field in 2016 was 0.51 strokes better than a typical field (that is, the average player in the field would be expected to gain 0.51 strokes over a typical Tour player on the same course). Therefore, we should expect, all else equal, that this field would score 0.51 strokes better than the Tour average. But we in fact observed a raw scoring average that was 0.96 strokes higher than the Tour average. Therefore, we now have a discrepancy of 0.96 + 0.51 = 1.47 strokes to account for.
Next, the fixed effects regressions show that, at TPC Sawgrass in 2016, the baseline SG:OTT average was 0.21 strokes harder, the SG:APP average 0.51 strokes harder, the SG:ARG average 0.39 strokes harder, and the SG:PUTT average 0.35 strokes harder than at the typical course in 2016. Putting these numbers together, total strokes-gained relative to baseline was 1.46 strokes harder than average. Therefore, we have only a 0.01 stroke discrepancy (1.47 − 1.46) between the adjusted raw scoring average and the adjusted total baseline SG; as explained previously, this difference must be due to the distance of the course. Because TPC Sawgrass is almost exactly an average-length course on Tour, its distance contributes basically nothing (0.01 strokes) to its differential raw scoring average.
And there you have it – we have broken the 0.96 strokes higher raw scoring average at TPC Sawgrass into 6 components: 1) field strength (0.51 strokes lower), 2) Baseline SG:OTT (0.21 strokes higher), 3) Baseline SG:APP (0.51 strokes higher), 4) Baseline SG:ARG (0.39 strokes higher), 5) Baseline SG:PUTT (0.35 strokes higher), and 6) Distance of the course (0.01 strokes higher).
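The arithmetic of this six-component breakdown can be verified directly (signs follow the convention that a stronger field lowers the scoring average while the other components raise it):

```python
# Components of TPC Sawgrass's 0.96-stroke-higher raw scoring average (2016)
field_strength = -0.51   # stronger-than-typical field lowers the scoring average
baseline_sg = {"OTT": 0.21, "APP": 0.51, "ARG": 0.39, "PUTT": 0.35}
distance = 0.01          # residual attributed to course length

total = field_strength + sum(baseline_sg.values()) + distance
assert round(total, 2) == 0.96
```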
A final note is that we exclude field strength differences in the reported ‘Between Course’ decomposition, as well as the ‘2016 Averages’, in the data viz. We think this is just more informative; we want to know how difficult each part of the course is, and field strength only serves to cloud this information.