Simple Linear Regression – What the heck are we estimating?

You have likely heard a coefficient from a linear regression described in several ways: “marginal effect”, “average marginal effect”, or “marginal effect on the average”, to name a few. Well, which is it? Or are they all valid interpretations? The answer, of course, is that it depends.

Consider the simple linear regression model y_{i} = \beta_{0} + x_{i}\beta_{1} + u_{i}, where \beta_{0}, \beta_{1} are the population regression coefficients. That is, \beta = E(X_{i}X_{i}')^{-1} E(X_{i}y_{i}), where X_{i}'=[1 \   x_{i}]. Then, u_{i} is defined as y_{i} - \beta_{0} - x_{i}\beta_{1}. Note that u_{i} is by construction uncorrelated with x_{i}. This is a property of regression, and is easily verified by looking at the first-order conditions of the least-squares minimization problem.
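The first-order conditions mentioned above can be written out as a short sketch. Here b is a generic candidate coefficient vector (a symbol introduced only for this derivation), u_{i} = y_{i} - X_{i}'\beta, and E(X_{i}X_{i}') is assumed to be invertible.

```latex
\begin{aligned}
\beta &= \arg\min_{b} \; E\left[(y_{i} - X_{i}'b)^{2}\right] \\
0 &= -2\,E\left[X_{i}(y_{i} - X_{i}'\beta)\right] = -2\,E\left[X_{i}u_{i}\right] \\
E(X_{i}X_{i}')\,\beta &= E(X_{i}y_{i})
\quad\Longrightarrow\quad
\beta = E(X_{i}X_{i}')^{-1}E(X_{i}y_{i})
\end{aligned}
```

Because X_{i}'=[1 \   x_{i}], the condition E[X_{i}u_{i}]=0 stacks two equations: E[u_{i}]=0 and E[x_{i}u_{i}]=0. Together they imply Cov(x_{i},u_{i})=0, which is the "by construction" claim.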

This may set off alarm bells. Isn’t the main concern when performing regression analysis that there may be a correlation between x_{i} and u_{i}? Yes, this is a concern. However, in that case you are thinking of x_{i} and the error term from a structural equation, that is, one that represents a causal relationship. Suppose the structural equation is y_{i} = \gamma_{0} + \gamma_{1}x_{i} + \epsilon_{i}, where \gamma_{1} is the causal effect of x on y. Because this is not a regression (i.e. \gamma is not a regression coefficient), there is no restriction on the relationship between x_{i} and \epsilon_{i}. In most econometrics texts, regression is introduced in the context of a causal model, and thus E[\epsilon_{i}x_{i}]=0 is stated as an assumption. Then, when this assumption does not hold, we say that the regression estimates are “not consistent” estimators of the true parameters. Really, regression estimates are always consistent: they are consistent for the population regression coefficients. When the error term of the causal model is uncorrelated with x, the structural equation and the population regression are identical (i.e. \beta=\gamma). When the assumption fails, it is not that regression gives you inconsistent estimates; it is simply that the population regression you are estimating is not the same as the causal relationship you are interested in.
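A small simulation can make this distinction concrete. The sketch below is only an illustration under assumed values: the parameters (\gamma_{0}=1, \gamma_{1}=2), the normal distributions, and the helper names ols_simple and simulate are all hypothetical and not from the source. It builds \epsilon_{i} to be correlated with x_{i}, so the OLS slope should settle near \beta_{1} = \gamma_{1} + Cov(x_{i},\epsilon_{i})/Var(x_{i}) = 2.5 rather than \gamma_{1} = 2, while the regression residual stays uncorrelated with x_{i} by construction.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate(n=100_000, gamma0=1.0, gamma1=2.0, rho=0.5, seed=0):
    """Simulate a structural equation whose error is correlated with x.

    All parameter values are assumptions chosen for illustration.
    Returns (b0, b1, mean of x*u, mean of x*eps).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Structural error: Cov(x, eps) = rho because Var(x) = 1.
    eps = [rho * xi + rng.gauss(0, 1) for xi in x]
    y = [gamma0 + gamma1 * xi + e for xi, e in zip(x, eps)]

    b0, b1 = ols_simple(x, y)
    # Regression residual u, defined relative to the fitted line.
    u = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    mean_x_u = sum(xi * ui for xi, ui in zip(x, u)) / n
    mean_x_eps = sum(xi * e for xi, e in zip(x, eps)) / n
    return b0, b1, mean_x_u, mean_x_eps


if __name__ == "__main__":
    b0, b1, mean_x_u, mean_x_eps = simulate()
    print("OLS slope:", b1)                 # close to beta_1 = 2.5, not gamma_1 = 2
    print("mean of x*u:", mean_x_u)         # zero up to floating-point error
    print("mean of x*eps:", mean_x_eps)     # close to rho = 0.5
```

Calling simulate(rho=0.0) makes the structural error uncorrelated with x, and the OLS slope then lands near \gamma_{1}, which is the \beta=\gamma case.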

Okay, that detour gives a lot of insight into how to interpret a regression coefficient. It’s worth repeating that regression is purely a statistical relationship; it does not (necessarily) represent any structural or causal relationship between y and x. Regression is a mechanical exercise and can be applied to any set of variables. So how should we interpret \beta_{1}? First, let’s not assume anything about what the structural model actually is. If E[y|x] is in fact a linear function, then X_{i}'\beta will be this conditional expectation function, and thus \beta_{1} is the marginal effect of x on E[y|x]. Or, loosely speaking, the “marginal effect on the average”. This is how you should interpret regression coefficients when you make no assumption concerning the underlying causal model. (The only assumption I made was that E[y|x] is linear, which does not concern the causal model. If E[y|x] is in fact nonlinear, then X_{i}'\beta provides the best linear approximation to it.)
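To illustrate the "best linear approximation" point, here is a minimal sketch under assumptions that are not in the source: x is uniform on (0, 1) and E[y|x] = x^{2}, with normal noise added. The helper names ols_simple and linear_fit_to_quadratic_cef are hypothetical. For this setup the population regression works out to an intercept of -1/6 and a slope of 1, since Cov(x, x^{2}) = Var(x) = 1/12. Regressing the noisy y on x and regressing the conditional expectation itself on x should both recover roughly that line.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def linear_fit_to_quadratic_cef(n=100_000, noise_sd=0.5, seed=0):
    """Fit a line when the true conditional expectation is x**2.

    The distribution of x, the CEF, and noise_sd are illustrative assumptions.
    Returns ((b0, b1) from y on x, (c0, c1) from E[y|x] on x).
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]           # x ~ Uniform(0, 1)
    cef = [xi ** 2 for xi in x]                    # nonlinear E[y|x]
    y = [m + rng.gauss(0, noise_sd) for m in cef]  # observed outcome
    return ols_simple(x, y), ols_simple(x, cef)


if __name__ == "__main__":
    (b0, b1), (c0, c1) = linear_fit_to_quadratic_cef()
    print("y on x:      intercept", b0, "slope", b1)
    print("E[y|x] on x: intercept", c0, "slope", c1)
    # Both should be near the population values: intercept -1/6, slope 1.
```

The two fits agree up to sampling noise because the noise in y is mean-zero given x, so the regression only "sees" the conditional expectation.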

Now, suppose the true structural equation is the one I specified earlier (and that E[x_{i}\epsilon_{i}]=0, so that \beta=\gamma). Then our OLS estimate of \beta_{1} can be interpreted as the causal effect of x on y (“marginal effect”). So, there are times when a regression estimate can be thought of as a marginal effect of x on y; namely, when the structural relationship is linear, the causal effect is constant (the same for all individuals), and x is uncorrelated with all other variables that affect y.

I think introducing regression in a causal framework does a great disservice to students. It gives students the impression that causality and regression are intrinsically linked to each other, and clouds the fact that regression is a purely statistical exercise.

I guess I never made it to the case where regression estimates can be thought of as “average marginal effects”. This requires thinking about a structural model where the causal effect differs across individuals, and can be saved for a later post!

Source: Mostly Harmless Econometrics (Angrist and Pischke 2009)



About a month ago, professional golfer Dawie Van Der Walt took to Twitter to rip on 2003 Masters Champion Mike Weir for taking up valuable (and limited) space in PGA Tour fields. Because of Mike’s victory in 2003, as well as his injuries over the past 10 years, he is still exempt, or sponsored, into many tournaments over the course of the season. Van Der Walt believes Mike is wasting time and taking away spots from younger, more talented individuals who could really use the opportunity to play on golf’s biggest stage.

Well, I am here to take a look at Mike’s game, particularly his driving (or lack of), an area of the game where many think he may have developed the yips, or as Ernie Els calls them… the “heebie-jeebies.”


Below, I made a plot of all the drives hit on tour this year (including the wrap-around season). The x-axis is the distance relative to the center of the fairway. For example, a value of negative 5 means you were 5 yards left of the middle of the fairway. I separated Mike’s drives from the rest of the tour, as well as Rory McIlroy’s, for reference.

Data provided by PGA Tour - ShotLink.

In all seriousness, I was surprised at how good this looked for Weir. However, when your eyes do make it over to the vertical axis, it does get a little frightening. And considering how short Mike is hitting it, he does not seem to be making up for it in the accuracy department.

Because the above graph is so dense, it can be hard to tell where the averages lie, so I made some summary statistics before we give Mike the final verdict.

[Image: summary statistics for Weir, McIlroy, and the rest of the tour]

I think this says it all. He’s hitting it significantly more crooked than both Rory and the rest of the tour, and he is hitting it well short of even the tour average.

…Being Canadian, this is tough for me to say, but… #HangItUpMike