Welcome to the first livestream, which will take place June 5th at 8am PST; details and link below!
In the async-first spirit of this club, each week you’ll have a chance to add questions to a topic like this one, which will be posted a couple of days before the livestream.
Here’s how it works:
1. I post a topic before the livestream (as I’ve done now)
2. You ask whatever questions you want answered
3. We all indicate question popularity by liking the questions we want answered
4. I do my best to answer them during the livestream
With that, ask away, and I hope to see you all on Sunday when we go through them.
The intro is reference-heavy if you aren’t in econ, so I’m putting together some notes and pasting the parts I highlighted.
Thoughts for discussion
Cunningham calls out correlation != causation and alludes to causal assumptions; causal models require stronger assumptions than statistical models do. What is it we are trying to do that merits the extra assumptions?
Statistics is often taught as a “toolbox” or a set of “recipes”; the ToC and the list of topics in the intro feel like that, a bit disconnected. But Cunningham also alludes to / foreshadows some epistemology that maybe ties it all together?
Notes
Intro / Dramatis personae
Gary Becker
Student of Milton Friedman, “Chicago School” (anti-Keynesian), not without controversy
Notorious crank about eugenics and the (safety of) tobacco
Trygve Haavelmo
Norwegian Economist, cited by Judea Pearl as major influence
Jeffrey Wooldridge
Still living economist
Wrote the standard text on econometrics for panel data
Many contributions to Stata
Very likable on Twitter
Guido Imbens
Nobel/econ winner
With Rubin, co-wrote / led adoption of the IV approach
Susan Athey
Stanford Econ
side note - married to Guido Imbens
Cunningham lays out the topics (bolded) to expect:
But what books out there do I like? Which ones have inspired this book? And why don’t I just keep using them? For my classes, I mainly relied on S. L. Morgan and Winship (2014), Angrist and Pischke (2009), as well as a library of theoretical and empirical articles. These books are in my opinion definitive classics. But they didn’t satisfy my needs, and as a result, I was constantly jumping between material. Other books were awesome but not quite right for me either. Guido W. Imbens and Rubin (2015) cover the potential outcomes model, experimental design, and matching and instrumental variables, but not directed acyclic graphical models (DAGs), regression discontinuity, panel data, or synthetic control. S. L. Morgan and Winship (2014) cover DAGs, the potential outcomes model, and instrumental variables, but have too light a touch on regression discontinuity and panel data for my tastes. They also don’t cover synthetic control, which has been called the most important innovation in causal inference of the last 15 years by Athey and Imbens (2017a). Angrist and Pischke (2009) is very close to what I need but does not include anything on synthetic control or on the graphical models that I find so critically useful. But maybe most importantly, Guido W. Imbens and Rubin (2015), Angrist and Pischke (2009), and S. L. Morgan and Winship (2014) do not provide any practical programming guidance, and I believe it is in replication and coding that we gain knowledge in these areas.
When we make a claim about causation, it’s not so we can hide out from the world but so we can intervene in it. A false positive means approving drugs that have no effect, or imposing regulations that make no difference, or wasting money in schemes to limit unemployment. As science grows more powerful and government more technocratic, the stakes of correlation—of counterfeit relationships and bogus findings—grow ever larger. The false positive is now more onerous than it’s ever been. And all we have to fight it is a catchphrase.
Economic theory tells us we should be suspicious of correlations found in observational data. In observational data, correlations are almost certainly not reflecting a causal relationship because the variables were endogenously chosen by people who were making decisions they thought were best.
And we see this problem reflected in the potential outcomes model itself: a correlation, in order to be a measure of a causal effect, must be based on a choice that was made independent of the potential outcomes under consideration. Yet if the person is making some choice based on what she thinks is best, then it necessarily is based on potential outcomes, and the correlation does not remotely satisfy the conditions we need in order to say it is causal.
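The selection problem in that passage can be made concrete with a small simulation. This is my own sketch, not from the book: agents observe their potential outcomes and choose treatment only when it benefits them, so the naive treated-vs-untreated comparison no longer equals the average treatment effect.

```python
# Hedged sketch (my construction): selection on potential outcomes
# breaks the naive comparison of means.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

y0 = rng.normal(0, 1, n)           # potential outcome without treatment
y1 = y0 + rng.normal(0.5, 1, n)    # potential outcome with treatment (true ATE = 0.5)

# Optimal behavior: each person takes the treatment only if it helps them,
# so the choice depends directly on the potential outcomes.
d = (y1 > y0).astype(int)
y = np.where(d == 1, y1, y0)       # observed outcome (switching equation)

ate = (y1 - y0).mean()             # true average treatment effect
naive = y[d == 1].mean() - y[d == 0].mean()  # what a raw correlation sees
print(f"true ATE:  {ate:.2f}")
print(f"naive gap: {naive:.2f}")
```

The naive gap overstates the effect because the treated group is exactly the subpopulation with the largest gains, which is the "choice based on potential outcomes" problem in the quote.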
Now we are veering into the realm of epistemology.
Identifying causal effects involves assumptions, but it also requires a particular kind of belief about the work of scientists.
Example: Identifying Price Elasticity of Demand
we would like to develop and test a formal economic model that describes mathematically a certain relationship, behavior, or process of interest. Those models are valuable insofar as they both describe the phenomena of interest and make falsifiable (testable) predictions. A prediction is falsifiable insofar as we can evaluate, and potentially reject, the prediction with data.
Comparative statics are theoretical descriptions of causal effects contained within the model. These kinds of comparative statics are always based on the idea of ceteris paribus—or “all else constant.” When we are trying to describe the causal effect of some intervention, for instance, we are always assuming that the other relevant variables in the model are not changing.
If they were changing, then they would be correlated with the variable of interest and it would confound our estimation.
Foreshadowing the content of this mixtape, we need two things to estimate price elasticity of demand. First, we need numerous rows of data on price and quantity. Second, we need for the variation in price in our imaginary data set to be independent of u. We call this kind of independence exogeneity. Without both, we cannot recover the price elasticity of demand, and therefore any decision that requires that information will be based on stabs in the dark.
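The two requirements in that passage are easy to check in a simulation. A minimal sketch (my own example, not from the book): regress log quantity on log price, first with price variation independent of the demand shock u (exogenous), then with price that partly responds to u (endogenous).

```python
# Hedged sketch: OLS recovers the price elasticity only when price
# variation is exogenous (independent of the error term u).
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
beta = -2.0                          # true price elasticity of demand
u = rng.normal(0, 0.5, n)            # unobserved demand shock

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x)

# Exogenous price: variation independent of u, so OLS recovers beta.
log_p = rng.normal(0, 0.3, n)
slope_exog = ols_slope(log_p, 1.0 + beta * log_p + u)

# Endogenous price: sellers partly raise price when demand (u) is high.
log_p = rng.normal(0, 0.3, n) + 0.4 * u
slope_endog = ols_slope(log_p, 1.0 + beta * log_p + u)

print(f"exogenous price:  {slope_exog:.2f}")   # close to -2.0
print(f"endogenous price: {slope_endog:.2f}")  # biased toward zero
```

Same data-generating slope in both cases; only the source of price variation changes, and that alone makes or breaks the estimate.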
Conclusion
In conclusion, simply finding an association between two variables might be suggestive of a causal effect, but it also might not. Correlation doesn’t mean causation unless key assumptions hold. Before we start digging into the causal methodologies themselves, though, I need to lay down a foundation in statistics and regression modeling.
Thank you @nfultz for this detailed overview, which is certainly correlated with the spirit of this book club.
Here’s my notes as well https://github.com/canyon289/causal_inf_bookclub/blob/main/chapters/01_Introduction_Notes.md
It’s interesting to see both the similarities and differences. The biggest similarity: I also noted the use of economics terms without much introduction. I found myself starting a glossary, as I have a feeling these two words are going to be used again, and many others are coming.
Little experiences caused big changes in the direction of Scott’s life
Same thing happened to me!
So how would I define causal inference?
Causal inference is the leveraging of theory and deep knowledge of institutional details to estimate the impact of events and choices on a given outcome of interest
Choices are endogenous, and therefore since they are, the correlations between those choices and outcomes in the aggregate will rarely, if ever, represent a causal effect
The actual methods employed in causal designs are always deeply dependent on theory and local institutional knowledge
That without prior knowledge, estimated causal effects are rarely, if ever, believable. Prior knowledge is required in order to justify any claim of a causal finding
Human beings engaging in optimal behavior are the main reason correlations almost never reveal causal relationships, because rarely are human beings acting randomly. And as we will see, it is the presence of randomness that is crucial for identifying causal effect.
Does absence of randomness imply that the relationship will not be causal?
Human beings engaging in optimal behavior are the main reason correlations almost never reveal causal relationships, because rarely are human beings acting randomly. And as we will see, it is the presence of randomness that is crucial for identifying causal effect.
I thought this – along with the supply/demand example – was a really compelling motivation. But, I’ve been chewing on it a bit and I’d love to think through whether it’s necessarily motivation for causal inference or whether it’s motivation for sophisticated modeling strategies more generally.
I often use generative or mechanistic models in my work (which is not at all economics). These are still statistical (i.e., fit to data somehow), but they’re not part of this Causal Inference school of thought (closer to McElreath’s “full luxury Bayes”). Are these models also motivated by this quote? Or is there some kind of a subtle distinction here?
Thank you for this initial discussion all. The term “endogenous” felt like it came out of left field for me as well. The connection/assumption I made was thinking in terms of my prior exposure to DAGs, where he means a variable is influenced (i.e., endogenous, determined within the DAG) by some other upstream variable. Therefore, it’s hard to see the influence (cause) of the sailor’s oar itself if we don’t recognize that the wind direction can influence it. If we were to hypothetically simulate multiple random wind directions, however, we would be able to see it.
Thanks everyone for sharing their notes! Appreciated!
Since I have an ML background, the terms exogenous and endogenous were not so clear to me either. But after a bit of googling I found Exogenous and endogenous variables - Wikipedia.
I liked the short summary of the difference between them in the beginning of that article.
That article, however, also links to Endogeneity (econometrics) - Wikipedia, which explains a slightly different sense of endogeneity. For now I think the explanation in the first link better fits the Mixtape introduction. But IANAE (I Am Not An Econometrician).
Just to be clear, I’m not an economist, but I spent some time working through a simple supply and demand model a while back, drawing on David Freedman’s discussion of butter production in his book: https://www.amazon.de/-/en/David-Freedman/dp/0521743850
What helped make the notion clear for me was price endogeneity within a market equilibrium: in particular, if we have statistical models for supply and demand that both depend on price, we can show that price (at equilibrium) is a function of the error terms in the separate equations for supply and demand. This violates the simple assumptions of OLS estimation procedures.
So for estimating the impact of price in demand equations, it’s important to get the model “right” to avoid the corrupting influence of supply factors. You can see how these kinds of concerns would lead naturally in econometrics to a concern for data-generating processes and causal modelling, but the language is oddly keyed to the limits of the model estimation procedure rather than the causal “story”. Just my impression, but I think the different phrasings of these causal estimation problems are part of what makes the topic hard.
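That equilibrium argument can be simulated directly. A toy sketch (my own construction in the spirit of Freedman-style examples, not code from the book): the equilibrium price depends on both the demand and supply shocks, so OLS of quantity on price recovers neither curve, while a cost shifter that moves only supply serves as an instrument that recovers the demand slope.

```python
# Hedged sketch: simultaneous supply and demand make price endogenous;
# OLS is biased, an instrument (supply-only cost shifter z) is not.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

b_d, b_s = -1.0, 1.5               # true demand and supply slopes
u_d = rng.normal(0, 1, n)          # demand shock
u_s = rng.normal(0, 1, n)          # supply shock
z = rng.normal(0, 1, n)            # cost shifter: enters supply only

# demand: q = 10 + b_d * p + u_d;  supply: q = b_s * p - 2 * z + u_s
# Setting them equal gives the equilibrium price, a function of BOTH shocks:
p = (10 + u_d - u_s + 2 * z) / (b_s - b_d)
q = 10 + b_d * p + u_d

ols = np.cov(p, q)[0, 1] / np.var(p)          # mixes the two curves
iv = np.cov(z, q)[0, 1] / np.cov(z, p)[0, 1]  # isolates the demand slope
print(f"OLS slope: {ols:.2f}  IV slope: {iv:.2f}")
```

The IV ratio works here because z shifts price through supply but is independent of the demand error u_d, which is exactly the exogeneity condition from the chapter.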
**Causal inference is the leveraging of theory and deep knowledge of institutional details to estimate the impact of events and choices on a given outcome of interest**
I guess Cunningham’s use of “deep knowledge of institutional details” refers to the domain knowledge. Economists use the word “institutions” in a very general way.
In sum: each specific causal modelling strategy (IV, or diff-in-diff) relies on usually untestable assumptions, and that requires checking whether they hold in the particular context. Hence domain knowledge.