The policing example at the end of Chapter 3 confused me a bit. Is conditioning on the stop wrong only because it may cause sampling bias? I may not be fully grasping the concept of a collider.
Since minority has a direct effect on force in the DAG, why would conditioning on stop affect the ability to isolate this effect?
source: Causal Inference: The Mixtape - 3 Directed Acyclic Graphs
It helps me to look at the pathways from D to Y. According to this DAG, there are four:
D → Y
D ← X → Y
D → M → Y
D → M ← U → Y
(Fryer controls for X and closes that backdoor through his research design.)
So when we look at the causal pathway D → M → Y: if we were to regress Y onto D, the estimate would capture the discrimination in both the stop and the use of force. The junction D → M ← U is a collider, so it blocks the non-causal pathway D → M ← U → Y.
Conditioning on M (the stop) opens up this pathway. The only way to condition on the stop and still keep that pathway closed would be to condition on both M (the stop) and U (suspicion). But U is unobserved and therefore cannot be conditioned on. So, according to this DAG, conditioning on the stop without also conditioning on suspicion introduces a spurious correlation that biases any attempt to estimate the causal effect. Suspicion affects both M and Y and is an unaccounted-for common cause, M ← U → Y. Conditioning on the stop reduces the DAG to just the M - U - Y triangle.
From a conceptual point of view, conditioning on the stop to estimate the effect on use of force ignores the selection bias introduced by the discrimination in the stop itself.
That is what the coding examples illustrate really well. You can hard-code the bias and then use these techniques to demonstrate how they can produce the wrong answer.
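A minimal sketch of that idea for the policing DAG, not the book's own code. Every coefficient, the threshold, and the helper names `simulate_policing` and `ols` are made up for illustration. The direct effect of D on Y is hard-coded to 0.5, and U is drawn independently of D. The sketch then compares what you get among stopped people only, with and without U as a control.

```python
import numpy as np


def simulate_policing(n=200_000, seed=0):
    """Draw data from D -> M <- U -> Y plus D -> Y and M -> Y (hypothetical numbers)."""
    rng = np.random.default_rng(seed)
    d = rng.binomial(1, 0.5, n)   # D: minority status
    u = rng.normal(0, 1, n)       # U: suspicion, independent of D by construction
    # M: a stop occurs when a latent score crosses a threshold; the 1.0 on d is
    # the hard-coded discrimination in who gets stopped.
    m = (1.0 * d + 1.0 * u + rng.normal(0, 1, n) > 1.0).astype(int)
    # Y: force, with a hard-coded direct effect of D equal to 0.5.
    y = 0.5 * d + 2.0 * m + 1.0 * u + rng.normal(0, 1, n)
    return d, u, m, y


def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


d, u, m, y = simulate_policing()
stopped = m == 1  # conditioning on the collider M

corr_all = np.corrcoef(d, u)[0, 1]
corr_stopped = np.corrcoef(d[stopped], u[stopped])[0, 1]
naive = ols(y[stopped], d[stopped])[1]                 # Y ~ D among stops
adjusted = ols(y[stopped], d[stopped], u[stopped])[1]  # Y ~ D + U among stops

print(f"corr(D, U), everyone:      {corr_all:.3f}")
print(f"corr(D, U), stopped only:  {corr_stopped:.3f}")
print(f"Y ~ D among stops:         {naive:.3f}")
print(f"Y ~ D + U among stops:     {adjusted:.3f}")
```

In this setup D and U are uncorrelated in the full sample but negatively correlated among stops, because minorities are stopped at lower levels of suspicion. That pulls the stops-only estimate below 0.5. Adding U as a control brings it back, which is only possible here because the simulation observes U.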
I suppose you could argue about whether the variable U is valid in this DAG. But since the point of the study is in some way to quantify U or understand its quality, ignoring or excluding it would seem to make the DAG incomplete.
I suppose another question to ask is what the difference is between Discrimination and Suspicion. Are those variables independent of each other? Do they need to be? I suppose the whole point of this DAG is to point out that there is inherent, unobservable Suspicion between the Stop and the Use of Force. It is not observable, and since it cannot be controlled for, any discrimination inherent in Suspicion cannot be controlled for either. It is unknown and therefore disrupts any attempt to measure causality along M → Y.
Thanks for the response. It's starting to make some more sense. However, when I try to run some simulations, I am not able to recover the parameter values.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats


def collider_and_confounder(size):
    """Create a collider and confounder example"""
    # Is confounder additive or multiplicative
    unit_normal = stats.norm(0, 1)
    d = unit_normal.rvs(size)
    z = unit_normal.rvs(size)
    # For some reason need to add the coefficient here
    x = 8.9*d + 2.34*z
    y = 3.567*d + 1.234*z + 2.456*x + 21.123
    collider_and_confounder_df = pd.DataFrame({"x": x, "d": d, "z": z, "y": y})
    return collider_and_confounder_df


collider_and_confounder_df = collider_and_confounder(10000)
If I am not mistaken, x is a collider in this instance through d → x ← z → y.
So if I condition on x and z, I should have no open paths, correct?
Thanks for sharing @bwalters. Could you share the regression you're using? Then we can get a sense of what may be going wrong.
This is the regression
mod = smf.ols(formula='y ~ d+z+x', data=collider_and_confounder_df)
print(mod.fit().params)
And this is the output
Intercept 21.123000
d 0.224747
z 0.355250
x 2.831534
dtype: float64
Perfect multicollinearity is at play in your example. Because x is determined entirely by d and z, the regression cannot separate the three coefficients when all of them are included. If you substitute x = 8.9d + 2.34z into your population model, what you're effectively trying to estimate using regression is the equation y = 25.4254d + 6.98104z + 21.123. By performing the same substitution with your fitted equation, you'll see that that's what you've ended up with.
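The substitution can be checked with a few lines of arithmetic. This is only a sketch: `reduced_form` is a hypothetical helper name, and the fitted numbers are copied from the output posted above.

```python
# Coefficients of the equation x = 8.9*d + 2.34*z from the posted simulation.
X_ON_D, X_ON_Z = 8.9, 2.34


def reduced_form(coef_d, coef_z, coef_x):
    """Substitute x = 8.9*d + 2.34*z and return the implied coefficients on d and z."""
    return coef_d + coef_x * X_ON_D, coef_z + coef_x * X_ON_Z


# Population model: y = 3.567*d + 1.234*z + 2.456*x + 21.123
population = reduced_form(3.567, 1.234, 2.456)

# Fitted model, using the coefficients from the regression output in the thread
fitted = reduced_form(0.224747, 0.355250, 2.831534)

print("population:", population)
print("fitted:    ", fitted)
```

Both pairs agree to about four decimal places, so the fit reproduces the combined equation even though the individual coefficients on d, z, and x are not identified.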
If you include some disturbance in your equation for x (draws from the standard normal should work), you should be able to recover the parameters you've specified. You can then explore collider bias by estimating y ~ d to get the total effect of d; y ~ z to get the total effect of z; y ~ d + z to get both total effects at once; y ~ d + x to open up the path d → x ← z → y; y ~ z + x to open up the path z → x ← d → y; and finally, y ~ d + x + z.
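Here is a rough sketch of that experiment. It assumes a standard-normal disturbance on x and leaves everything else as in the original simulation. To keep it dependency-light it uses `numpy.linalg.lstsq` in place of the statsmodels call from earlier in the thread. The names `simulate` and `ols`, the seed, and the sample size are illustrative choices.

```python
import numpy as np


def simulate(size, seed=0):
    """Same structure as the posted simulation, plus noise in the x equation."""
    rng = np.random.default_rng(seed)
    d = rng.normal(0, 1, size)
    z = rng.normal(0, 1, size)
    # The added disturbance means x is no longer an exact function of d and z.
    x = 8.9 * d + 2.34 * z + rng.normal(0, 1, size)
    y = 3.567 * d + 1.234 * z + 2.456 * x + 21.123
    return {"d": d, "z": z, "x": x, "y": y}


def ols(data, outcome, regressors):
    """Least-squares fit with an intercept; returns a name -> coefficient dict."""
    columns = [np.ones(len(data[outcome]))] + [data[name] for name in regressors]
    coef, *_ = np.linalg.lstsq(np.column_stack(columns), data[outcome], rcond=None)
    return dict(zip(["Intercept", *regressors], coef))


data = simulate(100_000)

specifications = [
    ("d",),           # total effect of d
    ("z",),           # total effect of z
    ("d", "z"),       # both total effects
    ("d", "x"),       # conditions on x without z: opens d -> x <- z -> y
    ("z", "x"),       # conditions on x without d: opens z -> x <- d -> y
    ("d", "x", "z"),  # full model: recovers the structural coefficients
]

for regressors in specifications:
    fit = ols(data, "y", regressors)
    summary = ", ".join(f"{name}={value:.3f}" for name, value in fit.items())
    print(f"y ~ {' + '.join(regressors)}: {summary}")
```

With the disturbance in place, the full model returns 3.567, 1.234, and 2.456, while y ~ d lands near the total effect 25.4254. In y ~ d + x the coefficient on d ends up far from both numbers, which is the collider bias the post describes.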
Nice catch @jahloy! Thanks for helping out