8th Biennial ACSPRI Social Science Methodology Conference

Combining qualitative data and causal machine learning for better estimation
11-23, 17:20–17:35 (Australia/Melbourne), Zoom Breakout Room 2

Double Machine Learning (DML) is a causal machine learning method that promises substantial benefits when estimating average treatment effects in observational data, particularly where existing theory is too weak to identify controls or justify a quasi-experimental approach. DML benefits from much of the power and flexibility of predictive models, while also giving the unbiased causal estimates of traditional regression approaches. However, in practice it often involves relaxing causal identification assumptions, trusting that algorithms will correctly select controls in high-dimensional datasets.
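To make the method concrete, here is a minimal sketch of the partially linear DML estimator on simulated data: nuisance functions are fitted by cross-fitting, then the treatment effect is recovered by residual-on-residual regression. The data-generating process, learners, and variable names are illustrative assumptions, not taken from the studies described here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Simulated data with a known treatment effect (theta_true = 2.0).
rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)          # confounded treatment
theta_true = 2.0
y = theta_true * d + X[:, 0] ** 2 + X[:, 1] + rng.normal(size=n)

def dml_plr(y, d, X, n_folds=5):
    """Cross-fitted partially linear DML: residualise y and d on X
    with flexible learners, then regress residual on residual."""
    res_y = np.zeros_like(y)
    res_d = np.zeros_like(d)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        m_hat = RandomForestRegressor(n_estimators=50, random_state=0)
        g_hat = RandomForestRegressor(n_estimators=50, random_state=0)
        m_hat.fit(X[train], d[train])   # E[d | X]
        g_hat.fit(X[train], y[train])   # E[y | X]
        res_d[test] = d[test] - m_hat.predict(X[test])
        res_y[test] = y[test] - g_hat.predict(X[test])
    return float(np.sum(res_d * res_y) / np.sum(res_d ** 2))

theta_hat = dml_plr(y, d, X)  # should land near theta_true
```

The cross-fitting step is what lets flexible, potentially overfitting learners be used for the nuisance functions without biasing the final effect estimate.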

Explicitly constraining model fitting with a causal graph (a diagram laying out which variables cause changes in which others) has been suggested as one solution for better causal identification, but the benefits of this approach are yet to be established, and a proper methodology for constructing these graphs has not been laid out (excluding data-driven causal discovery, which has its own serious drawbacks).
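In practice, "constraining with a graph" can be as simple as letting the graph dictate the adjustment set rather than handing the algorithm every available column. The sketch below uses a hypothetical graph for a returns-to-education setting (the node names are illustrative, not from the studies) and applies the fact that, in a DAG, adjusting for the treatment's parents blocks all backdoor paths.

```python
# A causal graph as a parent-list dictionary: node -> list of direct causes.
# This toy graph is a hypothetical example, not one elicited in the studies.
graph = {
    "ability": [],
    "parental_ses": [],
    "education": ["ability", "parental_ses"],   # treatment
    "wage": ["education", "ability", "parental_ses"],  # outcome
}

def adjustment_set(g, treatment):
    """Return the treatment's parents as controls. In a DAG a node's
    parents are never its descendants, so this set always satisfies
    the backdoor criterion (though it need not be minimal)."""
    return set(g[treatment])

controls = adjustment_set(graph, "education")
# These controls, rather than all observed columns, would then be
# passed to the DML nuisance models.
```

The point of the constraint is that the estimator only sees variables the graph licenses as confounders, instead of relying on high-dimensional selection to sort good controls from bad ones (e.g. colliders or mediators).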

This presentation looks at where DML can be useful in the social sciences and where qualitative data might be drawn in to build causal graphs and improve inference. It covers two studies: one on the returns to education and one on the effect of private schooling on standardised testing performance. In both cases, causal graphs were constructed by interviewees with varying levels of background knowledge, and models fitted under these constraints were compared with unconstrained models. The returns to education study used instrumental variables to obtain yardstick causal estimates for comparison; the private schooling study used semi-synthetic data to establish a ground truth. While both cases have good existing theory, relying only on interviews tests whether pragmatic causal assumptions can be built even when theory is poor (as is often the case when DML is used).

In both studies, constrained and unconstrained DML estimation performed roughly equally well on large samples, though unconstrained models performed worse on small subsamples (n = 1000). Importantly, even a basic level of background knowledge outperformed unconstrained DML in these cases, and combining multiple graphs into one further reduced bias.

Unconstrained DML seems to be a useful approach where identification is achieved through selection-on-observables and the sample size is large. However, a mixed-methods approach in which qualitative data is used to shape causal assumptions may improve estimation where large samples are not available.

Recording link: https://acspri-org-au.zoom.us/rec/share/fe80_c8_scf5yQJEY_Vv6vJ6R7XFjJUKwZnkpSbPOus_gP1IIKiSQO8FF_LSLDiZ.j0lwiy1bIA1ehAAO?startTime=1669184550000


Patrick Rehill is a PhD Candidate and Research Officer at the ANU Centre for Social Research and Methods. His research looks at applying machine learning methods to policy evaluation. His professional background is as a quantitative researcher in public policy and the not-for-profit sector. He holds a Bachelor of Arts (Honours) and a Master of Public Policy and Management, both from the University of Melbourne.