Friday 29th November 2024, 10:00–10:15 (Australia/Melbourne), Sutherland Room
Nonresponse is a critical issue for data quality in panel surveys. Many researchers have demonstrated the potential of machine learning models to predict nonresponse, which would allow survey managers to intervene pre-emptively with low-propensity participants. Typically, modelers fit machine learning models to panel data accumulated over several waves and report which algorithm and variables yielded the best predictive results. However, these studies do not tell the manager of a yet-to-commence panel survey which technique is best for their own context (e.g., annual vs. quarterly waves, household vs. individual sampling). Studies have shown mixed results regarding the performance of nonresponse prediction in different panel contexts, and there is considerable variation in which prediction technique (e.g., algorithm and variables) performs best across survey settings. It is thus unclear under which conditions predictive models successfully identify nonresponders and which techniques are best suited to which contexts.
To address the question of cross-panel generalizability, we compare machine learning-based nonresponse prediction across five panel surveys of the general German population: the Socio-Economic Panel (SOEP), the German Internet Panel (GIP), the GESIS Panel, the Mannheim Corona Study (MCS), and the Family Demographic Panel (FREDA). We evaluate how differences in survey design and in sample composition (e.g., average sample age and income) affect the characteristics of the best-performing machine learning model (e.g., the best algorithm, accuracy scores, and the most predictive variables). We compare which (types of) variables and algorithms are the most predictive across these contexts, and we evaluate how well techniques from one survey transfer to a different survey context. Our analysis shows the extent to which practitioners can expect the modeling techniques of one survey to generalize to their own context and the factors that might inhibit generalizability.
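The cross-survey transfer evaluation described above can be pictured as a "train on panel A, score on panel B" matrix. The following is a minimal illustrative sketch of that idea only, not the authors' actual pipeline: the synthetic data, the choice of gradient boosting, the feature set, and the use of AUC as the metric are all assumptions made for the example.

```python
# Minimal sketch of cross-panel transfer evaluation (illustrative only).
# Each "panel" here is synthetic: wave-1 covariates plus a wave-2 nonresponse flag.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def synthetic_panel(n, shift=0.0):
    """Placeholder for one survey's data; 'shift' mimics differing sample composition."""
    X = rng.normal(loc=shift, size=(n, 10))            # e.g. age, income, paradata
    p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))   # latent dropout propensity
    y = rng.binomial(1, p)                              # 1 = nonresponse in the next wave
    return X, y

# Hypothetical stand-ins for the surveys named in the abstract.
panels = {name: synthetic_panel(2000, shift=s)
          for name, s in [("SOEP", 0.0), ("GIP", 0.2), ("GESIS", -0.1)]}

# Transfer matrix: fit on the row panel, evaluate AUC on each column panel.
for train_name, (X_tr, y_tr) in panels.items():
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    scores = {test_name: roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
              for test_name, (X_te, y_te) in panels.items()}
    print(train_name, {k: round(v, 2) for k, v in scores.items()})
```

Off-diagonal entries of such a matrix indicate how much predictive accuracy is lost when a model built for one survey context is applied to another.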
John ‘Jack’ Collins is a PhD student in Sociology at the Graduate School of Economic and Social Sciences. He holds a Bachelor of Sociology with Honours from the Australian National University and a Master’s degree in Data Science from James Cook University. His Master’s project concerned predictive modelling of student attrition from sub-tertiary courses in Australia, and during his Master’s studies he also assisted in research projects on social attitudes and voting behaviour in Australia. Before starting his PhD, Jack was a Senior IT Consultant specialising in data engineering, analytics, and software development. Jack is interested in applying data science and IT to sociological research, particularly with regard to machine learning, analytics, and web applications.