We began our Survey Statistics journey with this big mountain: not everyone may be in our sample (“unit nonresponse”). Beyond that mountain is another mountain: not everyone in our sample answers all survey questions (“item nonresponse”). Here “nonresponse” means both not being sampled or asked, as well as refusing to answer. All result in missing data.
For a visual, I like Figure 10.4 from Groves:
Multilevel Regression and Poststratification (MRP) aims to address unit nonresponse. Suppose we want to estimate E[Y], the population mean. But we only have Y for respondents. For example, suppose Y is voting Republican. And what if respondents are more or less Republican than the population? If we have population data on X, e.g. a bunch of demographic variables, then we can estimate E[Y|X] and aggregate: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of X, at least we fix that with some calibration.
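To make the aggregation concrete, here's a toy poststratification sketch (all numbers made up): estimate E[Y|X] within cells of a single categorical X, then reweight by the known population shares of X.

```python
import numpy as np

# Toy sketch: the sample over-represents cell 0 of X, so the naive
# mean of Y is biased; poststratifying by the population shares of X
# fixes the X mix. All shares and cell means below are made up.
rng = np.random.default_rng(0)

pop_share = np.array([0.2, 0.5, 0.3])      # known population P(X = x)
sample_share = np.array([0.5, 0.3, 0.2])   # distorted sample P(X = x)
cell_mean = np.array([0.7, 0.5, 0.3])      # true E[Y | X = x]

n = 100_000
x = rng.choice(3, size=n, p=sample_share)
y = rng.random(n) < cell_mean[x]           # Y ~ Bernoulli(E[Y | X = x])

naive = y.mean()                           # biased by the wrong X mix
# Aggregate with population shares: E[Y] = sum_x P(X = x) E[Y | X = x].
post = sum(pop_share[k] * y[x == k].mean() for k in range(3))
true_mean = pop_share @ cell_mean

print(naive, post, true_mean)
```

The naive mean lands near 0.56 here (the sample's X mix), while the poststratified estimate recovers the population value of 0.48.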
But what if some of the X are missing? From Bayesian Data Analysis p.451:
The paradigmatic setting for missing data imputation is regression, where we are interested in the model p(y|X, θ) but have missing values in the matrix X.
Andrew has blogged about MRP and item nonresponse, recommending one big joint model for Y and X. Or “construct some imputed datasets, and go on and do MRP with those.” More from Bayesian Data Analysis p.451:
First model X, y together…At this point, the imputer takes the surprising step of discarding the inferences about the parameters, keeping only the completed datasets Xs…
This line really helped me understand imputation. Especially the words “surprising step”. Because really, we go to all this trouble to model everything, and then… why aren’t we done? We’d be done if we really believed in this one big joint model. But maybe we want to be more careful, especially about how we model E[Y|X]. So we throw away some of our work and just keep the imputed Xs.
What’s more, we keep multiple versions of these imputed Xs, because we want to reflect our uncertainty about them. Then we combine these multiple versions of our analysis. For more about Multiple Imputation (MI) see, e.g. Stef van Buuren’s book.
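The combining step follows a standard recipe (Rubin's rules). Here's a minimal sketch with made-up per-dataset estimates and within-imputation variances:

```python
import numpy as np

# Rubin's rules sketch: pool one scalar estimate across M = 5 imputed
# datasets. The per-dataset estimates and variances are made up.
estimates = np.array([0.52, 0.49, 0.55, 0.50, 0.51])
variances = np.array([0.004, 0.005, 0.004, 0.006, 0.005])

M = len(estimates)
qbar = estimates.mean()            # pooled point estimate
w = variances.mean()               # within-imputation variance
b = estimates.var(ddof=1)          # between-imputation variance
total = w + (1 + 1 / M) * b        # total variance, per Rubin's rules

print(qbar, total)
```

The between-imputation term is what carries our uncertainty about the imputed Xs into the final standard error.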
Okay, so this sounds sensible! Implementation time. Here’s where I get stuck:
- Scale: You’ve got 1000s of X predictors (in 100s of batches), and 100,000s of survey responses. Everything may be missing.
- Cross-validation: Kuh et al 2023 say cross-validation may not be suitable for evaluating the MRP model for E[Y|X], but people do it (Wang & Gelman 2014). Jaeger et al. (2020) remind us to do imputation (which uses the Y) within each cross-validation replicate. They investigate whether we can get away with imputation without Y, as a step before cross-validation.
So we’ve got a scale problem, made even worse if we do imputation within cross-validation.
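To make the cross-validation point concrete, here's a minimal simulated sketch (all models and numbers made up): the imputation models are refit inside each replicate on the training fold only. Training rows may be imputed using Y, but held-out rows cannot, since Y is what we're predicting.

```python
import numpy as np

# Sketch: imputation refit within each CV replicate, never touching
# the held-out fold. Simulated linear data, 30% of X missing at random.
rng = np.random.default_rng(1)
n = 1_000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(size=n)
x_obs = np.where(rng.random(n) < 0.3, np.nan, x)

folds = np.arange(n) % 5
errors = []
for k in range(5):
    train, test = folds != k, folds == k
    m = ~np.isnan(x_obs) & train               # complete training rows
    # Imputation model with Y, E[X | Z, Y], fit on the training fold.
    A = np.column_stack([np.ones(m.sum()), z[m], y[m]])
    bzy, *_ = np.linalg.lstsq(A, x_obs[m], rcond=None)
    # Imputation model without Y, E[X | Z], for held-out rows.
    slope, intercept = np.polyfit(z[m], x_obs[m], 1)
    x_imp = np.where(np.isnan(x_obs),
                     np.where(train, bzy[0] + bzy[1] * z + bzy[2] * y,
                              intercept + slope * z),
                     x_obs)
    # Outcome model on the training fold, scored on the held-out fold.
    D = np.column_stack([np.ones(n), x_imp, z])
    beta, *_ = np.linalg.lstsq(D[train], y[train], rcond=None)
    errors.append(np.mean((y[test] - D[test] @ beta) ** 2))

print(np.mean(errors))
```

Note the cost: every fold refits every imputation model, which is exactly where the scale problem bites.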
Two recent papers in Statistical Methods in Medical Research look into getting away with single, deterministic imputation of missing Xs without using Y:
- D’Agostino McGowan et al. (2024): The “Why” behind including “Y” in your imputation model. See arXiv for access.
- Sisk et al. (2023): Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study.
Let:
- Z = observed covariates
- X = unobserved covariates
- Y = outcome
D’Agostino McGowan et al. (2024) look at continuous Y and linear models for E[Y|X,Z]. Sisk et al. (2023) look at binary Y and logistic models for E[Y|X,Z]. Both consider:
- deterministic imputations
  - with the outcome: Xhat(Z, Y), estimating E[X | Z, Y]
  - or without: Xhat(Z), estimating E[X | Z]
- random imputations
  - with the outcome: X ~ p(x | z, y) (This is the deluxe version of imputation that Andrew recommends.)
  - or without: X ~ p(x | z)
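Here's a small simulated sketch of the four variants, with linear working models for the imputation step (all numbers made up):

```python
import numpy as np

# Four imputation variants: deterministic vs random, with vs without
# the outcome Y. Simulated linear data, 30% of X missing at random.
rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(size=n)
miss = rng.random(n) < 0.3
obs = ~miss

def fit(design, target):
    """Least-squares coefficients and residual sd of a linear model."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta, np.std(target - design @ beta)

# Imputation model without Y: E[X | Z], fit on complete cases.
bz, sz = fit(np.column_stack([np.ones(obs.sum()), z[obs]]), x[obs])
# Imputation model with Y: E[X | Z, Y], fit on complete cases.
bzy, szy = fit(np.column_stack([np.ones(obs.sum()), z[obs], y[obs]]),
               x[obs])

mean_z = bz[0] + bz[1] * z[miss]                        # Xhat(Z)
mean_zy = bzy[0] + bzy[1] * z[miss] + bzy[2] * y[miss]  # Xhat(Z, Y)
imputations = {
    "deterministic, without Y": mean_z,
    "deterministic, with Y": mean_zy,
    "random, without Y": mean_z + sz * rng.normal(size=miss.sum()),
    "random, with Y": mean_zy + szy * rng.normal(size=miss.sum()),
}
for name, xm in imputations.items():
    # MSE against the simulated true X (known only in simulation).
    print(name, round(float(np.mean((xm - x[miss]) ** 2)), 3))
```

In this toy Gaussian setting the with-Y imputations track the true X more closely, which is exactly why dropping Y from the imputation model needs justification.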
Let’s see how their recommendation does with a linear MRP outcome model E[Y | Z, X] = b0 + b1 X + b2 Z + b3 X Z.
Suppose we have a perfect imputation model E[X | Z] and outcome model, then we’d have E[Y | Z, E[X | Z]] which is just E[Y | Z] (because me telling you Z is the same as me telling you Z and some function of Z).
Then we can iterate the expectation to get E[ E[Y | Z, X] | Z ] = b0 + b1 E[X | Z] + b2 Z + b3 E[X | Z] Z, getting back the parameters of our true MRP outcome model.
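A quick simulation sketch of this argument (all numbers made up). One wrinkle: E[X | Z] is made nonlinear in Z here, because an Xhat that is exactly linear in Z would be collinear with the intercept and Z in the outcome regression.

```python
import numpy as np

# Deterministic imputation with E[X | Z] (no Y) still recovers the
# coefficients of the linear outcome model with interaction.
rng = np.random.default_rng(3)
n = 200_000
b = np.array([1.0, 2.0, 0.5, -1.0])        # b0, b1, b2, b3

z = rng.normal(size=n)
x = 0.5 * z**2 + rng.normal(scale=0.5, size=n)   # E[X | Z] = 0.5 Z^2
y = b[0] + b[1] * x + b[2] * z + b[3] * x * z + rng.normal(size=n)
miss = rng.random(n) < 0.4                  # 40% of X missing at random

# Imputation model E[X | Z], fit on complete cases without Y.
coef = np.polyfit(z[~miss], x[~miss], 2)
x_imp = np.where(miss, np.polyval(coef, z), x)

# Outcome model on the deterministically completed data.
A = np.column_stack([np.ones(n), x_imp, z, x_imp * z])
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b_hat, 2))                   # close to b
```

The estimates land close to (b0, b1, b2, b3), matching the iterated-expectation argument for the linear case.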
But if the model is logistic, then this doesn’t quite go through. Indeed, Sisk et al. (2023) say they get “minimal bias”, unlike D’Agostino McGowan et al. (2024) who show unbiasedness in the linear case.
So where does this leave us? The scale issue is serious. With nonresponse bias worsening, we want to adjust for lots of covariates X. This is in tension with handling missing covariates with one big joint model for Y and X (or with imputation within cross-validation). I appreciate these papers that look into what practitioners are often doing!