\documentclass{article}
\usepackage{url}
%\VignetteIndexEntry{Estimates in subpopulations}
\usepackage{Sweave}
\author{Thomas Lumley}
\title{Estimates in subpopulations.}

\begin{document}
\maketitle

Estimating a mean or total in a subpopulation (domain) from a survey, eg the
mean blood pressure in women, is not done simply by taking the subset
of data in that subpopulation and pretending it is a new survey.  This
approach would give correct point estimates but incorrect standard
errors.

The standard way to derive domain means is as ratio estimators. I
think it is easier to derive them as regression coefficients.  These
derivations are not important for R users, since subset operations on
survey design objects automatically do the necessary adjustments, but
they may be of interest.  The various ways of constructing domain mean
estimators are useful in quality control for the survey package, and
some of the examples here are taken from
\texttt{survey/tests/domain.R}.


Suppose that in the artificial \texttt{fpc} data set we want to
estimate the mean of \texttt{x} when \texttt{x>4}.
<<>>=
library(survey)
data(fpc)
dfpc<-svydesign(id=~psuid,strat=~stratid,weight=~weight,data=fpc,nest=TRUE)
dsub<-subset(dfpc,x>4)
svymean(~x,design=dsub)
@ 

The \texttt{subset} function constructs a survey design object with
information about this subpopulation and \texttt{svymean} computes the
mean. The same operation can be done for a set of subpopulations with
\texttt{svyby}.
<<>>=
svyby(~x,~I(x>4),design=dfpc, svymean)
@ 

In a regression model with a binary covariate $Z$ and no intercept,
there are two coefficients that estimate the mean of the outcome
variable in the subpopulations with $Z=0$ and $Z=1$, so we can
construct the domain mean estimator by regression.
<<>>=
summary(svyglm(x~I(x>4)+0,design=dfpc))
@ 

Finally, the classical derivation of the domain mean estimator is as a
ratio where the numerator is $X$ for observations in the domain and 0
otherwise and the denominator is 1 for observations in the domain and
0 otherwise
<<>>=
svyratio(~I(x*(x>4)),~as.numeric(x>4), dfpc)
@ 

The estimator is implemented by setting the sampling weight to zero
for observations not in the domain.  For most survey design objects
this allows a reduction in memory use, since only the number of zero
weights in each sampling unit needs to be kept. For more complicated
survey designs, such as post-stratified designs, all the data are kept
and there is no reduction in memory use.


\subsection*{More complex examples}
Verifying that \texttt{svymean} agrees with the ratio and regression
derivations is particularly useful for more complicated designs where
published examples are less readily available.

This example shows calibration (GREG) estimators of domain means for
the California Academic Performance Index (API).
<<>>=
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018)
gclus1 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069))

svymean(~api00, subset(gclus1, comp.imp=="Yes"))
svyratio(~I(api00*(comp.imp=="Yes")), ~as.numeric(comp.imp=="Yes"), gclus1)
summary(svyglm(api00~comp.imp-1, gclus1))
@ 

Two-stage samples with full finite-population corrections
<<>>=
data(mu284)
dmu284<-svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284)

svymean(~y1, subset(dmu284,y1>40))
svyratio(~I(y1*(y1>40)),~as.numeric(y1>40),dmu284)
summary(svyglm(y1~I(y1>40)+0,dmu284))
@ 

Stratified two-phase sampling of children with Wilm's Tumor,
estimating relapse probability for those older than 3 years (36
months) at diagnosis
<<>>=
library("survival")
data(nwtco)
nwtco$incc2<-as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1))))
dccs8<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~interaction(rel,stage,instit)),
                data=nwtco, subset=~incc2)
svymean(~rel, subset(dccs8,age>36))
svyratio(~I(rel*as.numeric(age>36)), ~as.numeric(age>36), dccs8)
summary(svyglm(rel~I(age>36)+0, dccs8))
@

\end{document}
