\documentclass{article}
\usepackage{url}
%\VignetteIndexEntry{A survey analysis example}
\usepackage{Sweave}
\author{Thomas Lumley}
\title{A survey analysis example}

\begin{document}
\maketitle

This document provides a simple example analysis of a survey data set,
a subsample from the California Academic Performance Index, an annual
set of tests used to evaluate California schools. The API website,
including the original data files are at
\url{http://api.cde.ca.gov}. The subsample was generated as a teaching
example by Academic Technology Services at UCLA and was obtained from
\url{http://www.ats.ucla.edu/stat/stata/Library/svy_survey.htm}.


We have a cluster sample in which 15 school districts were sampled and
then all schools in each district. This is in the data frame
\texttt{apiclus1}, loaded with \texttt{data(api)}. The two-stage sample is
defined by the sampling unit (\texttt{dnum}) and the population
size(\texttt{fpc}).  Sampling weights are computed from the population
sizes, but could be provided separately.
<<>>=
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
@ 

The \texttt{svydesign} function returns an object containing the survey data and metadata.
<<>>=
summary(dclus1)
@ 

We can compute summary statistics to estimate the mean, median, and
quartiles of the Academic Performance Index in the year 2000, the
number of elementary, middle, and high schools in the state, the total
number of students, and the proportion who took the test.  Each
function takes a formula object describing the variables and a survey
design object containing the data.
<<>>=
svymean(~api00, dclus1)
svyquantile(~api00, dclus1, quantile=c(0.25,0.5,0.75), ci=TRUE)
svytotal(~stype, dclus1)
svytotal(~enroll, dclus1)
svyratio(~api.stu,~enroll, dclus1)
@

The ordinary R subsetting functions \verb'[' and \texttt{subset} work
correctly on these survey objects, carrying along the metadata needed
for valid standard errors. Here we compute the proportion of high
school students who took the test
<<>>=
svyratio(~api.stu, ~enroll, design=subset(dclus1, stype=="H"))
@ 

The warnings referred to in the output occured because several
school districts have only one high school sampled, making the second
stage standard error estimation unreliable.

Specifying a large number of variables is made easier by the \texttt{make.formula} function
<<>>=
vars<-names(apiclus1)[c(12:13,16:23,27:37)]
svymean(make.formula(vars),dclus1,na.rm=TRUE)
@ 

Summary statistics for subsets can also be computed with
\texttt{svyby}. Here we compute the average proportion of ``English
language learners'' and of students eligible for subsidized school
meals for elementary, middle, and high schools
<<>>=
svyby(~ell+meals, ~stype, design=dclus1, svymean)
@ 


Regression models show that these socieconomic variables predict API score and whether the school achieved its API target
<<>>=
regmodel <- svyglm(api00~ell+meals,design=dclus1)
logitmodel <- svyglm(I(sch.wide=="Yes")~ell+meals, design=dclus1, family=quasibinomial())
summary(regmodel)
summary(logitmodel)
@ 

We can calibrate the sampling using the statewide total for the previous year's API 
<<>>=
gclus1 <- calibrate(dclus1, formula=~api99, population=c(6194, 3914069))
@ 
which improves estimation of some quantities
<<>>=
svymean(~api00, gclus1)
svyquantile(~api00, gclus1, quantile=c(0.25,0.5,0.75), ci=TRUE)
svytotal(~stype, gclus1)
svytotal(~enroll, gclus1)
svyratio(~api.stu,~enroll, gclus1)
@


\end{document}
