\documentclass{article}
\usepackage{url}
\addtolength{\topmargin}{-0.5in}
\addtolength{\textheight}{0.75in}
\addtolength{\oddsidemargin}{-0.5in}
\addtolength{\textwidth}{1in}
%\VignetteIndexEntry{Two-phase designs in epidemiology}
\usepackage{Sweave}
\author{Thomas Lumley}
\title{Two-phase designs in epidemiology}

\begin{document}
\maketitle
This document explains how to analyse case--cohort and two-phase
case--control studies with the ``survey'' package, using examples from
\url{http://faculty.washington.edu/norm/software.html}. Some of the
examples were published by  Breslow \& Chatterjee (1999).

The data are relapse rates from the National Wilm's Tumor
Study (NWTS). Wilm's Tumour is a rare cancer of the kidney in
children. Intensive treatment cures the majority of cases, but
prognosis is poor when the disease is advanced at diagnosis and for
some histological subtypes.  The histological characterisation of the
tumour is difficult, and histological group as determined by the NWTS
central pathologist predicts much better than determinations by local
institution pathologists. In fact, local institution histology can be
regarded statistically as a pure surrogate for the central lab
histology.

In these examples we will pretend that the (binary) local institution
histology determination (\texttt{instit}) is avavailable for all
children in the study and that the central lab histology
(\texttt{histol}) is obtained for a probability sample of specimens in
a two-phase design. We treat the initial sampling of the study as
simple random sampling from an infinite superpopulation.  We also have
data on disease stage, a four-level variable; on relapse; and on time
to relapse.

\section*{Case--control designs}

Breslow \& Chatterjee (1999) use the NWTS data to illustrate two-phase
case--control designs. The data are available at
\url{http://faculty.washington.edu/norm/software.html} in compressed
form; we first expand to one record per patient.
<<>>=
library(survey)
load(system.file("doc","nwts.rda",package="survey"))
nwtsnb<-nwts
nwtsnb$case<-nwts$case-nwtsb$case
nwtsnb$control<-nwts$control-nwtsb$control

a<-rbind(nwtsb,nwtsnb)
a$in.ccs<-rep(c(TRUE,FALSE),each=16)

b<-rbind(a,a)
b$rel<-rep(c(1,0),each=32)
b$n<-ifelse(b$rel,b$case,b$control)
index<-rep(1:64,b$n)

nwt.exp<-b[index,c(1:3,6,7)]
nwt.exp$id<-1:4088
@ 

As we actually do know \texttt{histol} for all patients we can fit the logistic regression model with full sampling to compare with the two-phase analyses
<<>>=
glm(rel~factor(stage)*factor(histol), family=binomial, data=nwt.exp)
@ 

 The second phase sample consists of all patients with unfavorable
 histology as determined by local institution pathologists, all cases,
 and a 20\% sample of the remainder.  Phase two is thus a stratified
 random sample without replacement, with strata defined by the
 interaction of \texttt{instit} and \texttt{rel}.

<<>>=
dccs2<-twophase(id=list(~id,~id),subset=~in.ccs,
                strata=list(NULL,~interaction(instit,rel)),data=nwt.exp)

summary(svyglm(rel~factor(stage)*factor(histol),family=binomial,design=dccs2))
@ 

Disease stage at the time of surgery is also recorded. It could be
used to further stratify the sampling, or, as in this example, to
post-stratify.  We can analyze the data either pretending that the
sampling was stratified or using \texttt{calibrate} to post-stratify
the design.

<<>>=
dccs8<-twophase(id=list(~id,~id),subset=~in.ccs,
                strata=list(NULL,~interaction(instit,stage,rel)),data=nwt.exp)
gccs8<-calibrate(dccs2,phase=2,formula=~interaction(instit,stage,rel))

summary(svyglm(rel~factor(stage)*factor(histol),family=binomial,design=dccs8))
summary(svyglm(rel~factor(stage)*factor(histol),family=binomial,design=gccs8))
@


\section*{Case--cohort designs}
In the case--cohort design for survival analysis, a $P$\% sample of a cohort
is taken at recruitment for the second phase, and all participants who
experience the event (cases) are later added to the phase-two sample.

Viewing the sampling design as progressing through time in this way,
as originally proposed, gives a double sampling design at phase two.
It is simpler to view the process \emph{sub specie aeternitatis}, and
to note that cases are sampled with probability 1, and controls with
probability $P/100$. The subcohort will often be determined
retrospectively rather than at recruitment, giving stratified random
sampling without replacement, stratified on case status.  If the
subcohort is determined prospectively we can use the same analysis,
post-stratifying rather than stratifying.

There have been many analyses proposed for the case--cohort design
(Therneau \& Li, 1999).  We consider only those that can be expressed as a
Horvitz--Thompson estimator for the Cox model.



First we load the data and the necessary packages. The version of the
NWTS data that includes survival times is not identical to the data
set used for case--control analyses above.
<<>>=
library(survey)
library(survival)
data(nwtco)
ntwco<-subset(nwtco, !is.na(edrel))
@ 

Again, we fit a model that uses \texttt{histol} for all patients, to compare with the two-phase design
<<>>=
coxph(Surv(edrel, rel)~factor(stage)+factor(histol)+I(age/12),data=nwtco)
@ 

We define a two-phase survey design using simple random
superpopulation sampling for the first phase, and sampling without
replacement stratified on \texttt{rel} for the second phase. The
\texttt{subset} argument specifies that observations are in the phase-two sample if they are in the subcohort or are cases.  As before, the data structure is rectangular, but variables measured at phase two may be \texttt{NA} for participants not included at phase two.

We compare the result to that given by \texttt{survival::cch} for Lin
\& Ying's (1993) approach to the case--cohort design.


<<>>=
(dcch<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel),
                  subset=~I(in.subcohort | rel), data=nwtco))
svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12),
                design=dcch)

subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,] 
ccoh.data$subcohort <- subcoh[selccoh]
cch(Surv(edrel, rel) ~ factor(stage) + factor(histol) + I(age/12),
           data =ccoh.data, subcoh = ~subcohort, id=~seqno,
           cohort.size=4028, method="LinYing")
@ 


Barlow (1994) proposes an analysis that ignores the finite population
correction at the second phase.  This simplifies the standard error
estimation, as the design can be expressed as one-phase stratified
superpopulation sampling. The standard errors will be somewhat
conservative. More data preparation is needed for this analysis as the
weights change over time.
<<>>=
nwtco$eventrec<-rep(0,nrow(nwtco))
nwtco.extra<-subset(nwtco, rel==1)
nwtco.extra$eventrec<-1
nwtco.expd<-rbind(subset(nwtco,in.subcohort==1),nwtco.extra)
nwtco.expd$stop<-with(nwtco.expd, 
                      ifelse(rel & !eventrec, edrel-0.001,edrel))
nwtco.expd$start<-with(nwtco.expd, 
                       ifelse(rel & eventrec, edrel-0.001, 0))
nwtco.expd$event<-with(nwtco.expd,
                       ifelse(rel & eventrec, 1, 0))
nwtco.expd$pwts<-ifelse(nwtco.expd$event, 1, 1/with(nwtco,mean(in.subcohort | rel)))
@ 

The analysis corresponds to a cluster-sampled design in which
individuals are sampled stratified by subcohort membership and then
time periods are sampled stratified by event status.  Having
individual as the primary sampling unit is necessary for correct
standard error calculation. 

<<>>=
(dBarlow<-svydesign(id=~seqno+eventrec, strata=~in.subcohort+rel,
                    data=nwtco.expd, weight=~pwts))
svycoxph(Surv(start,stop,event)~factor(stage)+factor(histol)+I(age/12),
                design=dBarlow)
@ 

In fact, as the finite population correction is not being used the second stage of the cluster sampling could be ignored.   We can also produce the stratified bootstrap standard errors of Wacholder et al (1989), using a replicate weights analysis

<<>>=
(dWacholder <- as.svrepdesign(dBarlow,type="bootstrap",replicates=500))
svycoxph(Surv(start,stop,event)~factor(stage)+factor(histol)+I(age/12),
                design=dWacholder)
@ 


\subsection*{Exposure-stratified designs}


Borgan et al (2000) propose designs stratified or post-stratified on
phase-one variables. The examples at
\url{http://faculty.washington.edu/norm/software.html} use a different
subcohort sample for this stratified design, so we load the new
\texttt{subcohort} variable
<<>>=
load(system.file("doc","nwtco-subcohort.rda",package="survey"))
nwtco$subcohort<-subcohort

d_BorganII <- twophase(id=list(~seqno,~seqno),
                       strata=list(NULL,~interaction(instit,rel)),
                       data=nwtco, subset=~I(rel |subcohort))
(b2<-svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12),
                design=d_BorganII))
@ 


We can further post-stratify the design on disease stage and age with \texttt{calibrate}
<<>>=
d_BorganIIps <- calibrate(d_BorganII, phase=2, formula=~age+interaction(instit,rel,stage))
svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12),
                  design=d_BorganIIps)
@ 


\section*{References}

Barlow WE (1994). Robust variance estimation for the case-cohort
design. \emph{Biometrics} 50: 1064-1072

Borgan \O, Langholz B, Samuelson SO, Goldstein L and Pogoda J (2000). Exposure stratified case-cohort designs,  \emph{Lifetime Data Analysis}  6:39-58

Breslow NW and Chatterjee N. (1999) Design and analysis of two-phase
studies with binary outcome applied to Wilms tumour prognosis.  \emph{Applied
Statistics}  48:457-68.


Lin DY, and Ying Z (1993). Cox regression with incomplete covariate measurements.
\emph{Journal of the American Statistical Association} 88: 1341-1349.

Therneau TM and Li H., Computing the Cox model for case-cohort
designs. \emph{Lifetime Data Analysis} 5:99-112, 1999

Wacholder S, Gail MH, Pee D, and Brookmeyer R (1989)
Alternate variance and efficiency calculations for the case-cohort design
\emph{Biometrika}, 76, 117-123 
\end{document}
