---
title: "Pseudo-values for survival data"
author: Terry Therneau
date: '`r format(Sys.time(),"%d %B, %Y")`'
bibliography: refer.bib
output: 
    bookdown::html_document2:
        base_format: rmarkdown::html_vignette
        number_sections: true
        toc: true
vignette: >
  %\VignetteIndexEntry{Pseudo-values for survival data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
options(continue="  ", width=70, prompt="> ")
options(contrasts=c("contr.treatment", "contr.poly")) #ensure default
options(show.signif.stars = FALSE) #statistical intelligence

knitr::opts_chunk$set(
  collapse = TRUE, warning=FALSE, error=FALSE, tidy=FALSE,
  fig.asp=.75, fig.align="center",
  comment = "#>", highlight=TRUE, echo=TRUE, prompt=FALSE,
  fig.width=7, fig.height=5.5
)

library(survival)
library(geepack)
library(survey)
```

# Introduction

Let $Y$ be survival time and $f$ some function of interest, for which we desire
to estimate $E(f(Y))$ over the population.
With complete data on each observation's survival time, we can compute the
expectation in the straightforward way as the simple mean $\sum f(y_i) /n$.
Due to censoring, survival data is unfortunately incomplete.

Pseudo-values are based on a simple idea.  Suppose the data are incomplete, but
we have an estimator $\hat\theta$ of the quantity of interest
$\theta = E(f(Y))$; e.g., $\theta$ = the survival probability at
time 45, and $\hat\theta$ the Kaplan-Meier estimate at time 45.
The pseudo-value for $y_i$ is then defined as

\begin{equation}
  \theta_{(i)} = \hat\theta + (n-1)(\hat\theta - \hat\theta_{-i})
\end{equation}

where $\hat\theta_{-i}$ is defined as the value of the estimate when subject $i$
is omitted from the sample.
As an illustration, evaluate this formula when $\theta$ is an ordinary mean:
\begin{align*}
  \hat\theta +  (n-1)(\hat\theta - \hat\theta_{-i}) &=
   \sum_j y_j/n + (n-1)\left( \sum_{j} y_j/n - \sum_{j\ne i} y_j/(n-1) \right)\\
  &= \sum_j y_j - \sum_{j\ne i} y_j \\
  &= y_i
\end{align*}

In this case the pseudo-value has exactly recovered the data value $y_i$.
The idea is to use the pseudo-observations $\theta_{(i)}$ as a replacement
for the incompletely observed data $y_i$, i.e., as a stand-in for
the *uncensored* data.  These pseudo-values are not
censored, allowing ordinary statistical methods to be applied.
A good overview of pseudo-value based methods for survival is provided by 
Andersen and Perme @Andersen10.

The `pseudo` function in the survival package uses values based on the
infinitesimal jackknife (IJ), i.e.,
\begin{equation}
 \tilde\theta_{(i)} = \hat\theta + n\frac{\partial \hat\theta}{\partial w_i}
\end{equation}
where $w_i$ is the case weight for observation $i$, and the derivative is
evaluated at the observed data, i.e., at $w = (1, 1, \ldots, 1)$.

Why use IJ based pseudo-values instead of jackknife based values?

1. The largest advantage is computational; IJ values can be assembled
    much more quickly.  
2. The survival package already makes extensive use of IJ values to
    compute robust variance estimates, so these naturally fit into that
    framework.  We take advantage of existing test suites to ensure
    accurate computations.

Both estimates are linear approximations to a functional surface, something 
nicely pointed out by Efron @Efron82: the IJ is a tangent plane to the
surface at its center and the jackknife a secant plane.  
As such, they have the property
that the average of the IJ pseudo-values will recover the starting estimate,
mean$(\tilde\theta_{(i)}) = \hat\theta$.  The choice of $n$ for the IJ based
pseudo-value makes the computation for an ordinary mean exact, as was shown above
for $n-1$ and the jackknife pseudo-value; indeed, Efron argues that
the use of $n-1$ is somewhat arbitrary, i.e., for statistics $\theta$ other
than the mean, a multiplier of $n$, $n+1$, or some other value might be more
exact.
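For the ordinary mean the multiplier $n$ is easy to verify directly: write
the weighted estimate as $\hat\theta(w) = \sum_j w_j y_j / \sum_j w_j$ and
evaluate the derivative at the observed data $w = (1, \ldots, 1)$.
\begin{align*}
  \frac{\partial \hat\theta}{\partial w_i} &=
    \left. \frac{y_i \sum_j w_j - \sum_j w_j y_j}{\left(\sum_j w_j\right)^2}
    \right|_{w=1} = \frac{y_i - \hat\theta}{n} \\
  \tilde\theta_{(i)} &= \hat\theta + n\, \frac{y_i - \hat\theta}{n} = y_i
\end{align*}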
The largest potential deficit of the IJ based approach is that the literature
is small --- most of the direct examinations have focused on jackknife
based values. 
Parner et al. @Parner21 examine the IJ approach in detail, and show that
the resulting pseudo-values have the same asymptotic properties as the
jackknife pseudo-values.
Examples show that the numerical difference between the two is minuscule, with
the possible exception of large outliers. 
  
# Residual mean survival time

We will start with one of the more compelling uses, which provides modeling
tools for the mean time spent in a state. For any non-negative random
variable, a well-known identity is that the mean is equal to the area under
the survival curve.

\begin{equation*}
 \mu = \int S(t) dt
\end{equation*}

This extends to the multi-state case, where the expected time in state will be
equal to the area under the P(state) curve for that state.

Since the Kaplan-Meier gives an unbiased estimate of $S$, we can use the area
under the KM to estimate the mean time to death. However, since the entire
$S(t)$ curve is usually not available, i.e., the KM terminates before reaching
0, we instead estimate the restricted mean survival time (RMST), using the area
under the KM up to some specified point $\tau$. This is interpreted as the
expected number of life years, out of the first $\tau$ years since initiation.
If $\tau=10$ and the area under the curve were 8.1, the relevant phrase would be
an "expected lifetime of 8.1 out of the next 10 years". Other common labels are
the sojourn time, or, for a multi-state model, the restricted mean time in state
(RMTS). The sojourn times for all states must sum to $\tau$, i.e., everyone has
to be somewhere. Common choices for $\tau$ are a fixed time of interest such as
2, 5 or 10 years, the last observed event time, or the point at which any one of
the curves has no subjects remaining.

Here is a simple example using the lung cancer data set. 

```{r lung1}
lfit0 <- survfit(Surv(time/365.25, status) ~ 1, data=lung)
lfit1 <- survfit(Surv(time/365.25, status) ~ ph.ecog, data=lung)
print(lfit1, rmean=2.5)
plot(lfit1, lty=1:4, xlab="Years since enrollment", ylab="Survival")
legend("topright", c("PS 0", "PS 1", "PS 2", "PS 3"), 
       lty=1:4, bty='n')
```
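
To make the "area under the curve" definition concrete, here is a direct
computation of the overall restricted mean at $\tau = 2.5$ years from the
`lfit0` step function. This is a sketch of the definition rather than the
package's internal computation; it should agree with the `rmean` value that
`survfit` reports.

```{r rmstarea}
# integrate the overall KM as a step function, from 0 to tau
tau  <- 2.5
keep <- lfit0$time < tau
width  <- diff(c(0, lfit0$time[keep], tau))  # width of each step
height <- c(1, lfit0$surv[keep])             # survival over each step
sum(width * height)                          # restricted mean at tau
```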

The strongest predictor in the data set is ECOG performance score, which has
levels of 0-3. For a single categorical predictor such as this, it is easy to
obtain the RMST and its standard error directly from the print routine for
`survfit`. We can also do this with pseudo-values, which allow for a
multivariate model. The result shows a loss of about .26 years for each 1 point
increase in the performance score, and about .32 years longer RMST for females.

```{r lung2}
pmean <- pseudo(lfit0, times=2.5, type="RMST")
ldata <- data.frame(lung, pmean= c(pmean), 
                    id=1:nrow(lung))  
afit1 <- lm(pmean ~ ph.ecog + sex + age, data=ldata)
round(summary(afit1)$coefficients, 3)
```
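
As a quick check, the pseudo-values should average back to the overall
restricted mean, per the "mean of the pseudo-values recovers the estimate"
property noted in the introduction.

```{r rmstcheck}
mean(pmean)               # average of the RMST pseudo-values
print(lfit0, rmean=2.5)   # agrees with the restricted mean from survfit
```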

The IJ residuals for observations that are censored early in the study are of
necessity small, since such observations have little opportunity to affect the
results; this in turn means that the pseudo-values do not have a common
variance. An observation censored before the first event has residual 0, so its
pseudo-value has no variance at all. In such a case, theory argues for using
White's variance estimate for the linear model fit, which accounts for possible
heteroscedasticity. This is the same as the "working independence" estimate of a
GEE model, or the variance estimate from a survey sampling approach.  We compute
both of these below.

```{r geese}
afit2 <- geese(pmean ~ ph.ecog + sex + age, data=ldata, id=id,
               subset = (!is.na(ph.ecog)))
round(summary(afit2)$mean, 3)


ldesign <- svydesign(data=ldata, id= ~id, weights=NULL,
              variables= ~ . -id)
afit3 <- svyglm(pmean ~ ph.ecog + sex + age, design=ldesign)
round(summary(afit3)$coefficients, 3)
```

The estimated coefficients will be identical for all three approaches, as
evidenced above. In this particular case the robust and normal standard errors
hardly differ, which we have found to be the usual case when there is a single
pseudo-value per subject. When there are pseudo-values at multiple reporting
times, and thus multiple observations per subject, then the correction becomes
essential. Deficiencies in the `geese` function lead us to use the survey
sampling approach from here forward: any missing values cause `geese` to fail,
the input data must be grouped by id, and standard extraction functions such as
`coef` and `vcov` have not been implemented. The
`survey` package requires that the design be specified in advance; the resulting
object acts as the data for subsequent fits.

The validity of pseudo-values relies on an assumption of non-informative censoring.
When censoring is uninformative within strata, but not overall, a solution is to
base the pseudo-values on the per-stratum survival curves. For example, the
`lung` data set comes from a multi-center study, and contains an identifier for
the institution from which the subject was recruited. We might want to adjust
for the possibility that different institutions recruit from different patient
populations, with different overall survival. If those institutions had
different median follow-up times as well as differential survival, this would
introduce informative censoring. (Differential follow-up is quite common in
multi-center studies; some centers join late due to administrative delays.)

```{r strata}
# There is 1 missing value for inst, so by default pseudo would return 227
#  values instead of 228.  When creating pmean4 the values need to go into
#  the correct rows; specifying na.exclude in the survfit call expands the
#  result to the proper length.

lfit4 <- survfit(Surv(time/365.25, status) ~ inst, data=lung, na.action=na.exclude)
pmean4 <- pseudo(lfit4, time=2.5, type="auc")
ldata$pmean4 <- c(pmean4)
afit4a <- lm(pmean4 ~ ph.ecog + sex + age, data=ldata)

# survey package does not allow a missing strata, and one obs in the lung data
# has institution missing
temp <- subset(ldata, !is.na(inst))
ldesign4 <- svydesign(data=temp,  id= ~id, weights=NULL, strata= ~inst,
                      variables= ~ .-id -inst)
afit4b <- svyglm(pmean4 ~ ph.ecog + sex + age, design=ldesign4)
round(summary(afit4a)$coefficients, 3)
round(summary(afit4b)$coefficients[1:4,], 3)
```

It is interesting that, in this case, the correct variance is somewhat smaller
than that from the simple lm fit.

# Survival and probability in state

## AML data

As a first case for this endpoint, look at a simple survival data set: the well
known AML study with 23 subjects, which records the time to relapse for patients
with acute myelogenous leukemia. Relapse times range from 5 to 48 months; 5 of
the 23 patient times are censored. Get the pseudo-values for 12 and 24 month
survival.

```{r test1}
with(aml, Surv(time, status))
fit1 <- survfit(Surv(time, status) ~1, aml)
rr1 <- resid(fit1, times=c(12,24))
pv1 <- pseudo(fit1, times= c(12, 24))
round(pv1[1:8,], 4)
``` 

Both the matrix of IJ values from `resid` and the matrix of IJ pseudo-values from
the `pseudo` function have one row per subject and one column per reporting
time. The first censoring is at 13 months, so at the first reporting time of
12 months the pseudo-values behave exactly like a mean, and have recaptured the
(uncensored) 0/1 response of "relapse-free at 12 months". At 24 months, after
censoring enters, the pseudo-values are no longer constrained to lie in $[0,1]$.
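
The first claim is easy to verify numerically, along with the "average
recovers the estimate" property from the introduction; the sketch below
assumes that `pv1` retains the row order of the `aml` data.

```{r test1b}
# at 12 months the pseudo-values are exactly the 0/1 indicator
all.equal(as.vector(pv1[,1]), as.numeric(aml$time > 12))
# at both times their average recovers the Kaplan-Meier estimate
rbind(pseudo = colMeans(pv1),
      KM     = summary(fit1, times=c(12,24))$surv)
```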

What happens if we use these values in an ordinary regression? The `data.frame`
argument causes the values to be returned in long form as a data.frame.

```{r test2}
pdata <- pseudo(fit1, times= c(12, 24), data.frame=TRUE)
lfit1 <- lm(pseudo ~1, pdata, subset= (time==12))
lfit2 <- lm(pseudo ~1, pdata, subset= (time==24))

summary(lfit1)$coefficients
summary(lfit2)$coefficients

summary(fit1, times=c(12,24))
```

The regressions have exactly reproduced the KM values at 12 and 24 months, as
expected, and the estimated standard errors almost match those from the
`survfit` function. The robust (IJ) variance for the Kaplan-Meier, i.e. the sum
of squared IJ values, can in fact be shown to exactly equal the Greenwood
estimate of variance for a KM (Anne Eaton, personal communication). The
difference above is due to the fact that the lm function uses $n-1$ rather than
$n$ in computing a variance.
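
That claim is simple to check numerically; the sketch below assumes that
`resid` returns the per-subject IJ influence values, whose squared column
sums form the robust variance.

```{r test2b}
rbind(IJ      = sqrt(colSums(rr1^2)),
      survfit = summary(fit1, times=c(12,24))$std.err)
```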

A natural follow-on is to look at covariates.  In the AML data set the covariate
`x` denotes the two treatment arms.

```{r test3}
cfit3 <- coxph(Surv(time, status) ~x, data=aml)
cfit3

pdata <- cbind(pdata, x= aml$x)   # original data + pseudo-values
lfit3 <- lm(pseudo ~ x + factor(time), data=pdata)
summary(lfit3)$coefficients

sfit3 <- survfit(cfit3, newdata=list(x= c("Maintained", "Nonmaintained")))
summary(sfit3, times=c(12,24))
```

The above code contains a lot of ideas. First, the Cox model estimates that
subjects on the no maintenance arm have a hazard rate about 2.5-fold higher than
the maintenance arm, $p=.09$. Second, a linear model using pseudo-values has
estimated the absolute probability of no recurrence, at 12 and 24 months, to be
about .23 lower for the no maintenance group, $p=.12$; conversely, the
probability of recurrence is .23 higher. Last, predicted survival curves from
the Cox fit give its estimates of the absolute probabilities at 12 and 24
months. The overall probabilities of recurrence are gathered into a table below.

```{r test4, echo=FALSE}
temp <- matrix(0, 2, 4, dimnames=list(c("Maintained", "Nonmaintained"),
                                      c("Cox, 12", "Pseudo, 12", 
                                        "Cox, 24", "Pseudo, 24")))
temp[,c(1,3)] <- 1- t(summary(sfit3, times=c(12,24))$surv)
dummy <-  expand.grid(x= c("Maintained", "Nonmaintained"),
                      time=c(12, 24))
temp[,c(2,4)] <- 1- predict(lfit3, newdata= dummy)
temp <- rbind(temp, "Difference" = temp[2,]- temp[1,])
round(temp,2)
```

One advantage of the linear model is that the pseudo-values are direct estimates
of a probability, and so may be easier to communicate to study participants than
a hazard ratio. The linear model contains the strong assumption that the
difference in survival is the same at times 12 and 24; the Cox model makes the
equally strong assumption that hazards are proportional across all time points.
Of potentially more consequence, the pseudo-value data set has two observations
for each subject, leading to correlated data. We can correct for this using a
robust sandwich estimator, either via survey sampling or a GEE argument; both
are shown below. The `geese` function has the unfortunate feature that correct
standard errors require all of the observations for a subject to be in
contiguous rows of the input data. No error message arises if this does not
hold, only an incorrect result.

```{r svy}
# Naive lm
summary(lfit3)$coefficients
#
#survey
# When an id is constructed rather than supplied, pseudo() purposely gives
#  it a non-standard name to avoid confusion with any user's variable names.
#  But that name is a PITA to use in formulas
flag <- toupper(names(pdata))=='(ID)'
pdata$id <- pdata[,flag] 
pdesign <- svydesign(id= ~ id, variables= ~ pseudo + x + time, weights=NULL,
                     data=pdata)
sfit <- svyglm(pseudo ~ x + factor(time), design=pdesign)
summary(sfit)$coefficients
#
# GEE 1
gfit1 <- geese(pseudo ~ x + factor(time), data=pdata, id= id) # wrong answer
summary(gfit1)$mean
#
# GEE 2
pdata <- pdata[order(pdata$id),]  # group id values to be together
gfit2 <- geese(pseudo ~ x + factor(time), data=pdata, id= id)
summary(gfit2)$mean
```

All of the approaches give the same coefficient estimates, differing only in 
the estimated standard error. 

## Colon cancer

Now work with a larger data set, using time to failure (recurrence or death) in
the colon cancer study. The data set has 2*929 observations, the first set for
time to recurrence and the second for time to death. For anyone without
recurrence, their follow-up time for recurrence is equal to their follow-up time
for death.

```{r cdata}
cdata <- subset(colon, etype==1, -etype)  # time to recurrence
temp  <- subset(colon, etype==2)
cdata$status <- pmax(cdata$status, temp$status)

trtsurv <- survfit(Surv(time, status) ~ rx, cdata, id=id)
plot(trtsurv, fun="event", xscale= 365.25, col = 1:3, lwd=2,
     xlab= "Years from randomization", ylab="Treatment failure")
legend(5*365, .25, levels(cdata$rx), col=1:3, lwd=2, bty='n')

ccox <- coxph(Surv(time, status) ~ rx + sex + age + extent + node4+
                  obstruct + perfor + adhere,  data=cdata)
```

The graph shows that the Levamisole + 5-FU treatment arm is markedly superior to
either observation or levamisole alone. There are few further events after 4--5
years, and the curves flatten out. Having 4 or more
positive lymph nodes and greater local spread of the disease are also potent
predictors. Now look at pseudo-values for the data.

```{r colon2}
csurv <- survfit(Surv(time, status)  ~1, data=cdata, id=id)
cptemp <- pseudo(csurv, time= 1:6 * 365.25, data.frame=TRUE)
cptemp$pdeath <- 1- cptemp$pseudo
cptemp$year <- round(cptemp$time/365.25)

cpdat <- merge(cptemp, 
               subset(x=cdata,select=c(id, rx, extent, node4)), 
               by="id")
```
 
Most users will prefer a model for the risk of death, i.e., for pseudo-values
based on the probability of death $1- S(t)$. These are simply $1- p(S)$, where
$p(S)$ are the pseudo-values for $S$: since
$\partial (1-\hat S)/\partial w_i = -\partial \hat S/\partial w_i$, the IJ
pseudo-value for $1-S$ is $(1-\hat S) - n\,\partial\hat S/\partial w_i = 1 - p(S)$.
This was saved as the variable `pdeath`. We then merge the pseudo data, which
has multiple rows per subject, with the original data, using the subject
identifier `id` found in the colon cancer data set. Because `id` was specified
in the `survfit` call, it carries through and properly labels the pseudo-values.

```{r}
hist(cpdat$pdeath, nclass=50, main=NULL)
```

This shows that for a large data set with moderate censoring, the pseudo-values
are clustered around 0 and 1, mimicking binomial data. To assess survival, which
values should we use: year 4 alone, 2 or 3 of the 6 years, or all of them? One
important consideration for all of this is the choice of a transformation
function $f$, where we assume that $y \approx f(\eta) + \epsilon$.
(The generalized linear model literature normally focuses on the *link*
function $g = f^{-1}$.)

The ideal function $f$ will 

* Transform treatment effects to a common scale over time.  That is, a
    separate intercept for each time point will suffice.  No interactions between
    covariates and time are needed.
* Cause multivariate effects to be additive.  For example, the effect of
    treatment is the same for those with and without 4+ positive lymph nodes.
    No between-covariate interactions are needed.
* Normalize the variance so that $\epsilon$ is constant across time 
     and across covariates.  Alternatively, choose a distribution that 
     properly maps between the predicted mean and variance.
* Bound predicted values to the range of 0--1.


Satisfying all four of these at once is likely to be impossible. Logistic
regression, most users' immediate response to estimation of a yes/no question,
fails directly.  First, it does not accommodate response values outside the
range of 0--1, and second, the variance for a predicted value near 0 or 1 is
assumed to drop to zero.  The maximum value of 1.4 for `cpdat$pdeath` fails
both of these.
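
A look at the range of the pseudo-values confirms this:

```{r pdrange}
round(range(cpdat$pdeath), 2)   # the maximum lies well above 1
```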

As a first pass, look at the between-curve differences of the other two arms vs.
the Lev+5FU arm across time, based on the Kaplan-Meier estimates. For this
particular data set and these time points, the absolute difference between the
curves is more stable across time than the logit differences.

```{r c6}
temp1 <- summary(trtsurv, times= 1:6*365.25)
temp2 <- matrix(temp1$surv, nrow=6)
temp3 <- rbind("Obs vs Lev+5FU"= temp2[,1]- temp2[,3],
            "Lev vs Lev+5FU"= temp2[,2]- temp2[,3]) 

cat("absolute difference\n")
round(temp3*100, 1)
#
cat("logit difference\n") 
logit <- function(x) log(x/(1-x))
temp4 <- rbind("Obs vs Lev+5FU"= logit(temp2[,1])- logit(temp2[,3]),
            "Lev vs Lev+5FU"= logit(temp2[,2])- logit(temp2[,3]))
round(temp4*100, 1)
```

Based on this, do a first model with linear effects. The next question is which
time points to use, and how many. First, look at 6 models each using a single
time point, each containing treatment, extent and nodes, but summarize only the
Lev+5FU coefficient.

```{r times}
onetime <- matrix(0, 6, 4,
                  dimnames=list(paste("Time", 1:6), 
                                c("coefficient", "se.glm", "coef", "se.survey")))
cpdesign <- svydesign(id= ~id, weights=NULL, data=cpdat,
                      variables = ~ . - id)
for (i in 1:6) {
    fit1 <- lm(pdeath ~ rx + extent + node4, data=cpdat, subset= (year==i))
    onetime[i,1:2] <- summary(fit1)$coefficients[3,1:2]
    sfit1 <- svyglm(pdeath ~ rx+ extent + node4, design=cpdesign, 
                    subset= (year==i))
    onetime[i,3:4] <- summary(sfit1)$coefficients[3, 1:2]
}

onetime <- cbind(onetime, "coef/se"= onetime[,3]/onetime[,4])
round(onetime[, c(1,2,4,5)],3)
```

If you are only going to use one time point, then a robust variance is
apparently not necessary, at least in this case; the best time point in terms
of the $z$ statistic or power, by an admittedly small margin, is 4--5 years,
when the results have largely matured. This also happens to be the largest
estimated gain for Lev+5FU, reducing the absolute failure rate by 16%. However,
a robust variance is called for not only when there are multiple observations
per subject, but also in the case of heteroscedasticity; we should therefore
not be too quick to declare it unnecessary based on a single example.

Now consider the combination of times 2, 4 and 6, or all 6 time points.

```{r twotime}
pfun <- function(x,d=2) printCoefmat(x, digits=d, P.values=TRUE,
                                 has.Pvalue=TRUE, signif.stars= FALSE)

sfit1 <- svyglm(pdeath ~ rx + extent + node4,
                        design= cpdesign, subset=(year==4))
pfun(summary(sfit1)$coefficients[2:5,])

#
sfit3 <- svyglm(pdeath ~ rx + extent + node4 + factor(year), 
                        design= cpdesign, 
                subset= (year==2 | year==4 | year==6))
pfun(summary(sfit3)$coefficients[2:5,])

sfit6 <- svyglm(pdeath ~ rx + extent + node4 + factor(year),
                design= cpdesign)
pfun(summary(sfit6)$coefficients)
```

For the coefficient of interest, the treatment effect, the use of 1, 3, or 6
time points hardly changes the t-statistic or the estimate. All the remaining
coefficients are also fairly stable. The year coefficients show the baseline
rate creeping up.

#### Logit link

Although a linear link was indicated for the colon data, the logit link is more
common. Directly using it will fail, however; consider the following.

```{r catch}
tryCatch( svyglm(pdeath ~rx + extent + node4 + factor(year), design=cpdesign,
              family= gaussian(link = "logit")),
         error= function(e) e)
```
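
The failure is easy to reproduce by hand: the logit of a value greater than 1
asks for the log of a negative number.

```{r logitnan}
x <- 1.4       # the largest value of pdeath, noted earlier
log(x/(1-x))   # NaN: log of a negative number
```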

The same out-of-range error occurs with the simple `glm` function. The issue is
that the glm function uses the logit link $f(x) = \log(x/(1-x))$ to create
starting estimates for the iteration, and values outside of $(0, 1)$ lead to a
missing value. (That is actually the only place the link function itself is
used; the iteration works with its inverse.) Two choices are to give explicit
initial values, or to define our own link. Here is an example of the first,
i.e., giving starting estimates which are "good enough".

```{r link1}
sfit6b <- svyglm(pdeath ~rx + extent + node4 + factor(year), design=cpdesign,
              family= gaussian(link = "logit"), 
              etastart= pmax(.05, pmin(.95, cpdat$pdeath)))
```

A problem with this is that if we add a subset argument, then the starting
estimates also need to be explicitly trimmed down. An alternative is to define
our own link function, based on the code found in the documentation for
`family`, or on examination of the `make.link` function used by glm. We call
the resulting function `blogit` for "bounded logit", and have included it in
the survival package.

```{r blogit}
blogit <- function(edge=.05) {
    new <- make.link("logit")
    new$linkfun <- function(mu) { 
        x <- (pmax(edge, pmin(mu, 1-edge)))
        log(x/(1-x))
    }
    new
}
sfit3c <- svyglm(pdeath ~rx + extent + node4 + factor(year), design=cpdesign,
              family= gaussian(link = blogit()), subset=(year %in% c(2,4,6)))
pfun(summary(sfit3c)$coefficients)
```

Though common, results from the logit link are harder to interpret than the
linear link.  The latter estimates that Levamisole+5FU will reduce the
probability of death by 14%, on average, for time points from years 2--6,
while the logit link estimates an odds ratio of $\exp(-.645) = .52$ for
treatment failure.

#### Complementary log-log link

If a Cox model holds, then 
\begin{align*}
  1 - F(t) &= S(t) = e^{-\Lambda_0(t) e^{X\beta}} \\
  \log(-\log(S(t))) &= \log[\Lambda_0(t)]  + X \beta
\end{align*}

which motivates using the following link function.

```{r loglog}
bcloglog <- function(edge=.05) {
    new <- make.link("cloglog")
    new$linkfun <- function(mu) {
        x <- (pmax(edge, pmin(mu, 1-edge)))
        log(-log(x))
    }
    new$name <- "bcloglog"
    new
}
sfit3d <- svyglm(pseudo ~rx + extent + node4 + factor(year), design=cpdesign,
              family= gaussian(link = bcloglog()), subset=(year %in% c(2,4,6)))
pfun(summary(sfit3d)$coefficients)
cfit <- coxph(Surv(time, status) ~ rx + extent + node4, cdata)                 
round(summary(cfit)$coefficients, 3)
```

And indeed, the coefficients of the glm model closely mimic a coxph fit.

# Competing risks

In a competing risks situation, let $p_j(t)$ be the probability of being in
state $j$ at time $t$.
The Fine-Gray model assumes that
$1- p_1(t) = \exp(-\exp(\beta_0(t) + \beta X))$,
or equivalently that $\log(-\log(1- p_1(t))) = \beta_0(t) + \beta X$,
where outcome $j=1$ is assumed to be the event of interest.
We can use pseudo-values from an Aalen-Johansen estimate of $p_j$,
along with a complementary log-log link GLM, to estimate the same
quantity.
We will use the `mgus2` data set, which has been used to
illustrate competing risks elsewhere in the survival vignettes. 


```{r mgus2cr}
mdata <- mgus2
mdata$etime <- with(mdata, ifelse(pstat==1, ptime, futime))
mdata$event <- factor(with(mdata, ifelse(pstat==1, 1, 2*death)), 0:2,
                      c("censor", "PCM", "Death"))
mdata$age10 <- mdata$age/10  # age in decades

msurv <- survfit(Surv(etime, event) ~1, data=mdata, id=id)
plot(msurv, lty=1:2, xscale=12, 
     xlab="Years from Diagnosis", ylab="Probability in state")
text(c(345, 340), c(.18, .73), c("PCM", "Death w/o PCM"))
```

At 30 years post diagnosis approximately 13% of the subjects have experienced a
plasma cell malignancy (PCM), while 78% have died without PCM. Fit the model
using either 30 time points for the pseudo-values (years 1--30), 10 (every 3
years), or 5 (every 6). Pseudo-values are produced for each of the 3 states; we
keep only those for the PCM outcome.

```{r mgus2p}
mps <- pseudo(msurv, times=12*(1:30), type="pstate", data.frame=TRUE)
mdata6 <- merge(mdata, subset(mps, state== "PCM"), by="id")
mdata6$year <- mdata6$time/12

mdesign <- svydesign(data=mdata6, id=~id, weights=NULL, variables= ~. -id)
mpfit1 <- svyglm(pseudo ~ age10 + sex + mspike + factor(year), design = mdesign,
                 family = gaussian(link =bcloglog()))
mpfit2 <- svyglm(pseudo ~ age10 + sex + mspike + factor(year), design = mdesign,
                 subset = (year %in% (3* 1:10)),
                 family = gaussian(link =bcloglog()))
mpfit3 <- svyglm(pseudo ~ age10 + sex + mspike + factor(year), design = mdesign,
                 subset = (year %in% (6* 1:5)),
                 family = gaussian(link =bcloglog()))

fgdata <- finegray(Surv(etime, event) ~ .,  mdata, etype="PCM")
fgfit <-  coxph(Surv(fgstart, fgstop, fgstatus) ~ age10 + sex + mspike, 
                data=fgdata, weights = fgwt)

temp <- cbind(summary(fgfit)$coef[1:3, c(1,3)],
              summary(mpfit1)$coef[2:4, 1:2],
              summary(mpfit2)$coef[2:4,1:2],
              summary(mpfit3)$coef[2:4,1:2])
colnames(temp) <- c("FGcoef", "FGstd", "Coef30", "Std30", 
                    "Coef10", "Std10", "Coef5", "Std5")

round(temp[,c(1,3,5,7, 2,4,6,8)], 3)
```

We see that using pseudo-values with a complementary log-log link 
returns coefficients that are very close to the formal Fine-Gray
method, within 1/5 of a standard error or less, but with somewhat
larger estimated standard errors.  The intercept terms from the GLM
model (not shown) estimate a cumulative baseline hazard.
Other links are of course possible, and may be more appropriate in a
given situation.

One shortcoming of the pseudo-value approach is that it does not extend to
time-dependent covariates, unlike the Fine-Gray model. A second concern is that
in a data set with left-truncation, it has been shown that the final regression
coefficients will be biased whenever the truncation is covariate dependent
(personal communication, PK Andersen).

# References

