---
title: Combined Regression
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{vignette}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

It is considered bad statistical practice to dichotomise continuous outcomes, but some applications require predicted probabilities rather than predicted values. To obtain predicted values, we recommend to model the original continuous outcome with *linear regression*. To obtain predicted probabilities, we recommend not to model the artificial binary outcome with *logistic regression*, but to model the original continuous outcome and the artificial binary outcome with *combined regression*.

## Installation

Install the current release from [CRAN](https://CRAN.R-project.org/package=cornet):

```{r,eval=FALSE}
install.packages("cornet")
```

Or install the development version from [GitHub](https://github.com/rauschenberger/cornet):

```{r,eval=FALSE}
#install.packages("devtools")
devtools::install_github("rauschenberger/cornet")
```

Then load and attach the package:

```{r}
library(cornet)
```

## Example

We simulate data for $n$ samples and $p$ features, in a high-dimensional setting ($p \gg n$). The matrix $\boldsymbol{X}$ with $n$ rows and $p$ columns represents the features, and the vector $\boldsymbol{y}$ of length $n$ represents the continuous outcome.

```{r,eval=FALSE}
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n*p),nrow=n,ncol=p)
beta <- rbinom(n=p,size=1,prob=0.05)
y <- rnorm(n=n,mean=X%*%beta)
```

We use the function `cornet` for modelling the original continuous outcome and the artificial binary outcome. The argument `cutoff` splits the samples into two groups, those with an outcome less than or equal to the cutoff, and those with an outcome greater than the cutoff.

```{r,eval=FALSE}
model <- cornet(y=y,cutoff=0,X=X)
model
```

The function `coef` returns the estimated coefficients. The first column is for the linear model (beta), and the second column is for the logistic model (gamma). The first row includes the estimated intercepts, and the other rows include the estimated slopes.

```{r,eval=FALSE}
coef <- coef(model)
```

The function `predict` returns fitted values for training data, or predicted values for testing data. The argument `newx` specifies the feature matrix. The output is a matrix with one column for each model.

```{r,eval=FALSE}
predict <- predict(model,newx=X)
```

The function `cv.cornet` measures the predictive performance of combined regression by nested cross-validation, in comparison with logistic regression.

```{r,eval=FALSE}
cv.cornet(y=y,cutoff=0,X=X)
```

Here we observe that combined regression outperforms logistic regression (lower logistic deviance), and that logistic regression is only slightly better than the intercept-only model.

# References

Armin Rauschenberger
[![AR](https://info.orcid.org/wp-content/uploads/2019/11/orcid_16x16.png)](https://orcid.org/0000-0001-6498-4801)
and Enrico Glaab
[![EG](https://info.orcid.org/wp-content/uploads/2019/11/orcid_16x16.png)](https://orcid.org/0000-0003-3977-7469)
(2024). "Predicting dichotomised outcomes from high-dimensional data in biomedicine". *Journal of Applied Statistics* 51(9):1756-1771. [doi: 10.1080/02664763.2023.2233057](https://doi.org/10.1080/02664763.2023.2233057)

<!--
# Example

This is a confusing example because we would normally use ordinal regression!

Here we analyse the data from Pinho et al. (2016), available under the Gene Expression Omnibus ([GEO](https://www.ncbi.nlm.nih.gov/geo/)) accession number [GSE80599](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80599). Our aim is to predict disease progression from gene expression.

The ordinal outcome $\boldsymbol{y}$ (MDS-UPDRS item 3.12, postural instability) ranges from $0$ (normal) to $4$ (severe), but Pinho et al. (2016) model the binary outcome $\boldsymbol{z}=\mathbb{I}(\boldsymbol{y} \geq 1)$, which indicates slow $(0)$ or rapid $(1)$ progression. 

## Data

We use the Bioconductor packages [`GEOquery`](https://doi.org/10.18129/B9.bioc.GEOquery) and [`Biobase`](https://doi.org/10.18129/B9.bioc.Biobase) for obtaining the data $(\approx 150$MB$)$:

```{r,eval=FALSE}
#install.packages("BiocManager")
#BiocManager::install(c("GEOquery","Biobase"))
data <- GEOquery::getGEO(GEO="GSE80599")[[1]]
pheno <- Biobase::pData(data)
y <- as.numeric(pheno$`updrs-mds3.12 score:ch1`)
age <- as.numeric(pheno$`age at examination (years):ch1`)
gender <- ifelse(pheno$`gender:ch1`=="Female",1,0)
X <- cbind(age,gender,t(Biobase::exprs(data)))
```

The vector $\boldsymbol{y}$ of length $n$ represents the outcome, and the matrix $\boldsymbol{X}$ with $n$ rows and $p$ columns represents the features. The first two columns include demographic variables (age, gender), and the other columns include the gene expression data.

## Filter

We test for marginal association between the ordinal outcome and the features. The smallest adjusted $p$-values is insignificant, and the $p$-value distribution is not right-skewed. There is probably not much signal in the data.

```{r,eval=FALSE}
pvalue <- apply(X,2,function(x) cor.test(x,y)$p.value)
min(p.adjust(pvalue))
hist(pvalue)
```

We filter the features to artificially increase the signal-to-noise ratio. This filtering leads to *bias* and *overfitting*, and must not be done in practical applications!

```{r,eval=FALSE}
cor <- abs(cor(y,X,method="spearman"))
X <- X[,cor>0.3] # forbidden!
```

Pinho R, Guedes LC, Soreq L, Lobo PP, Mestre T, Coelho M, et al. (2016). "Gene Expression Differences in Peripheral Blood of Parkinson’s Disease Patients with Distinct Progression Profiles". *PLoS ONE* 11(6):e0157852. [doi:10.1371/journal.pone.0157852](https://doi.org/10.1371/journal.pone.0157852)

# Cognitive impairment

Rani et al. (2017): GSE97644

```{r,eval=FALSE}
#install.packages("BiocManager")
#BiocManager::install("GEOquery")
files <- GEOquery::getGEOSuppFiles("GSE97644")
pheno <- read.csv(textConnection(readLines(rownames(files)[1])))
y <- pheno$MOCA.Score
gender <- ifelse(pheno$Gender=="Female",1,0)
age <- pheno$Age
geno <- t(read.csv(textConnection(readLines(rownames(files)[2])),row.names=1))
X <- cbind(gender,age,geno)
```

```{r,eval=FALSE}
net <- cornet::cornet(y=y,cutoff=25,X=X)
set.seed(1)
cornet:::cv.cornet(y=y,cutoff=25,X=X)
```

# Other

```{r,eval=FALSE}
files <- GEOquery::getGEOSuppFiles("GSE95640")
X <- t(read.csv(textConnection(readLines(rownames(files)[1])),row.names=1))
y <- GEOquery::getGEO(GEO="GSE95640")[[1]] # no numeric outcome
```

```{r,eval=FALSE}
data <- GEOquery::getGEO(GEO="GSE109597")[[1]]
y <- as.numeric(Biobase::pData(data)$"bmi:ch1")
X <- t(Biobase::exprs(data))
cornet:::cv.cornet(y=y,cutoff=25,X=X,alpha=0)
```

Rani A, O'Shea A, Ianov L, Cohen RA et al. (2017). "miRNA in Circulating Microvesicles as Biomarkers for Age-Related Cognitive Decline". Frontiers in Aging Neuroscience 9:323. [doi:10.3389/fnagi.2017.00323](https://doi.org/10.3389/fnagi.2017.00323)
-->

<!--
$(\hat{\boldsymbol{\beta}})$
$(\hat{\boldsymbol{\gamma}})$
$(\hat{\beta}_0$ and $\hat{\gamma}_0)$
$(\hat{\beta}_j$ and $\hat{\gamma}_j$ $\forall j \in \{1,\ldots,p\})$

### Example ###
```{r,eval=FALSE}
#install.packages("BiocManager")
#BiocManager::install("mixOmics")
set.seed(1)
data(liver.toxicity,package="mixOmics")
X <- as.matrix(liver.toxicity$gene)
Y <- liver.toxicity$clinic
cornet <- cornet::cornet(y=Y$BUN.mg.dL.,cutoff=15,X=X)
cornet:::cv.cornet(y=Y$BUN.mg.dL.,cutoff=15,X=X)

loss <- list()
for(i in seq_along(Y)){
  loss[[i]] <- cornet:::cv.cornet(y=Y[[i]],cutoff=median(Y[[i]]),alpha=0,X=X)
}
sapply(loss,function(x) x$deviance)
```
-->