
---
title: "Problems with P-values"
subtitle: A case study
author: 
  - Norman Matloff
date: "9/28/2023"
crossref:
   labels: roman i
   subref-labels: roman i
format: 
  pdf:
     number-sections: true
output: 
   tufte::tufte_html: default
vignette: >
  %\VignetteIndexEntry{No P Values}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
library(tufte)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r}
#| include: false
library(qeML)  
```

In 2016, the American Statistical Association released its first-ever
position paper, to warn of the problems of significance testing and
"p-values."  Though the issues had been well known for many years, it was
"significant" that the ASA finally took a stand.  Let's use the **lsa**
data in this package to illustrate.

# Law School Admissions Data

According to [the Kaggle
entry](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage), this is a

> ...Law School Admissions dataset from the Law School Admissions Council
> (LSAC). From 1991 through 1997, LSAC tracked some twenty-seven thousand
> law students through law school, graduation, and sittings for bar exams.
> ...The dataset was originally collected for a study called 'LSAC
> National Longitudinal Bar Passage Study' by Linda Wightman in 1998.

Here is an overview of the variables:

```{r}
data(lsa)
names(lsa)
```

Most of the names are self-explanatory, but we'll note a few.  The two
```{marginfigure}
The 'age' variable is apparently birth year, with e.g. 67 meaning
1967.
```
decile scores are class standing in the first and third years of law
school, and 'cluster' refers to the reputed quality of the law school.
Two variables of particular interest might be the student's score on the
Law School Admission Test (LSAT) and a logical variable indicating
whether the person passed the bar examination.

# Wealth Bias in the LSAT?

There is concern that the LSAT and other similar tests may be
heavily influenced by family income, thus unfair, especially to 
underrepresented minorities.  To investigate this, let's
consider the estimated coefficients in a linear model
for the LSAT:

```{r}
w <- lm(lsat ~ .,lsa)  # predict lsat from all other variables
summary(w)
```

There are definitely some salient racial aspects here, but, staying with
the income issue, look at the coefficient for family income, 0.3009.
The p-value is essentially 0, which in an academic research journal
would classically be heralded with much fanfare, termed "very highly
significant," with a 3-star insignia.  Indeed, the latter is seen in the
output above.  But actually, the impact of family income is not 
significant in practical terms.  Here's why:

Family income in this dataset is measured by quintiles.  So
```{marginfigure}
Mathematically, testing for a 0 effect is equivalent to checking
whether the CI contains 0.  But this is missing the point of the CI,
which is to (a) give us an idea of the effect *size*, and (b) to
indicate *how accurate* our estimate is of that size.  Aspect (a) is given
by the location of the center of the interval, while (b) is seen from
the CI's radius.
```
this estimated coefficient says that,
for example, if we compare people who grew up in the bottom 20% of
income with those who were raised in the next 20%, the mean LSAT score
rises by only about 1/3 of 1 point--on a test where scores are typically
in the 20s, 30s and 40s.  The 95% confidence interval (CI),
(0.2304,0.3714), again indicates that the effect size here is very
small.
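
To make the scale concrete, here is a small arithmetic sketch that uses only
the coefficient and CI reported above.  It assumes the fitted linear model,
under which moving from the bottom to the top income quintile amounts to four
one-quintile steps; the variable names are just for illustration.

```{r}
est <- 0.3009               # reported coefficient for family income
ci <- c(0.2304, 0.3714)     # reported 95% confidence interval

ci_center <- mean(ci)       # center of the CI: the effect size estimate
ci_radius <- diff(ci) / 2   # radius of the CI: accuracy of that estimate

# implied mean LSAT difference, bottom vs. top quintile (4 steps),
# assuming the linear model holds across the whole income range
full_range <- 4 * ci
c(center = ci_center, radius = ci_radius)
full_range
```

Even across the full income range, the interval suggests a difference of
roughly 1 to 1.5 points.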

So family income is not an important factor after all,
and the significance test was highly misleading.
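
Why can such a small effect earn three stars?  The standard error shrinks as
the sample grows, so with enough data almost any nonzero effect produces a
tiny p-value.  The following sketch uses simulated data, not the **lsa**
data; the intercept of 36, the slope of 0.3, the noise standard deviation of
5 and the two sample sizes are all assumptions chosen only for illustration.

```{r}
set.seed(1)

# Simulated data only: every constant below is an illustrative assumption.
sim_income_fit <- function(n) {
  quintile <- sample(1:5, n, replace = TRUE)        # income quintile, 1 to 5
  score <- 36 + 0.3 * quintile + rnorm(n, sd = 5)   # small true effect
  fit <- lm(score ~ quintile)
  c(n = n,
    estimate = unname(coef(fit)["quintile"]),
    p_value = coef(summary(fit))["quintile", "Pr(>|t|)"],
    ci_lower = confint(fit)["quintile", 1],
    ci_upper = confint(fit)["quintile", 2])
}

sim_small <- sim_income_fit(200)     # modest sample
sim_large <- sim_income_fit(20000)   # large sample, same true effect
rbind(sim_small, sim_large)
```

The true effect is the same in both runs.  Only the sample size differs, yet
that alone is enough to push the p-value in the large sample toward 0, while
the CI keeps reporting that the effect is small.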

# But Aren't There Settings in Which Significance Testing Is of Value?

Some who read the above may object, "Sure, there sometimes may be a
difference between statistical significance and practical significance.
But I just want to check whether my model fits the data."  Actually,
it's the same problem.

*"I just want to check whether my model fits the data"*

For instance, suppose we are considering adding an interaction term
between race and undergraduate GPA to our above model.   Let's fit this
more elaborate model, then compare.

```{r}
w1 <- lm(lsat ~ .+race1:ugpa,lsa)  # add interaction 
summary(w1)
```

Indeed, the Black and white interaction terms with undergraduate GPA are
"very highly significant."  But does that mean we should use the more
complex model?  Let's check the actual impact of including the
interaction terms, by doing predictions of both models on an example X
value:

```{r}
typx <- lsa[1,-5]  # set up an example case
predict(w,typx)  # no-interaction model
predict(w1,typx)  # with-interaction model
```

We see here that adding the interaction term changed the
predictions--and the estimated value of the regression function--by only
about 0.02 out of a 40.23 baseline.  So, while the test has validated
our with-interaction model, we may well prefer the simpler,
no-interaction model.
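
A single example case may not be representative, so one might compare the
two models' predictions over many cases.  Here is a minimal sketch on
simulated data (not the **lsa** data); the two-level `group` factor, the
coefficients and the sample size are assumptions made for illustration.

```{r}
set.seed(2)
n_sim <- 50000

# Simulated data only: a small but real interaction between gpa and group.
sim <- data.frame(
  gpa = runif(n_sim, 2, 4),
  group = factor(sample(c("a", "b"), n_sim, replace = TRUE))
)
sim$score <- 30 + 3 * sim$gpa + 0.6 * (sim$group == "b") * sim$gpa +
  rnorm(n_sim, sd = 5)

fit_main <- lm(score ~ gpa + group, data = sim)   # no interaction
fit_int <- lm(score ~ gpa * group, data = sim)    # with interaction

# F-test p-value for adding the interaction term
int_p <- anova(fit_main, fit_int)[2, "Pr(>F)"]

# how much the predictions of the two models actually differ
pred_gap <- abs(predict(fit_int, sim) - predict(fit_main, sim))
c(p_value = int_p, mean_gap = mean(pred_gap), max_gap = max(pred_gap))
```

In this setup the test strongly favors the interaction model, yet the two
sets of predictions differ by only a fraction of a point on scores that
average around 40.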

*"I just want to know whether the effect is positive or negative"*

Here we have a different problem, bias.  Our linear model is just that,
a model, and its imperfection will induce a bias.  This could change the
estimated effect from positive to negative or vice versa, **even with an
infinite amount of data**.  With larger and larger dataset size n, the
variance of estimated parameters goes to 0, but the bias won't go away.
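
As a sketch of how this can happen, consider one particular kind of model
imperfection, a relevant variable left out of the model.  The data below are
simulated, and the variable names and coefficients are assumptions chosen for
illustration.  The true coefficient of `x` is +1, but the fitted slope
settles near -0.5 no matter how large n becomes, while its standard error
shrinks toward 0.

```{r}
set.seed(3)

# Simulated data only: y depends on x and z, but the fitted model omits z.
slope_without_z <- function(n) {
  z <- rnorm(n)
  x <- z + rnorm(n)                 # x is correlated with z
  y <- 1 * x - 3 * z + rnorm(n)     # true coefficient of x is +1
  fit <- lm(y ~ x)                  # imperfect model: z is left out
  coef(summary(fit))["x", c("Estimate", "Std. Error")]
}

# same imperfect model at a moderate and a large sample size
bias_demo <- sapply(c(n_1e3 = 1e3, n_1e5 = 1e5), slope_without_z)
bias_demo
```

More data makes the estimate more precise, but precise around the wrong
value; a significance test on the slope would confidently report the wrong
sign.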

*Bottom line:*

We must not take small p-values literally.

# What Is the Underlying Problem, and Its Implications?

The central issue in the above examples, and essentially in any other
testing situation, is that *a significance test is not answering the
question of interest to us.*  

We wish to know whether family income plays a substantial role in the
LSAT, not whether there is any relation at all, no matter how
meaningless.  Similarly, we wish to know whether the interaction between
race and GPA is substantial enough to include it in our model, not
whether there is any interaction at all, no matter how tiny.

The question at hand in research studies is rarely, if ever, whether a
quantity is 0.000... to infinitely many decimal places.  And as noted, 
our measuring instrument is not this accurate in the first place;
there will always be systemic bias, in our model, our dataset 
and so on.

Thus in almost all cases, significance tests don't address the issue of
interest, which is whether some population quantity is substantial
enough to be considered important.  Analysts should not be misled by
words like "significant." [Modern statistical
practice](https://tinyurl.com/2s7x6h2v) places reduced value, or in the
view of many, no value at all, on significance testing.

