
\documentclass[11pt]{article}

\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage{indentfirst}
\usepackage{natbib}
\usepackage[colorlinks=true,allcolors=blue]{hyperref}
\usepackage{url}
\usepackage{doi}

\newcommand{\fatdot}{\,\cdot\,}

\newcommand{\abs}[1]{\lvert #1 \rvert}

\let\code=\texttt

\DeclareMathOperator{\pr}{pr}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Fuzzy Rank Tests and Confidence Intervals}

\begin{document}

\title{Fuzzy Rank Tests and Confidence Intervals}
\author{Charles J. Geyer}
\maketitle

\begin{abstract}
This document shows how to do exact-exact (rather than only
conservative-exact) sign, signed rank, and rank sum hypothesis tests,
whether or not there are tied ranks.  It also shows how to do the
corresponding confidence intervals.

Exact-exact procedures must be either randomized or fuzzy.
This package provides the latter.
\end{abstract}

\section{License}

This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License
\url{http://creativecommons.org/licenses/by-sa/4.0/}.

\section{R}

\begin{itemize}

\item The version of R used to make this document is \Sexpr{getRversion()}.

\item The version of the \texttt{knitr} package used to make this document is
    \Sexpr{packageVersion("knitr")}.

\item The version of the \texttt{fuzzyRankTests} package used to make this
    document is \Sexpr{packageVersion("fuzzyRankTests")}.

\end{itemize}

<<options-width,include=FALSE,echo=FALSE>>=
options(keep.source = TRUE, width = 80)
@

<<libraries>>=
library(fuzzyRankTests)
@

\section{Introduction}

\subsection{What This is About}

We deal with three tests of statistical hypotheses:
\begin{itemize}
\item the sign test,
\item Wilcoxon's signed rank test, and
\item Wilcoxon's rank sum test (also called Mann--Whitney).
\end{itemize}
And we deal with two issues with these.
\begin{itemize}
\item As with all tests having discrete test statistics, exact tests are
    impossible unless the test is randomized.
\item Tied data and tied ranks complicate the situation.
\end{itemize}
Assumptions:
\begin{itemize}
\item One Sample or Paired Comparison
\begin{itemize}
\item Sign test: no assumptions.
\item Signed rank test: symmetric population distribution.
\item $t$ test: normal population distribution.
\end{itemize}
\item Two Independent Samples
\begin{itemize}
\item Rank sum test: one population distribution is the other shifted.
\item $t$ test: both population distributions normal with same variance.
\end{itemize}
\end{itemize}
This package does not do $t$ tests; see R function \code{t.test} in core R
for that.  We only include them to show that the assumptions get
more restrictive as one goes down the list.

For non-fuzzy tests the assumptions above need an additional assumption
that the population distribution is continuous so there are no tied data
or tied ranks.  As will be seen, fuzzy tests and confidence intervals do
not need this assumption.

\subsection{Fuzzy Tests and Confidence Intervals}

Despite being the official theory of testing statistical hypotheses
since it was invented by Neyman and Pearson in the 1930s
\citep[Chapters~3 and~4]{tsh-4th-ed} and despite being taught to all
PhD statistics students, the theory
of randomized hypothesis tests gets little application (I have never seen it
used) because of the arbitrariness of the artificial randomization.
Two statisticians can analyze exactly the same data using exactly the
same hypothesis test and come to opposite decisions due to the artificial
randomization.

\citet{geyer-meeden} proposed a simple fix for this issue: ``unrandomize''
randomized tests in the sense that one reports not a decision or a $P$-value
or a confidence interval that purports to be a realization of some random
process (the artificial randomness in the hypothesis test) but rather
(a description of) the probability distribution of that random quantity.
That is, we report \emph{abstract} randomness rather than \emph{realized}
randomness.

In more detail, a randomized hypothesis test rejects the null hypothesis
with probability $\phi(X)$ when test statistic $X$ is observed.
This function $\phi$ is called the \emph{critical function} of the test.
\citet{geyer-meeden} point out that the critical function also depends on
the significance level $\alpha$ and the value of the parameter hypothesized
under the null hypothesis (for one-tailed tests, the boundary point of the
composite null hypothesis).  So they write the critical function
$\phi(x, \alpha, \theta)$.  And they say the result of the test is to report
this critical function, not some realization of some random variable related
to it.

\citet{geyer-meeden} go on to point out three different interpretations of
the critical function.
\begin{itemize}
\item The function $\phi(\fatdot, \alpha, \theta)$ is the critical function
    of the randomized test, as considered classically.
\item The function $\phi(x, \fatdot, \theta)$ is (the distribution function
    of) the abstract randomized (also called \emph{fuzzy}) $P$-value of
    the randomized test.
\item The function $1 - \phi(x, \alpha, \fatdot)$ is (the membership function
    of) the \emph{fuzzy confidence interval} that is dual to the randomized
    test.
\end{itemize}

There is no difference between $\phi(x)$ used classically
and $\phi(x, \alpha, \theta)$ used by \citet{geyer-meeden} when considered
as a function of $x$ for fixed $\alpha$ and $\theta$.  It is the same function
of $x$ either way.  \citet{geyer-meeden} say what one should report is the
number $\phi(x, \alpha, \theta)$ rather than a decision (accept or reject
the null hypothesis) that purportedly has this number as its probability
of rejection.

In order for the function $\phi(x, \fatdot, \theta)$ to be a distribution
function, the hypothesis test need only have nested critical regions
\citep[equation~(1.4) and the surrounding discussion]{geyer-meeden} and be
continuous (which property our applications have).  If we were to generate
a random variable $P$ having this distribution function, then rejecting the
null hypothesis when $P < \alpha$ is the classical randomized test.  Hence
this is the $P$-value of that test.
\citet{geyer-meeden} are only saying that rather than simulating such a $P$
and reporting that number, one should report its distribution as described
by the distribution function $\phi(x, \fatdot, \theta)$ or perhaps by the
probability density function of that distribution.

The function $1 - \phi(x, \alpha, \fatdot)$ takes values between zero and one,
including (if the test is actually randomized) values strictly between zero
and one.  \citet{geyer-meeden} suggest we interpret this as the membership
function of a fuzzy set, as in fuzzy set theory \citep*{fuzzy-book}.  One
interprets the membership function as saying to what degree the point is
in the fuzzy set.  \citet{geyer-meeden} say one should interpret it like
partial credit on a test question.  After all, that is what probability does.
The coverage probability of the interval is
$$
   E_\theta\{1 - \phi(X, \alpha, \theta)\} = 1 - \alpha
$$
and this means point $x$ is being given ``partial credit''
$1 - \phi(x, \alpha, \theta)$ when $\theta$ is the true unknown parameter
value.

\subsection{Tied Data or Tied Ranks}

Tied data (data points tied with the hypothesized value under the null
hypothesis) or tied ranks (for the signed rank test
or for the rank sum test) bring more issues.  We deal with these using the
methods of \citet{thompson-geyer}.

Now our model has data in two parts: the observable part $x$ and the
unobservable part $y$ (also called missing data, latent variables,
random effects, or hidden layer).  So we write the critical function
of our randomized test $\psi(x, y, \alpha, \theta)$.  Then
\begin{equation} \label{eq:average-critical-function}
   \phi(x, \alpha, \theta) = E_\theta\{ \psi(x, Y, \alpha, \theta) \}
\end{equation}
is the critical function for the test based on the observed data $x$.

\subsection{What this Package Does}
\label{sec:what-we-do}

For all three hypothesis tests this package does, the null distribution of
the test statistic is discrete and symmetric.  Let $T$ be the test statistic
for an upper tailed test and $\tau$ be the center of symmetry of its null
distribution.  Then $- T$ is the test statistic for the lower tailed test,
and $\abs{T - \tau}$ is the test statistic for the two-tailed test.

In all three cases, the fuzzy $P$-value is uniformly distributed on the
interval with endpoints $\pr_\theta(W > w)$ and $\pr_\theta(W \ge w)$,
where $W$ is the test statistic considered as a random variable and $w$
is its observed value.

Hence the critical function of the test is
$$
   \phi(w, \alpha, \theta)
   =
   \begin{cases}
       0, & \alpha \le \pr_\theta(W > w) \\
       \frac{\alpha - \pr_\theta(W > w)}{\pr_\theta(W = w)}, &
           \pr_\theta(W > w) < \alpha < \pr_\theta(W \ge w) \\
       1, & \pr_\theta(W \ge w) \le \alpha
   \end{cases}
$$
when there are no ties in the data or the ranks.
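
As a minimal sketch of this formula (not the package's implementation), take
the upper-tailed sign test, for which the null distribution of $W$ is
binomial with sample size $n$ and success probability $1/2$.  The function
name \code{phi.sign} and the values of \code{n} and \code{alpha} below are
made up for illustration; only base R functions are used.
<<critical-function-sketch>>=
# critical function of the upper-tailed sign test without ties
# (illustrative helper, not part of fuzzyRankTests)
phi.sign <- function(w, alpha, n) {
    lower <- pbinom(w, n, 1 / 2, lower.tail = FALSE)      # pr(W > w)
    upper <- pbinom(w - 1, n, 1 / 2, lower.tail = FALSE)  # pr(W >= w)
    # (upper - lower) is pr(W = w); clamping to [0, 1] gives the
    # first and third cases of the formula
    pmin(1, pmax(0, (alpha - lower) / (upper - lower)))
}
n <- 10
alpha <- 0.05
# probability of rejecting at level alpha when w = 8 is observed
phi.sign(8, alpha, n)
# exactness check: the rejection probability averaged over the null
# distribution of W should equal alpha up to rounding error
all.equal(sum(dbinom(0:n, n, 1 / 2) * phi.sign(0:n, alpha, n)), alpha)
@
The second computation illustrates the sense in which the randomized (hence
also the fuzzy) test is exact-exact: its size is $\alpha$ rather than
something smaller.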

When there are ties in the data or the ranks, we assume the data have been
measured with inadequate precision.  If more precise measurement had been
used there would be no ties in the data or the ranks.  We assume that all
orderings of the hypothetical precise data consistent with the observed
(imprecise) data are equiprobable (since there is no data favoring any such
ordering).

Thus the critical function when there are ties is just the average
\eqref{eq:average-critical-function} of the critical functions
for the precise data (with no ties)
consistent with the observed imprecise data.
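
The following sketch illustrates this averaging for an upper-tailed sign
test.  It rests on an assumption that is our reading of the equiprobability
statement above rather than a description of the package's code: each data
value tied with the hypothesized value would, if measured precisely, be
positive or negative with probability $1/2$ each, independently of the
others, so the number of ties that turn out positive is binomial.  All names
and numbers in the chunk are made up for illustration.
<<ties-average-sketch>>=
# critical function of the upper-tailed sign test without ties
# (illustrative helper, not part of fuzzyRankTests)
phi.no.ties <- function(w, alpha, n) {
    lower <- pbinom(w, n, 1 / 2, lower.tail = FALSE)      # pr(W > w)
    upper <- pbinom(w - 1, n, 1 / 2, lower.tail = FALSE)  # pr(W >= w)
    pmin(1, pmax(0, (alpha - lower) / (upper - lower)))
}
n <- 15       # sample size
u <- 11       # values strictly above the hypothesized value
k <- 2        # values tied with the hypothesized value
alpha <- 0.05
j <- 0:k      # possible numbers of ties that are positive if untied
# ASSUMPTION: weights for the untied data sets are binomial(k, 1 / 2)
wts <- dbinom(j, k, 1 / 2)
# average of the no-ties critical functions over the untied data sets
phi.ties <- sum(wts * phi.no.ties(u + j, alpha, n))
phi.ties
@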

\subsection{Ordered Categorical Data}

We do not recommend the procedures in this package as competitors for
procedures for ordered categorical data \citep[Sections~8.2 and~8.3]{agresti}.
If one has ordered categorical response data, then one should probably use
statistical models and procedures designed specifically for that.

But if the ordered categories have arisen from imprecise measurement,
then one could also justify using the fuzzy procedures this package provides
for such data.

\subsection{Other Procedures for Tied Data or Tied Ranks}

We take \citet*{hollander-et-al} to be authoritative about existing practice.

\subsubsection{Sign Test}

For the sign test, their recommended procedure is to report the usual
$P$-value for a discrete test: $\pr_\theta(W \ge w)$ when there are no
ties (data values equal to the value hypothesized by the null hypothesis).

When there are ties, \citet[Subsection Ties of Section~3.4]{hollander-et-al}
say one should eliminate the ties from the data and then proceed as above.

We say this is unacceptable.  It is cherry-picking data that favor the
alternative hypothesis (suppressing data that favor the null hypothesis).
This correction for ties, although widely used, can never be justified.

To be fair to \citet{hollander-et-al}, they say (Comment~34 of Section~3.4)
that one should not do their recommended procedure when the number of ties
``represent a sizable percentage of the total.''  So they already recognize
the wrongness.  They also give two other procedures.
\begin{itemize}
\item A randomized procedure, which is what we ``unrandomize,'' turning it
    into a fuzzy $P$-value.  They do not like randomized procedures and hence
    do not recommend them.  Neither do we; hence the unrandomization,
    which escapes their criticism.
\item A conservative procedure that counts all ties in favor of the null
    hypothesis.  Our procedure also calculates this: its $P$-value is the
    upper endpoint of the support of the distribution of our fuzzy $P$-value.
    So we take that into account (including exactly how conservative it is).
\end{itemize}

\subsubsection{Signed Rank Test}

This section is much like the preceding one, \emph{mutatis mutandis}.
The issues surrounding exactness and ties are much the same.  Ranks
bring in a few technical details, which we do not need to emphasize
because the computer does all the work dealing with them.

For the signed rank test, the recommended procedure of
\citet[Section~3.1]{hollander-et-al} is to report the usual
$P$-value for a discrete test: $\pr_\theta(W \ge w)$ when there are no
ties (either data values equal to the value hypothesized by the null
hypothesis or tied ranks).

When there are ties, \citet[Subsection Ties of Section~3.1]{hollander-et-al}
say one should (i) eliminate data values equal to the value hypothesized by
the null hypothesis and (ii) use average ranks when there are tied ranks.
Using average ranks changes the null distribution of the test statistic
to something not easily understood, so one uses the asymptotic normal
distribution of the test statistic under the null hypothesis, which has
its asymptotic variance corrected for ties.

We say (i) is unacceptable.  It is cherry-picking data that favor the
alternative hypothesis (suppressing data that favor the null hypothesis).
Although widely used, it can never be justified.

We also do not need (ii) because we use unrandomized randomized tests
(Section~\ref{sec:what-we-do} above) instead.

To be fair to \citet{hollander-et-al}, they say (Comments~9 and~10
of Section~3.1)
that one should not do their recommended procedure unless the ``zero values
are a very small percentage'' of the total.  So they already recognize
the wrongness.  They also give two other procedures.
\begin{itemize}
\item A randomized procedure, which is what we ``unrandomize,'' turning it
    into a fuzzy $P$-value.  They do not like randomized procedures and hence
    do not recommend them.  Neither do we; hence the unrandomization,
    which escapes their criticism.
\item A conservative procedure that counts all ties in favor of the null
    hypothesis.  Our procedure also calculates this: its $P$-value is the
    upper endpoint of the support of the distribution of our fuzzy $P$-value.
    So we take that into account (including exactly how conservative it is).
\end{itemize}
They also discuss (Comment~11 of Section~3.1)
another procedure that keeps the tied ranks but uses intensive computation
to calculate the exact permutation distribution conditioning on the pattern
of ties.  Since we have an alternative, we are not interested in this either.

\subsubsection{Rank Sum Test}

For some reason, the discussion in \citet{hollander-et-al} of this test is
not parallel to the other two.  They do not discuss randomized versions of
this test, although they obviously exist and work just as well as for the
other two.  Hence this package does the fuzzy hypothesis tests and confidence
intervals that are justified in the same way as for the other two procedures.

\section{Examples}

\subsection{Sign Test}

\subsubsection{No Zero Values}

For an example with no zero values,
we do Example~3.5 in \citet{hollander-et-al}.
<<beak-data-and-test>>=
z <- c(-0.8, 7.5, 46.9, 17.6, -4.6, 54.0, 48.3, 3.9, 16.7,
    19.7, -8.5, 7.1, 40.7, 23.8, 14.8, 20.6, 25.0, 24.7,
    -1.8, 21.9, 4.7, 24.7, 52.8, 8.5, 1.9)
fuzzy.sign.test(z, alternative = "greater")
@
Since (the support of the distribution of) the fuzzy $P$-value is far below
common criteria of statistical significance, this is strong evidence against
the null hypothesis.  Note that the upper endpoint of the support of (the
distribution of) the fuzzy $P$-value is the conventional $P$-value given
by \citet{hollander-et-al}.
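
The endpoints of the support can be checked by hand with base R, as a rough
sketch: under the null hypothesis the number of positive differences is
binomial with success probability $1/2$, and 21 of the 25 differences above
are positive.  The variable names are ours, chosen for illustration, and the
result should agree with the range reported above only if our reading of the
upper-tailed test is right.
<<beak-endpoints-sketch>>=
n <- 25   # number of differences in the data above
w <- 21   # number of positive differences
# endpoints pr(W > w) and pr(W >= w) of the support of the fuzzy P-value
endpoints <- c(lower = pbinom(w, n, 1 / 2, lower.tail = FALSE),
    upper = pbinom(w - 1, n, 1 / 2, lower.tail = FALSE))
endpoints
@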

A 95\% fuzzy confidence interval for the median difference is given by
<<beak-ci, fig.align="center", fig.cap="95\\% fuzzy confidence interval for Example 3.5 of Hollander et al. (2014).\\@  Interval dual to sign test.">>=
fuzzy.sign.ci(z) |> plot()
@
Figure~\ref{fig:beak-ci} shows (the membership function of) this fuzzy
confidence interval.  Although we say this example has no ties, that means
it has no ties at the hypothesized value under the null hypothesis, which
in this case is zero.  It does have ties at the upper endpoint of the support
of the fuzzy confidence interval, which affects the value at that point.

\subsubsection{With Zero Values}

For an example with zero values, we make up some data.
<<sign.test.with.zeroes>>=
z <- c(-1.3, -0.4, 0.0, 0.0, 0.3, 0.5, 0.9, 1.1, 1.1, 1.1, 2.3,
    2.5, 3.1, 4.5, 5.5)
fuzzy.sign.test(z)
@

This might be called borderline statistically significant.  It is equivocal.

We can plot the probability density function
(Figure~\ref{fig:sign.test.with.zeroes.plot.pdf}).
<<sign.test.with.zeroes.plot.pdf, fig.align="center", fig.cap="PDF of Fuzzy P-value.">>=
fuzzy.sign.test(z) |> plot()
@

Or we can plot the cumulative distribution function
(Figure~\ref{fig:sign.test.with.zeroes.plot.cdf}).
<<sign.test.with.zeroes.plot.cdf, fig.align="center", fig.cap="CDF of Fuzzy P-value.">>=
fuzzy.sign.test(z) |> plot(type = "cdf")
@

It is left as an exercise for the interested reader to
remove the zeroes from the data, redo the test, and then try to defend those
results.  (We do not think any defense can be valid.)

The interpretation of the PDF (Figure~\ref{fig:sign.test.with.zeroes.plot.pdf})
is that the area under the curve to the left of $\alpha$ is the probability
the null hypothesis is rejected at level $\alpha$.

The interpretation of the CDF (Figure~\ref{fig:sign.test.with.zeroes.plot.cdf})
is that the height of the curve at $\alpha$ is the probability
the null hypothesis is rejected at level $\alpha$.
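
These two interpretations are the same statement.  Writing $f$ for the PDF
and $F$ for the CDF of the fuzzy $P$-value (notation introduced here only
for illustration), and recalling that $\phi(x, \fatdot, \theta)$ is the
distribution function of the fuzzy $P$-value, we have
$$
   \phi(x, \alpha, \theta) = F(\alpha) = \int_0^\alpha f(p) \, d p
$$
so the area to the left of $\alpha$ in the PDF plot is the height at $\alpha$
in the CDF plot.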

The 95\% fuzzy confidence interval is shown in
Figure~\ref{fig:sign.ci.with.zeroes}.
<<sign.ci.with.zeroes, fig.align="center", fig.cap="95\\% fuzzy confidence interval for made-up data with ties.\\@  Interval dual to sign test.">>=
fuzzy.sign.ci(z) |> plot()
@

\subsection{Signed Rank Test}

Again, to illustrate the issues with ties, we just make up some data.
Figure~\ref{fig:signed.rank.pdf} is the PDF of the fuzzy $P$-value.
<<signed.rank.pdf, fig.align="center", fig.cap="PDF of fuzzy P-value.\\@  Signed rank test for made-up data.">>=
z <- c(-2.2, -1.3, -0.3, 0.0, 0.0, 0.3, 0.5, 0.9, 1.1, 1.3,
    1.3, 2.3, 2.5, 3.1, 4.5, 5.5)
fuzzy.signrank.test(z) |> plot()
@

And Figure~\ref{fig:signed.rank.cdf} is the CDF of the fuzzy $P$-value.
<<signed.rank.cdf, fig.align="center", fig.cap="CDF of fuzzy P-value.\\@  Signed rank test for made-up data.">>=
fuzzy.signrank.test(z) |> plot(type = "cdf")
@

And Figure~\ref{fig:signed.rank.ci} is (the membership function of)
the 95\% fuzzy confidence interval.
<<signed.rank.ci, fig.align="center", fig.cap="95\\% signed rank confidence interval for made-up data.">>=
fuzzy.signrank.ci(z) |> plot()
@

\subsection{Rank Sum Test}

Again, to illustrate the issues with ties, we just make up some data.
Figure~\ref{fig:rank.sum.pdf} is the PDF of the fuzzy $P$-value.
<<rank.sum.pdf, fig.align="center", fig.cap="PDF of fuzzy P-value.\\@  Rank sum test for made-up data.">>=
x <- c(1, 2, 3, 4, 4, 4, 5, 6, 7)
y <- c(4, 5, 7, 7, 8, 9, 10, 11)
fuzzy.ranksum.test(x, y) |> plot()
@

And Figure~\ref{fig:rank.sum.ci} is (the membership function of)
the 95\% fuzzy confidence interval.
<<rank.sum.ci, fig.align="center", fig.cap="95\\% rank sum confidence interval for made-up data.">>=
fuzzy.ranksum.ci(x, y) |> plot()
@

\begin{thebibliography}{}

\bibitem[Agresti(2013)]{agresti}
Agresti, A. (2013).
\newblock \emph{Categorical Data Analysis}, third edition.
\newblock John Wiley \& Sons, Hoboken, NJ.

\bibitem[Geyer and Meeden(2005)]{geyer-meeden}
Geyer, C.~J. and Meeden, G.~D. (2005).
\newblock Fuzzy and randomized confidence intervals and $P$-values
    (with discussion).
\newblock \emph{Statistical Science}, \textbf{20}, 358--387.
\newblock \doi{10.1214/088342305000000340}.

\bibitem[Hollander, et al.(2014)Hollander, Wolfe, and Chicken]{hollander-et-al}
Hollander, M., Wolfe, D.~A., and Chicken, E. (2014).
\newblock \emph{Nonparametric Statistical Methods}, third edition.
\newblock John Wiley \& Sons, Hoboken, NJ.

\bibitem[Klir, et al.(1997)Klir, St.\@ Clair, and Yuan]{fuzzy-book}
Klir, G.~J., St.\@ Clair, U.~H., and Yuan, B. (1997).
\newblock \emph{Fuzzy Set Theory: Foundations and Applications}.
\newblock Prentice Hall, Upper Saddle River, NJ.

\bibitem[Lehmann and Romano(2022)]{tsh-4th-ed}
Lehmann, E.~L., and Romano, J.~P. (2022).
\newblock \emph{Testing Statistical Hypotheses}, fourth edition.
\newblock Springer, Cham.

\bibitem[Thompson and Geyer(2007)]{thompson-geyer}
Thompson, E.~A. and Geyer, C.~J. (2007).
\newblock Fuzzy $P$-values in latent variable problems.
\newblock \emph{Biometrika}, \textbf{94}, 49--60.
\newblock \doi{10.1093/biomet/asm001}.

\end{thebibliography}

\end{document}

