\documentclass[a4paper]{article}
\usepackage{amsmath}%the AMS math extension of LaTeX.
\usepackage{amssymb}%the extended AMS math symbols.
\usepackage{bm}%Use 'bm.sty' to get `bold math' symbols
\usepackage{natbib}
\usepackage{Sweave}
\usepackage{float}%Use `float.sty'
\usepackage[left=3.5cm,right=3.5cm]{geometry}
\usepackage{algorithmic}
\usepackage[utf8]{inputenx}

%%\VignetteIndexEntry{Statistical Methods}
%%\VignetteDepends{sensR}
\title{Statistical methodology for sensory discrimination tests and
  its implementation in \textsf{sensR}}
\author{Rune Haubo Bojesen Christensen}

%% \numberwithin{equation}{section}
\setlength{\parskip}{2mm}%.8\baselineskip}
\setlength{\parindent}{0in}

%%  \DefineVerbatimEnvironment{Sinput}{Verbatim}%{}
%%  {fontshape=sl, xleftmargin=1em}
%%  \DefineVerbatimEnvironment{Soutput}{Verbatim}%{}
%%  {xleftmargin=1em}
%%  \DefineVerbatimEnvironment{Scode}{Verbatim}%{}
%%  {fontshape=sl, xleftmargin=1em}
%%  %% \fvset{listparameters={\setlength{\topsep}{0pt}}}

%%%% \renewenvironment{Schunk}{\vspace{0mm}}{\vspace{0mm}}

\newcommand{\var}{\textup{var}}
\newcommand{\I}{\mathcal{I}}
\newcommand{\bta}{\bm \theta}
\newcommand{\ta}{\theta}
\newcommand{\tah}{\hat \theta}
\newcommand{\di}{~\textup{d}}
\newcommand{\td}{\textup{d}}
\newcommand{\Si}{\Sigma}
\newcommand{\si}{\sigma}
\newcommand{\bpi}{\bm \pi}
\newcommand{\bmeta}{\bm \eta}
\newcommand{\tdots}{\hspace{10mm} \texttt{....}}
\newcommand{\FL}[1]{\fvset{firstline= #1}}
\newcommand{\LL}[1]{\fvset{lastline= #1}}
\newcommand{\s}{\square}
\newcommand{\bs}{\blacksquare}

% figurer bagerst i artikel
%% \usepackage[tablesfirst, nolists]{endfloat}
%% \renewcommand{\efloatseparator}{\vspace{.5cm}}


%%  \usepackage{lineno}
%%  % \linenumbers
%%  \newcommand*\patchAmsMathEnvironmentForLineno[1]{%
%%  \expandafter\let\csname old#1\expandafter\endcsname\csname #1\endcsname
%%  \expandafter\let\csname oldend#1\expandafter\endcsname\csname end#1\endcsname
%%  \renewenvironment{#1}%
%%  {\linenomath\csname old#1\endcsname}%
%%  {\csname oldend#1\endcsname\endlinenomath}}%
%%  \newcommand*\patchBothAmsMathEnvironmentsForLineno[1]{%
%%   \patchAmsMathEnvironmentForLineno{#1}%
%%   \patchAmsMathEnvironmentForLineno{#1*}}%
%%  \AtBeginDocument{%
%%  \patchBothAmsMathEnvironmentsForLineno{equation}%
%%  \patchBothAmsMathEnvironmentsForLineno{align}%
%%  \patchBothAmsMathEnvironmentsForLineno{flalign}%
%%  \patchBothAmsMathEnvironmentsForLineno{alignat}%
%%  \patchBothAmsMathEnvironmentsForLineno{gather}%
%%  \patchBothAmsMathEnvironmentsForLineno{multline}%
%%  }

\begin{document}
\bibliographystyle{chicago}
\maketitle

\begin{abstract}

The statistical methodology of sensory discrimination analysis is
described. This forms the basis of the implementation in the
\textsf{sensR} package for \textsf{R}. Implementation choices will be
motivated when appropriate and examples of analysis of sensory
discrimination experiments will be given throughout using the
\textsf{sensR} package. This document currently covers
parameterizations, hypothesis
tests, confidence intervals, and power and sample size calculations
for the four common discrimination protocols: 2-AFC, 3-AFC, triangle
and duo-trio; analysis of replicated experiments with the four common
discrimination protocols using the beta-binomial and chance-corrected
beta-binomial models.

\end{abstract}

\newpage
\tableofcontents
\newpage

\SweaveOpts{echo=FALSE, results=hide, width=4.5, height=4.5}
\SweaveOpts{prefix.string=figs}
\fvset{listparameters={\setlength{\topsep}{0pt}}, gobble=0, fontsize=\small}
%% \fvset{gobble=0, fontsize=\small}
\setkeys{Gin}{width=.7\textwidth}

<<Initialize>>=

## Load common packages, functions and set settings:
library(sensR)
##
RUN <- FALSE    #redo computations and write .RData files
## Change options:
op <- options() ## To be able to reset settings
options("digits" = 7)
options(help_type = "html")
options("width" = 75)
options("SweaveHooks" = list(fig=function()
        par(mar=c(4,4,.5,0)+.5)))
options(continue=" ")

G.test <- function(x, ...) {
    ct <- eval.parent(chisq.test(x, correct = FALSE))
    ct$data.name <- deparse(substitute(x))
    o <- ct$observed
    e <- ct$expected
    ct$statistic <- 2 * sum(ifelse(o == 0, 0, o * log(o/e)))
    names(ct$statistic) <- "G-squared"
    ct$p.value <- with(ct, pchisq(statistic, parameter, lower = FALSE))
    ct$method <- "Likelihood ratio Chi-squared test"
    return(ct)
}

@

\section{Introduction}
\label{sec:introduction}

The aim of this document is 1) to describe the statistical methodology
for sensory discrimination testing and analysis, and 2) to describe
how such analyses can be performed in \textsf{R} using package
\textsf{sensR} \citep{christensen10d} co-developed by the author of
this document.

This document is divided into sections that cover topics with similar
statistical methodology. Implementation choices in the \textsf{sensR}
package will be described in connection with the statistical
methodology whenever appropriate. Small examples illustrating the use
of functions in the \textsf{sensR} package will appear throughout.

This is not a hands-on practical tutorial to the analysis of sensory
discrimination experiments with the \textsf{sensR} package, nor is
it a user-friendly introduction to discrimination and similarity
testing with sensory discrimination protocols. The former document does
not really exist\footnote{this is on the to-do list of the author of
  this document, so there is hope it will appear in the future.}
(yet), and for the latter we refer the reader to
\citet[chapter 7]{naes10}. We will assume throughout that the reader
has basic statistical training and is familiar with sensory
discrimination testing to the level of \citet[chapter 7]{naes10}.

\section{Classification of sensory discrimination protocols}
\label{sec:class-sens-discr}

%% \begin{itemize}
%% \item simple binomial response
%% \item compound binomial response
%% \item multinomial response
%% \item comparison of distances versus skimming strategy
%% \item decision strategy versus decision rule
%% \item protocols with response bias
%% \item Forced choice methods
%% \item ``nature of difference'' or ``sensory/perceptual dimension''
%%   required
%% \end{itemize}

The most common and simplest discrimination protocols comprise the
2-AFC, 3-AFC, triangle, duo-trio, A-not A and same-different
protocols. The first four protocols are designed such that the
response follows a binomial distribution in the simplest experimental
setting. Responses from the A-not A and same-different protocols, on
the other hand, follow a compound or product binomial distribution in
the simplest experimental setting. An extension of the A-not A method
known as the A-not A with sureness is a classical SDT method which
leads to multinomially distributed responses. Similarly, the
same-different method extends to the degree-of-difference protocol,
also resulting in multinomially distributed responses. An experiment
using one of the first four simple protocols can be summarized by the
proportion of correct responses, or equivalently by the probability of
discrimination or d-prime. The Thurstonian models for the remaining
protocols involve one or more additional parameters, each with their
particular cognitive interpretation.

The 2-AFC and 3-AFC protocols are so-called directional protocols
since they require that the nature of the difference (e.g.\ sweetness)
is provided as part of the assessor instructions. The triangle and
duo-trio protocols, on the other hand, are not directional since these
protocols are used to test unspecified differences. From a
Thurstonian point of view, the sensory or perceptual
dimension is fixed in the 2-AFC and 3-AFC methods. The cognitive
decision strategy is consequently assumed to differ between these two
classes of protocols. When the perceptual dimension is fixed, the
assessors may use the more effective skimming strategy, while
assessors are forced to use the inferior comparison-of-distances
strategy when using the non-directional protocols.

The A-not A and same-different protocols are methods with so-called
response bias. Response bias refers to the concept that one type of
response is preferred over another even though the sensory distance
remains unchanged. For instance, some assessors may prefer the ``A''
response over the ``not A'' response.

The four simple protocols are without response bias since no response
can be consistently preferred over another without affecting the
discriminative effect. The decision criterion is said to be fixed or
stabilized.



\section{Four common sensory discrimination protocols:\\ 2-AFC, 3-AFC,
  triangle and duo-trio}
\label{sec:simple-discr-meth}

The four common sensory discrimination protocols are often used in
practical applications in the food industry as well as in other
areas. They are also of considerable interest in the scientific
literature about sensory discrimination.

The protocols have one important thing in common from a statistical
perspective: their statistical models can all be described as variants
of the binomial distribution. That is, the answer from any one of
these protocols is either correct or incorrect and the sampling
distribution of answers is therefore a binomial distribution.

For the duo-trio and 2-AFC protocols the \emph{guessing probability},
$p_g$, is 1/2. This means that if there is no discriminative difference
between the products, then the probability of a correct
answer, $p_c$, is one half. Similarly, for the triangle and 3-AFC
protocols the guessing probability is 1/3. The four common
discrimination protocols are said to be free of \emph{response bias}
in contrast to the A-not A and same-different protocols.
%% Response bias
%% will be described in section~\ref{sec:not-same-different}.

Assume for a moment that the population of assessors (be they
judges in an expert panel or consumers) is composed of ignorants, who
are always guessing, and discriminators, who always discriminate
correctly and provide the appropriate answer (though this will not
always be the \emph{correct} answer). One way to express the sensory
distance of the objects (or the discriminative ability of the assessors
--- we will treat these viewpoints synonymously throughout)
is then the \emph{proportion of discriminators}, $p_d$, in the population of
interest. It is almost always an unreasonable assumption that
assessors are either always discriminating or always guessing
\citep{ennis93}, but we
may still talk about the \emph{probability of discrimination}. This
probability may refer to particular individuals or to a population; in
this section we will adopt a population perspective.
%% and in
%% section~\ref{sec:repl-simple-discr} we will include an individual
%% perspective.

The relation between the probability of a correct answer and the
probability of discrimination is
\begin{equation}
  \label{eq:1}
  p_c = p_g + p_d (1 - p_g),
\end{equation}
where the guessing  probability, $p_g$ is 1/2 for the duo-trio and
2-AFC protocols and 1/3 for the triangle and 3-AFC protocols. The
reverse relation is
\begin{equation}
  \label{eq:2}
  p_d = (p_c - p_g) / (1 - p_g) .
\end{equation}
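As a minimal sketch, eqs.~\eqref{eq:1} and \eqref{eq:2} can be written
out in base \textsf{R} for the triangle protocol; the value $p_d =
0.2$ is purely illustrative (\textsf{sensR}'s \texttt{rescale}
function performs these mappings for all four protocols):

```r
## Map between p_d and p_c for the triangle protocol (p_g = 1/3).
pg <- 1/3                        # guessing probability
pd <- 0.2                        # illustrative probability of discrimination
pc <- pg + pd * (1 - pg)         # eq. (1): p_c = 7/15, approx. 0.467
pd.back <- (pc - pg) / (1 - pg)  # eq. (2) recovers p_d
```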

Another way to summarize the sensory distance is through a measure
known as $d'$ (pronounced ``d-prime'') from signal detection theory
\citep[SDT,][]{green66, macmillan05}, or equivalently \emph{the Thurstonian delta},
$\delta$ \citep{thurstone27, thurstone27b, thurstone27c}.
These two concepts are identical and will be
used synonymously throughout; they are based on the same
underlying psychophysical model of the cognitive process. Whereas
$p_c$ is a measure and parameter completely free of reference to any
particular discrimination protocol, $p_d$ depends on the
discrimination protocol through the guessing probability, and $d'$
depends on the discrimination protocol through the so-called
\emph{psychometric function} for the protocol. The
psychometric function maps from $d'$ to the probability of a correct
answer:
\begin{equation}
  \label{eq:3}
  p_c = f_{ps}(d').
\end{equation}
For the $m$-AFC method, where $m$ denotes the number of ``forced
choices'', the psychometric function is given by
\begin{equation}
  \label{eq:45}
  f_{m\textup{-AFC}}(d') = \int_{-\infty}^{\infty} \phi(z - d')
  \Phi(z)^{m-1} ~\textup{d} z,
\end{equation}
where $\phi$ is the standard normal probability density function and
$\Phi$ is the standard normal cumulative distribution function.
The psychometric functions for the four common discrimination
protocols are given by
\begin{align}
  \label{eq:43}
  f_{\textup{3-AFC}}(d') =&~ \int_{-\infty}^{\infty} \phi(z-d')
  \Phi(z)^{2} ~\textup{d}z \\
  f_{\textup{2-AFC}}(d') =&~ \int_{-\infty}^{\infty} \phi(z-d')
  \Phi(z) ~\textup{d}z  = \Phi ( d' / \sqrt{2} ) \\
  f_{\textup{tri}}(d') =&~
  2 \int_{0}^{\infty} \Big\{ \Phi \left[ -z \sqrt{3} + d' \sqrt{2/3}
  \right]  + \Phi \left[ -z \sqrt{3} - d' \sqrt{2/3} \right]
  \Big\} \phi(z) ~\textup{d}z \\
  f_{\textup{duo-trio}}(d') =&~
  1 - \Phi( d' / \sqrt{2}) - \Phi( d' / \sqrt{6}) + 2 \Phi( d' /
  \sqrt{2}) \Phi( d' / \sqrt{6}).
\end{align}
Further information can be found in \citet{ennis93} and
\citet{brockhoff10}.
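The psychometric functions are straightforward to evaluate
numerically. The following base-\textsf{R} sketch covers the 2-AFC
(closed form) and 3-AFC (numerical integral) cases using only
\texttt{pnorm}, \texttt{dnorm} and \texttt{integrate}; \textsf{sensR}'s
\texttt{psyfun} provides these computations for all four protocols:

```r
## Psychometric functions for the 2-AFC and 3-AFC protocols, cf. the
## equations above.
f2afc <- function(dp) pnorm(dp / sqrt(2))        # closed form
f3afc <- function(dp)                            # numerical integral
    integrate(function(z) dnorm(z - dp) * pnorm(z)^2,
              -Inf, Inf)$value
f2afc(0)  ## 1/2: the guessing probability for 2-AFC
f3afc(0)  ## 1/3: the guessing probability for 3-AFC
```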

The relations between the three scales at which a sensory difference
is described are illustrated in
Fig.~\ref{fig:psychometricFunctions}. In the relation between $p_d$
and $d'$, the two alternative forced choice protocols behave
similarly, as do the duo-trio and triangle protocols. The
gradient of the psychometric functions (cf.\ eq.~\eqref{eq:10}) goes to
zero as $d'$ goes to zero for the duo-trio and triangle
protocols.

<<scaleRelations>>=

pd <- seq(0, 1-1e-5, length = 1e2)
coef.twoAFC <- coef(rescale(pd = pd, method = "twoAFC"))
coef.threeAFC <- coef(rescale(pd = pd, method = "threeAFC"))
pd <- seq(0, 1-1e-2, length = 1e2)
coef.triangle <- coef(rescale(pd = pd, method = "triangle"))
coef.duotrio <- coef(rescale(pd = pd, method = "duotrio"))

## OK:
tail(coef.duotrio)
tail(coef.twoAFC)
tail(coef.threeAFC)
tail(coef.triangle)

dPrime <- seq(1e-5, 6, len = 1e2)
gd1 <- psyderiv(dPrime, method = "twoAFC")
gd2 <- psyderiv(dPrime, method = "threeAFC")
gd3 <- psyderiv(dPrime, method = "duotrio")
gd4 <- psyderiv(dPrime, method = "triangle")

@

\setkeys{Gin}{width=.49\textwidth}
\begin{figure}
  \centering
<<psyFunsFig, fig=TRUE, width=5, height=5>>=

plot(c(0, 6), c(.3, 1), type = "n", xlab = "d-prime",
     ylab = "P(correct answer)", axes = FALSE)
axis(1)
axis(2, las = 1)
with(coef.twoAFC, lines(d.prime, pc, lty = 1))
with(coef.threeAFC, lines(d.prime, pc, lty = 2))
with(coef.duotrio, lines(d.prime, pc, lty = 3))
with(coef.triangle, lines(d.prime, pc, lty = 4))
legend("topleft", legend = c("2-AFC", "3-AFC", "duo-trio",
                        "triangle"), lty = 1:4, bty = "n")

@
<<psyFunsFig_dprimePd, fig=TRUE, width=5, height=5>>=

plot(c(0, 6), c(0, 1), type = "n", xlab = "d-prime",
     ylab = "P(discrimination)", axes = FALSE)
axis(1)
axis(2, las = 1)
with(coef.twoAFC, lines(d.prime, pd, lty = 1))
with(coef.threeAFC, lines(d.prime, pd, lty = 2))
with(coef.duotrio, lines(d.prime, pd, lty = 3))
with(coef.triangle, lines(d.prime, pd, lty = 4))
legend("topleft", legend = c("2-AFC", "3-AFC", "duo-trio",
                        "triangle"), lty = 1:4, bty = "n")

@
<<psyFunsFig_PcPd, fig=TRUE, width=5, height=5>>=

plot(c(0.3, 1), c(0, 1), type = "n",
     ylab = "P(correct answer)",
     xlab = "P(discrimination)", axes = FALSE)
axis(1); axis(2, las = 1)
segments(c(1/3, .5), 0, 1, 1, lty = 1:2)
legend("topleft", legend = c("3-AFC and triangle", "2-AFC and duo-trio"),
       lty = 1:2, bty = "n")

@
<<psyFunFig_psyDeriv, fig=TRUE, width=5, height=5>>=

plot(c(0, 6), c(0, .31), type = "n", axes = FALSE,
     xlab = "d-prime",
     ylab = "Grad(psychometric function)")
axis(1); axis(2, las = 1)
lines(dPrime, gd1, lty = 1)
lines(dPrime, gd2, lty = 2)
lines(dPrime, gd3, lty = 3)
lines(dPrime, gd4, lty = 4)
legend("topright", legend = c("2-AFC", "3-AFC", "duo-trio",
                     "triangle"), lty = 1:4, bty = "n")

@
  \caption{The connection between $d'$, $p_c$ and $p_d$ for the four
    common sensory discrimination protocols. The so-called
    psychometric functions, $p_c$ as a function of $d'$, are shown in
    the upper left figure.}
  \label{fig:psychometricFunctions}
\end{figure}

The result of a simple discrimination protocol is a number of correct
answers, $X = x$ out of $n$ trials. Under the assumption of independent
observations, the sampling distribution of $X$ is the binomial:
\begin{equation}
  \label{eq:4}
  X \sim \textup{Binom}(p_c, n),
\end{equation}
so
\begin{equation}
  \label{eq:11}
  P(X = x) = {n \choose x} p_c^x (1 - p_c)^{n - x} .
\end{equation}

There is a subtle but important distinction between the
\emph{proportion} of correct answers and the \emph{probability} of a
correct answer. The proportion of correct answers is $x/n$, which can
be any number between 0 and 1. The probability of a correct answer,
which we denote by $p_c$, is a parameter and represents a true
underlying value. As such, $p_c$ cannot be lower than the guessing
probability of the discrimination protocol that was used and cannot
exceed 1. The usual estimator of a binomial probability is just the
sample proportion, $x/n$, but this is not the case here, and it is
exactly this feature that makes discrimination testing statistically
interesting.

The maximum likelihood (ML) estimator\footnote{Following standard
  statistical practice, we use the hat-notation to denote an estimator
  or an estimate.} of $p_c$ is given by:
\begin{equation}
  \label{eq:5}
  \hat p_c = \left\{
  \begin{array}{lll}
    x/n & \textup{if} & x/n \geq p_g \\
    p_g & \textup{if} & x/n < p_g \\
  \end{array} \right.
\end{equation}
The ML estimator of $p_d$ follows by application of
eq.~\eqref{eq:2}, and the ML estimator of $d'$ is obtained by
inverting eq.~\eqref{eq:3}:
\begin{equation}
  \label{eq:6}
  \hat d' = f_{ps}^{-1}(\hat p_c),
\end{equation}
where $f_{ps}^{-1}(\cdot)$ (which should not be confused with
$f_{ps}(\cdot)^{-1} = 1 / f_{ps}(\cdot)$)  is the inverse psychometric
function.
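For the 2-AFC protocol the inverse psychometric function has the
closed form $d' = \sqrt{2}\, \Phi^{-1}(p_c)$, so eqs.~\eqref{eq:5} and
\eqref{eq:6} can be sketched in base \textsf{R} as follows; the data,
15 correct answers of 20, are made up:

```r
## ML estimates of p_c and d' for a hypothetical 2-AFC experiment.
x <- 15; n <- 20
pg <- 1/2
pc.hat <- max(x / n, pg)           # eq. (5): truncate at the guessing prob.
dp.hat <- sqrt(2) * qnorm(pc.hat)  # eq. (6), 2-AFC closed form; approx. 0.954
```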

The allowed ranges (parameter space) for these three parameters are
given by
\begin{equation}
  \label{eq:29}
  d' \in [0, \infty [, \quad p_d \in [0, 1], \quad p_c \in [p_g, 1 ].
\end{equation}
Negative $d'$ values are sometimes mentioned in the literature, but
they are not possible in the discrimination protocols
that we consider here. They are possible in preference tests
and theoretically possible in Thurstonian models based on other
assumptions; see section XXX for more background information on this
topic.


\subsubsection{Implementation in \textsf{sensR}}
\label{sec:implementation-1}

In package \textsf{sensR} there is a function \texttt{rescale} that
maps between the three scales: $p_c$, $p_d$ and $d'$. A value on one of
these scales is given as argument, and values on all three scales are
returned. The results respect the allowed ranges of the
parameters in eq.~\eqref{eq:29}, so if the supplied $p_c$ is less than
$p_g$, then $p_c = p_g$ is returned with $p_d$ and $d'$ at the
appropriate levels:
<<rescaleExample, results=verbatim, echo=TRUE>>=
rescale(pc = 0.25, method = "triangle")
@

Function \texttt{rescale} uses a number of auxiliary functions for its
computations; these are also directly available to the package user:
\begin{itemize}
\item \texttt{pc2pd}: maps from the $p_c$-scale to the $p_d$-scale.
\item \texttt{pd2pc}: maps from the $p_d$-scale to the $p_c$-scale.
\item \texttt{psyfun}: implements the psychometric functions $p_c =
  f_{ps}(d')$ for the four common discrimination protocols,
  cf. eq.~\eqref{eq:3}.
\item \texttt{psyinv}: implements the inverse psychometric functions,
  $d' = f_{ps}^{-1}(p_c)$ for the four common discrimination
  protocols, cf. eq.~\eqref{eq:6}.
\item \texttt{psyderiv}: implements the derivative of the psychometric
  functions, $f_{ps}'(d')$ for the four common discrimination protocols.
\end{itemize}

\subsection{Inference in simple discrimination protocols}
\label{sec:infer-simple-discr}

To obtain inference in simple discrimination protocols, we need
measures such as standard errors, confidence intervals (CIs) and
$p$-values from significance tests.

\subsubsection{Standard errors}
\label{sec:standard-errors}

The standard error of $p_c$ is given by:
\begin{equation}
  \label{eq:7}
  \textup{se}(p_c) = \sqrt{p_c (1 - p_c) / n}.
\end{equation}

The standard error of $p_d$ and $d'$ can be found by application of
the Delta method \citep[see for example][]{pawitan01}:
\begin{equation}
  \label{eq:8}
  \textup{se}\{f(x)\} = \frac{\partial f(x)}{\partial x} \textup{se}(x)
\end{equation}

The standard error of $p_d$ is therefore
\begin{equation}
  \label{eq:9}
  \textup{se}(p_d) = \frac{1}{1 - p_g} \textup{se}(p_c)
\end{equation}
since $\partial p_d / \partial p_c = 1 / (1 - p_g)$,
cf. eq.~\eqref{eq:2}.
The standard error of $d'$ can similarly be found as
\begin{equation}
  \label{eq:10}
  \textup{se}(d') = \frac{\partial f_{ps}^{-1}(p_c)}{\partial p_c}
  \textup{se}(p_c) = \frac{1}{f_{ps}'(d')} \textup{se}(p_c)
\end{equation}
where $f_{ps}'(d')$ is the derivative of the psychometric function
with respect to $d'$; expressions are given by \citet{brockhoff10}.
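A small numerical sketch of eqs.~\eqref{eq:7}, \eqref{eq:9} and
\eqref{eq:10} for a hypothetical 2-AFC experiment; for the 2-AFC
protocol the derivative of the psychometric function is $f'(d') =
\phi(d'/\sqrt{2})/\sqrt{2}$, and the data are made up:

```r
## Delta-method standard errors for a hypothetical 2-AFC experiment.
x <- 15; n <- 20; pg <- 1/2
pc <- x / n
dp <- sqrt(2) * qnorm(pc)                         # d' estimate
se.pc <- sqrt(pc * (1 - pc) / n)                  # eq. (7)
se.pd <- se.pc / (1 - pg)                         # eq. (9)
se.dp <- se.pc / (dnorm(dp / sqrt(2)) / sqrt(2))  # eq. (10), 2-AFC derivative
```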

Standard errors are only defined and only meaningful as measures of
uncertainty when the parameter estimate is at the interior of the
parameter space, i.e. when the parameter estimate is not at the
boundary of its allowed range, cf. eq.~\eqref{eq:29}.

Even when the parameter estimate is close, in some sense, to the boundary
of its parameter space, the standard error is not a meaningful measure
of uncertainty, because the uncertainty is in fact asymmetric.
Symmetric confidence intervals based on the standard error will
therefore also be misleading, and other techniques should be applied.


\subsubsection{The likelihood function}
\label{sec:likelihood-function}

The (log-)likelihood function can be used to obtain likelihood ratio
or likelihood root statistics for hypothesis tests, and it can be used
to construct confidence intervals with good properties.

The log-likelihood function for a model based on the binomial
distribution is given by
\begin{equation}
  \label{eq:12}
  \ell(p_c; x, n) = C + x \log p_c + (n - x) \log (1 - p_c),
\end{equation}
where $C = \log {n \choose x}$ is a constant with respect to $p_c$. The
log-likelihood function for $p_d$ or $d'$ is given by combining
eq.~\eqref{eq:12} with \eqref{eq:2} or \eqref{eq:6}.

In general, standard errors can be found as the square root of
the diagonal elements of the variance-covariance matrix of the
parameters. The variance-covariance matrix can be found as the inverse
of the negative Hessian matrix (the matrix of second order derivatives)
of the log-likelihood function evaluated at the ML estimates. Here
there is only one parameter (either one of $p_c$, $p_d$ or $d'$), so
the matrices are merely scalars.

It can be shown that the standard errors in eqs.~\eqref{eq:7},
\eqref{eq:9} and \eqref{eq:10} can also be obtained by
differentiating \eqref{eq:12} twice and using the chain rule to obtain
the standard errors of $p_d$ and $d'$.


\subsubsection{Confidence intervals}
\label{sec:confidence-intervals}

There are several general approaches to get CIs for parameters. One
general way that applies (with varying success) to almost all
parameters with a standard error is the traditional Wald interval:
\begin{equation}
  \label{eq:13}
  CI:~\hat\mu \pm z_{1 - \alpha/2} \textup{se}(\hat\mu),
\end{equation}
where $z_{1 - \alpha/2} = \Phi^{-1}(1 - \alpha/2)$ is the upper
$\alpha / 2$ quantile of the standard normal distribution. This CI
is based on the Wald statistic\footnote{actually the original
  definition used $\textup{se}(\mu_0)$ in the denominator.}:
\begin{equation}
  \label{eq:20}
  w(\mu_0) = (\hat\mu - \mu_0) / \textup{se}(\hat\mu).
\end{equation}
The CI may also be expressed more generally for a statistic $t(\mu_0)$ that
follows a standard normal distribution under the null hypothesis as:
\begin{equation}
  \label{eq:14}
  CI:~ \{\mu; |t(\mu)| < z_{1 - \alpha/2} \}.
\end{equation}
Using $w$ as $t$ in \eqref{eq:14} gives the interval \eqref{eq:13}.
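As a sketch, the Wald interval \eqref{eq:13} for $p_c$ in a
hypothetical 2-AFC experiment with 15 correct answers of 20 (made-up
data):

```r
## 95% Wald interval for p_c, eq. (13).
x <- 15; n <- 20
pc <- x / n
se <- sqrt(pc * (1 - pc) / n)   # eq. (7)
z <- qnorm(0.975)               # upper alpha/2 normal quantile
ci <- pc + c(-1, 1) * z * se    # approximately (0.560, 0.940)
```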

Another general approach is to use the likelihood root statistic
(the inverted likelihood ratio test), which applies to all likelihood-based
models and is almost always impressively successful.
The likelihood root statistic is given by:
\begin{equation}
  \label{eq:15}
  r(\mu_0) = \textup{sign}(\hat\mu - \mu_0) \sqrt{2 \left\{
    \ell(\hat\mu; x) - \ell(\mu_0; x) \right\} }
\end{equation}
Both the Wald and likelihood root statistics asymptotically follow
standard normal distributions under the null hypothesis. Even though
their asymptotic behavior is in fact identical, their finite
sample properties may be quite different and often favor the
likelihood root statistic since it removes nonlinear parameterization
effects.

A disadvantage of Wald intervals is that they are not invariant to
nonlinear transformations of the parameter. This means that a Wald CI
for $p_c$ and a Wald CI for $d'$ provide different kinds of evidence
about the parameters and could, for instance, lead to inclusion of
$p_g$ in the CI on the $p_c$ scale, but exclusion of $d' = 0$ on the
$d'$ scale. More generally, the Wald CI for $p_c$ cannot be found by
transforming the Wald CI limits for $d'$ through the psychometric
function. CIs based on the likelihood root statistic are, on the
other hand, invariant to nonlinear transformations of the
parameter. This means the likelihood CI for $d'$ can be found
either by computing the likelihood CI for $d'$ directly or by
transforming the limits of the likelihood CI for $p_c$ through the
inverse psychometric function --- they give the same answer. The evidence
provided by the likelihood CI is therefore invariant to the choice of
scale.

Another approach to generate CIs consistent across parameter scales
would be to compute an appropriate CI for, say, $p_c$ and then
transform the CI limits through the appropriate functions to obtain
CIs for $p_d$ and $d'$. For likelihood CIs this does not make any
difference, of course. If an appropriate CI can be computed on any one
scale, this would provide appropriate CIs on the other scales as
well. There exists a wide range of CIs for the binomial probability
parameter (refs), for instance the score interval and the so-called
exact interval in addition to the Wald and likelihood intervals.

The 'exact' binomial interval is obtained by inverting the 'exact'
binomial test and is known as the Clopper-Pearson interval
\citep{clopper34}. The lower and upper limits are defined as the
values of $p_c$ that solve:
\begin{equation}
  \label{eq:44}
  LL:~P(X \geq x) = \alpha / 2, \quad
  UL:~P(X \leq x) = \alpha / 2,
\end{equation}
where $X \sim \textup{binom}(p_c, n)$. Rather than solving these
equations numerically, the limits can be found directly as quantiles
of the beta
distribution, $\textup{Beta}(a, b)$: the lower limit is the
$\alpha / 2$ quantile of $\textup{Beta}(x, n - x + 1)$ and the upper
limit is the $1 - \alpha / 2$ quantile of $\textup{Beta}(x + 1, n -
x)$.
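A base-\textsf{R} sketch of the Clopper-Pearson limits via beta
quantiles (\texttt{qbeta} is the beta quantile function; the data, 15
correct of 20, are made up):

```r
## 95% Clopper-Pearson limits for p_c from x = 15 correct of n = 20.
x <- 15; n <- 20; alpha <- 0.05
lower <- qbeta(alpha / 2, x, n - x + 1)      # alpha/2 quantile of Beta(x, n-x+1)
upper <- qbeta(1 - alpha / 2, x + 1, n - x)  # 1-alpha/2 quantile of Beta(x+1, n-x)
```

By construction the lower limit solves $P(X \geq x) = \alpha/2$ in
eq.~\eqref{eq:44}, which can be checked with \texttt{pbinom}.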

Another commonly applied statistic is based on the normal
approximation of the binomial distribution. Asymptotically
$(X - n p_c) / \sqrt{n p_c (1 - p_c)}$ behaves like a standard normal
random variable, so we may use
\begin{equation}
  \label{eq:49}
  w^*(p_{c0}) = \frac{x - n p_{c0}}{\sqrt{n p_{c0} (1 - p_{c0})}},
\end{equation}
as test statistic. This statistic is in fact
identical to the Wald statistic~\eqref{eq:20} if $\textup{se}(\mu_0)$
is used in the denominator instead of $\textup{se}(\hat\mu)$.

The statistic $w^*$ is related to the Pearson $\chi^2$ statistic
\begin{equation}
  \label{eq:50}
  X^2(p_{c0}) = \frac{(x - n p_{c0})^2}{n p_{c0}} +
  \frac{(n - x - n (1 - p_{c0}))^2}{n(1 - p_{c0})}
\end{equation}
since $w^*$ is the signed square root of $X^2$. Similarly, the
likelihood root statistic $r(p_{c0})$ is related to the likelihood
ratio statistic
\begin{equation}
  \label{eq:51}
  G^2(p_{c0}) = 2 \left\{ x \log \frac{x}{n p_{c0}} + (n - x) \log \frac{n - x}{n(1 - p_{c0})} \right\}
\end{equation}
since $r(p_{c0})$ is the signed square root of $G^2(p_{c0})$.
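These relations are easy to verify numerically. The sketch below
computes the likelihood ratio statistic as $G^2 = 2\{\ell(\hat p_c; x)
- \ell(p_{c0}; x)\}$, cf.\ eq.~\eqref{eq:15}; the data, 17 correct of
30 under $p_{c0} = 1/3$, are made up:

```r
## Check that w* is the signed square root of X^2, and construct the
## likelihood root r as the signed square root of G^2.
x <- 17; n <- 30; p0 <- 1/3
wstar <- (x - n * p0) / sqrt(n * p0 * (1 - p0))
X2 <- (x - n*p0)^2 / (n*p0) + (n - x - n*(1 - p0))^2 / (n*(1 - p0))
phat <- x / n
G2 <- 2 * (x * log(x / (n*p0)) + (n - x) * log((n - x) / (n*(1 - p0))))
r <- sign(phat - p0) * sqrt(G2)   # likelihood root statistic
```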


\subsubsection{Sensory difference tests}
\label{sec:sens-diff-tests}

A sensory difference test is a test of
\begin{equation}
  \label{eq:16}
  H_0:
  \begin{array}{l}
    p_c \leq p_{c0} \\
    p_d \leq p_{d0} \\
    d' \leq d'_0   \\
  \end{array} \quad \textup{versus} \quad
  H_A:
  \begin{array}{l}
    p_c > p_{c0} \\
    p_d > p_{d0} \\
    d' > d'_0 \\
  \end{array},
\end{equation}
where the traditional test of no-difference is given by choosing
$p_{c0} = p_g$, $p_{d0} = 0$ and $d'_0 = 0$, making the null hypothesis
an equality rather than an inequality.

The $p$-value of a difference test is the probability of
observing a number of successes as large as or larger than that
observed, given the null hypothesis that the probability of a correct
answer is $p_{c0}$. The $p$-value based on the 'exact' binomial test
is therefore:
\begin{equation}
  \label{eq:21}
  p\textup{-value} = P(X \geq x) = 1 -
  \sum_{i=0}^{x-1} {n \choose i} p_{c0}^i (1 - p_{c0})^{n - i}~,
\end{equation}
where $X \sim \textup{binom}(p_{c0}, n)$.
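The sum in eq.~\eqref{eq:21} can be computed directly with
\texttt{pbinom}; the triangle data below, 17 correct of 30 under the
no-difference null $p_{c0} = 1/3$, are made up. In \textsf{sensR},
difference tests of this type are provided by the \texttt{discrim}
function.

```r
## 'Exact' binomial difference-test p-value, eq. (21).
x <- 17; n <- 30; pc0 <- 1/3
p.value <- pbinom(x - 1, n, pc0, lower.tail = FALSE)   # P(X >= x)
```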

The $p$-value for a difference test based on a statistic, $t(\mu_0)$,
that follows a standard normal distribution under the null hypothesis
is given by:
\begin{equation}
  \label{eq:47}
  p\textup{-value} = P\{Z \geq t(\mu_0)\} = 1 - \Phi\{ t(\mu_0) \},
\end{equation}
where $Z$ is a standard normal random variable and $\Phi$ is the
standard normal cumulative distribution function.


\subsubsection{Sensory similarity tests}
\label{sec:sens-simil-tests}

A sensory similarity test is a test of
\begin{equation}
  \label{eq:19}
  H_0:
  \begin{array}{l}
    p_c \geq p_{c0} \\
    p_d \geq p_{d0} \\
    d' \geq d'_0
  \end{array} \quad \textup{versus} \quad
  H_A:
  \begin{array}{l}
    p_c < p_{c0} \\
    p_d < p_{d0} \\
    d' < d'_0
  \end{array},
\end{equation}
where subject matter considerations and possibly power computations
will guide the choice of
$p_{c0}$, $p_{d0}$ or $d'_0$. Observe that $d'_0$ has to be positive
for the test to make sense.

The $p$-value of a similarity test is the probability of
observing a number of successes as small as or smaller than that
observed, given the null hypothesis that the probability of a correct
answer is $p_{c0}$.
The $p$-value based on the 'exact' binomial test
is therefore:
\begin{equation}
  \label{eq:22}
  p\textup{-value} = P(X \leq x) =
  \sum_{i = 0}^x {n \choose i} p_{c0}^i (1 - p_{c0})^{n - i}~,
\end{equation}
where $X \sim \textup{binom}(p_{c0}, n)$.
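The sum in eq.~\eqref{eq:22} is again a \texttt{pbinom} call. The
made-up triangle data below use the null $p_{d0} = 0.2$, which via
eq.~\eqref{eq:1} corresponds to $p_{c0} = 1/3 + (2/3) \cdot 0.2 = 7/15
\approx 0.467$:

```r
## 'Exact' binomial similarity-test p-value, eq. (22).
x <- 8; n <- 30; pc0 <- 7/15
p.value <- pbinom(x, n, pc0)   # P(X <= x)
```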

The $p$-value for a similarity test based on a statistic, $t(\mu_0)$,
that follows a standard normal distribution under the null hypothesis
is given by:
\begin{equation}
  \label{eq:48}
  p\textup{-value} = P\{Z \leq t(\mu_0)\} = \Phi\{ t(\mu_0) \},
\end{equation}
where $Z$ is a standard normal random variable.

%% If a meaningful difference is characterized by a one-in-five
%% probability of discrimination for a triangle test, then $p_{d0} = 0.2$,
%% so $p_{c0} = 1/3 + 2/3 p_{d0} = 7/15 \approx 0.467$.

\subsubsection{Confidence intervals and hypothesis tests}
\label{sec:conf-interv-hypoth}

Confidence intervals are often described by their relation to
hypothesis tests such that a two-sided hypothesis test should be
accompanied by a two-sided confidence interval and one-sided
hypothesis tests should be accompanied by one-sided confidence
intervals. This will make the $1 - \alpha$ level confidence interval
the region in which an observation would not lead to rejection of the
null hypothesis. A confidence interval should, however, provide more
than a rejection region; it should provide an interval in which we can
have confidence that the true parameter lies. This corresponds to the
interval which provides most support for the parameter. As such
confidence intervals should be two-sided even if the appropriate test
may be one-sided \citep{boyles08}. We will use two-sided confidence
intervals throughout and use these in conjunction with $p$-values from
one-sided difference and similarity tests. This is also implemented in
\textsf{sensR}.

Confidence intervals may, however, be one-sided in a slightly
different respect since it may happen, for instance, that the lower
confidence limit is at the guessing probability, $p_g$. If the
observed proportion of correct answers is less than $p_g$, the lower
confidence limit will also be higher than the observed proportion.

Confidence intervals may be degenerate in the sense that both limits
can be zero; this is obviously not very informative. This may happen
if, for instance, the observed proportion is below $p_g$ and $\alpha$
is large enough. For small enough $\alpha$, the upper confidence limit
for $d'$ will, however, exceed zero.

Confidence intervals can be used for difference and similarity
testing, as argued by \citet{MacRae95} and \citet{Carr95}, when it is
enough to know whether or not the null hypothesis is
rejected. Comparing the formulas for the 'exact' Clopper-Pearson
confidence limits \eqref{eq:44} with the formulas for $p$-values in
difference and similarity tests also based on the exact test, it is
clear that there is a close connection.

If $p_{c0}$ under $H_0$ is below the lower confidence limit in a $1 -
\alpha$ level interval, then the $p$-value of a difference test will
be below $\alpha/2$, i.e. the test will be significant at the
$\alpha/2$-level. Thus, if $p_{c0}$ is below the lower confidence
limit in a 90\% interval, then the difference test is significant at
the 5\% level. Similarly, if $p_{c0}$ is above the upper confidence
limit in a 90\% interval, then the similarity test is significant at
the 5\% level.

In difference testing the binomial test is not too liberal even if
there is variability in $p_d$ under the alternative hypothesis,
because there can be no variability under the null hypothesis that
$p_{d} = 0$. In similarity testing, however, $p_d > 0$ under $H_0$ and
the standard binomial test could possibly be liberal. Note also that
$p_d$ under $H_A$ will be less than $p_d$ under $H_0$, and if there is
variation in $p_d$ in the population, this variation could be larger
under $H_0$ than under $H_A$. Also, the power and sample size
computations in the following assume zero variability in
$p_d$. The power could be lower and the required sample sizes higher
if there really is variation in $p_d$ in the population.

The similarity tests discussed so far are targeted toward equivalence
in the population on average. There is no consideration of equivalence
on the level of individual discrimination.

A general problem with the discrimination testing outlined so far is
the assumption that all assessors have the same probability of
discrimination. This is hardly ever a priori plausible. The so-called
guessing model (refs) assumes that there are two kinds of assessors:
non-discriminators, who always guess, and true discriminators, who
always perceive the difference and discriminate correctly. This
assumption is also hardly ever a priori plausible. More plausible,
perhaps, is that the probability of discrimination has some
distribution across the population of assessors, as is assumed in the
chance-corrected beta-binomial distribution.


\subsubsection{Implementation in \textsf{sensR}}
\label{sec:implementation-2}

The function \texttt{rescale} that was described in
section~\ref{sec:implementation-1} has an additional optional argument
\texttt{std.err}, which allows one to obtain the standard error of,
say, $p_d$ and $d'$ if the standard error of $p_c$ is supplied. This
is done through application of eq.~\eqref{eq:9} and \eqref{eq:10} and
by using the user-visible function \texttt{psyderiv}, which implements
the derivative of the psychometric function, $f_{ps}'(d')$, for the
four common discrimination protocols:
<<rescaleExample2, results=verb, echo=TRUE>>=
rescale(pd = 0.2, std.err = 0.12, method = "triangle")
@

The \texttt{discrim} function is the primary function for inference in
the duo-trio, triangle, 2-AFC and 3-AFC protocols. Given the number of
correct answers, $x$, and the number of trials, $n$, \texttt{discrim}
will provide estimates, standard errors and confidence intervals on
the scale of $p_c$, $p_d$ and $d'$. It will also report the $p$-value
from a difference or similarity test of the user's choice. $p$-values
will be one-sided while confidence limits will be two-sided,
cf. section~\ref{sec:conf-interv-hypoth}. Confidence intervals are
computed on the scale of $p_c$ and then transformed to the $p_d$ and
$d'$ scales as discussed in section
\ref{sec:confidence-intervals}. The user can choose between several
statistics including the 'exact' binomial, likelihood, Wald and
score statistics. The score option leads to the so-called Wilson or
score interval, while the $p$-value is based on the $w^*$ statistic,
cf. eq.~\eqref{eq:49}.

Estimates and confidence intervals reported by \texttt{discrim}
respect the allowed range of the parameters, cf.~eq.~\eqref{eq:29},
and standard errors are not reported if the parameter estimates are on
the boundary of their parameter space (allowed range).

Strictly speaking the Wald statistic~\eqref{eq:20} is not defined when
$x/n \leq p_g$, since the standard error of $\hat p_c$ is not
defined. However, it makes sense to use
$\sqrt{\frac{x}{n} \left( 1 - \frac{x}{n} \right) \frac{1}{n}}$ as
standard error in this case. This is adopted in \texttt{discrim}.
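As a sketch, assuming hypothetical data with $x = 5$ correct answers
in $n = 15$ triangle trials, so that $x/n = 1/3 \leq p_g$, this
fallback standard error would be:

```r
## Fallback standard error of the estimate of p_c when x/n <= p_g:
x <- 5; n <- 15
phat <- x / n
se <- sqrt(phat * (1 - phat) / n)
se
```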

Similarity testing does not make sense if $p_{c0} = 0$ under the null
hypothesis, cf. eq.~\eqref{eq:19}, so a positive $p_{d0}$ has to be
chosen for similarity testing.

\paragraph{Example:}

Suppose we have performed a 3-AFC discrimination test and observed 10
correct answers in 15 trials. We want estimates of $p_c$, $p_d$
and $d'$, their standard errors and 95\% confidence intervals. We are
also interested in the difference test of no difference and decide to
use the likelihood root statistic for confidence intervals and tests.
Using the \texttt{discrim} function in \textsf{R} we obtain:
<<discrimExample1, results=verb, echo=TRUE>>=
discrim(10, 15, method = "threeAFC", statistic = "likelihood")
@

If instead we had observed 4 correct answers in 15 trials and were
interested in the similarity test with $p_{d0} = 1/5$ under the null
hypothesis, then using the 'exact' binomial criterion for confidence
intervals and tests we get:
<<discrimExample2, results=verb, echo=TRUE>>=
discrim(4, 15, method = "threeAFC", test = "similarity", pd0 = 0.2,
        statistic = "exact")
@

A few auxiliary methods for \texttt{discrim} objects are
available. \texttt{confint} returns the confidence intervals computed
in the \texttt{discrim} object, \texttt{profile} extracts the
(profile) likelihood function and \texttt{plot.profile} plots the
likelihood function.

\paragraph{Example}

To illustrate the auxiliary methods consider the 3-AFC example above
where 10 correct answers were observed in 15 trials.
<<discrimExample3, results=verb, echo=TRUE, fig=TRUE, include=FALSE>>=
fm1 <- discrim(10, 15, method = "threeAFC", statistic = "exact")
confint(fm1)
plot(profile(fm1))
@
The resulting graph is shown in
Fig.~\ref{fig:plotProfileExample}. Observe that the (profile)
likelihood function may be extracted from a \texttt{discrim} object
even if it is not fitted with \texttt{statistic =
  "likelihood"}. Further information about the use and interpretation
of (profile) likelihood curves in sensory experiments is given by
\citet{brockhoff10} and \citet{christensen09}.

\setkeys{Gin}{width=.49\textwidth}
\begin{figure}
  \centering
<<discrimExample3Fig, fig=TRUE, include=TRUE>>=
<<discrimExample3>>
@
  \caption{Relative likelihood function for a 3-AFC experiment with 10
correct answers in 15 trials. The maximum likelihood estimate is at
$d' = 1.12$ and the two horizontal lines determine the 95\% and 99\%
likelihood based confidence intervals.}
\label{fig:plotProfileExample}
\end{figure}


\subsection{Sample size and power calculations for simple
  discrimination protocols}
\label{sec:sample-size-power}

The power of a test is the probability of obtaining a significant
result with a particular test at a given significance level when a
particular difference is actually present. In other words, it is the
probability of detecting a difference that is actually there.
Power and sample size calculations require that a model under the null
hypothesis and a model under the alternative hypothesis are decided
upon. The null model is often implied by the null hypothesis and is
used to calculate the critical value. The alternative model has to lie
under the alternative hypothesis and involves a subject matter
choice. Power is then calculated for that particular choice of
alternative model.

In the following we will consider
calculation of power and sample size based directly on the binomial
distribution. Later we will consider calculations based on a normal
approximation and based on simulations.

\subsubsection{The critical value}
\label{sec:critical-value}


Formally the critical value, $x_c$, of a one-sided binomial test where
the alternative hypothesis is \emph{difference}, or equivalently
\emph{greater}, is the smallest integer that satisfies
\begin{equation}
  \label{eq:30}
  P(X \geq x_c) \leq \alpha \quad \textup{where}
  \quad X \sim \textup{binom}(p_{c0}, n)
\end{equation}
and $p_{c0}$ is the probability of a correct answer under the null
hypothesis.
Similarly the critical value, $x_c$, of a one-sided binomial test where
the alternative hypothesis is \emph{similarity}, or equivalently
\emph{less}, is the largest integer that satisfies
\begin{equation}
  \label{eq:31}
  P(X \leq x_c) \leq \alpha \quad \textup{where}
  \quad X \sim \textup{binom}(p_{c0}, n)
\end{equation}

If the sample size is small for the desired $\alpha$, a critical value
satisfying~\eqref{eq:30} or \eqref{eq:31} may not exist. In a
difference test it may not be enough to observe $x = n$ correct
answers, i.e.\ all answers correct, for the test to be significant at
the required $\alpha$. Similarly, it may not be enough to observe no
correct answers ($x = 0$) for the similarity test to be significant at
the required $\alpha$.

A simple way to compute $x_c$ is to use a small while loop (shown here
for a difference test):
\begin{algorithmic}
  \STATE $i = 0$
  \WHILE{$P(X > i) > \alpha$}
  \STATE $i = i + 1$
  \ENDWHILE
  \RETURN $i + 1$
\end{algorithmic}
However, if $x_c$ is a large number, many iterations of the loop would
be required, so in the \texttt{findcr} function in package
\textsf{sensR}, eq.~\eqref{eq:30} and \eqref{eq:31} are instead solved
numerically for $x_c$. One complication
with this method is that $P(X \geq x_c)$ is discontinuous in $x_c$,
which requires special attention.


\paragraph{Example:}

Consider the situation that $X = 15$ correct answers are observed out
of $n = 20$ trials in a duo-trio test.
The exact binomial $p$-value of a no-difference test is
$P(X \geq 15) = 1 - P(X \leq 15-1) = 0.021$, where $X \sim
\textup{binom}(0.5, 20)$ so this is significant. If on
the other hand we had observed $X = 14$, then the $p$-value would have
been $P(X \geq 14) = 0.058$, which is not
significant. We say that $x_c = 15$
is the \emph{critical value} for this particular test on the
$\alpha = 5\%$ significance level because $x_c = 15$ is the smallest
number of correct answers that renders the test significant.

In \textsf{R} we can find the $p$-values with
<<criticalValueExample1, results=verb, echo=TRUE>>=
1 - pbinom(q = 15 - 1, size = 20, prob = 0.5)
1 - pbinom(q = 14 - 1, size = 20, prob = 0.5)
@

The while loop looks like
<<criticalValueExample2, results=verb, echo=TRUE>>=
i <- 0
while (1 - pbinom(q = i, size = 20, prob = 0.5) > 0.05)
  {
    i <- i + 1
  }
i + 1
@

while we could also use the \texttt{findcr} function in package
\textsf{sensR}:
<<criticalValueExample3, results=verb, echo=TRUE>>=
findcr(sample.size = 20, alpha = 0.05, p0 = 0.5)
@



\subsubsection{The power of difference tests}
\label{sec:power-diff-tests}

The power of a difference test is
\begin{equation}
  \label{eq:32}
  \textup{power} = P(X \geq x_c) \quad \textup{where}
  \quad X \sim \textup{binom}(p_{cA}, n) ,
\end{equation}
where $p_{cA}$ is the probability of a correct answer under the
alternative hypothesis and $x_c$ is the critical value of the test,
which depends on the probability of a correct answer under the null
hypothesis and the significance level, $\alpha$.

Power increases with the difference between $p_{c0}$ and $p_{cA}$,
with the sample size and with $\alpha$. Power can be computed directly
once the critical value, $p_{cA}$
and $n$ are known, so the only computational challenge lies in the
computation of the critical value.

\paragraph{Example:}

The power of the test considered in the previous example is the
probability of getting this $p$-value or a smaller one. This depends
on the actual sensory difference between the objects, or equivalently
on the proportion of discriminators. If half the population are
discriminators, or equivalently if each assessor has a 50\% chance of
correctly discriminating a set of samples, then
$p_c = 1/2 + 1/2 \cdot p_d = 3/4$. The power is the probability of
observing 15 or more correct answers:
\begin{equation}
  \label{eq:24}
  \textup{power} = P(X \geq 15) = 1 - P(X \leq 15-1) = 0.617
  \quad \textup{where} \quad X \sim \textup{binom}(3/4, 20)
\end{equation}
This can be obtained in \textsf{R} with
<<differencePowerExample1, results=verb, echo=TRUE>>=
1 - pbinom(q = 15 - 1, size = 20, prob = 3/4)
@
or directly using the \texttt{discrimPwr} function from
\textsf{sensR}:
<<differencePowerExample2, results=verb, echo=TRUE>>=
discrimPwr(pdA = 0.5, sample.size = 20, alpha = 0.05,
           pGuess = 1/2)
@
Observe that \texttt{discrimPwr} requires that the effect size under
the alternative hypothesis is given in terms of $p_d$ rather than
$p_c$ or $d'$. If the effect size under the alternative hypothesis is
formulated in terms of $d'$, then \texttt{rescale} can be used to convert
from $d'_A$ to $p_{dA}$, but it would be easier to use
\texttt{d.primePwr}, which accepts $d'_A$ directly and internally
calls \texttt{discrimPwr}.

If the significance test of interest is not that of no-difference, but
that of a small difference versus a relevant difference, the
computation of the critical value is slightly different, while the
power calculation remains essentially the same.

If the limit between irrelevant and relevant differences is at $p_d =
0.1$, so $p_c = 1/2 + 1/2 \cdot 0.1 = 0.55$, then
$P(X \geq 16| p_{c0} = 0.55, n = 20) = 1 - P(X \leq 16-1) = 0.019$
while
$P(X \geq 15| p_{c0} = 0.55, n = 20) = 1 - P(X \leq 15-1) =
0.055$. The critical value is therefore 16 and the power of the test
is
\begin{equation}
  \label{eq:25}
  \textup{power} = P(X \geq 16) =
  %% 1 - P(X \leq 16-1) =
  0.415
  \quad \textup{where} \quad X \sim
  \textup{binom}(p_{cA} = 3/4, n = 20)
\end{equation}
In \textsf{R} we could get the power of this test with
<<differencePowerExample3, results=verb, echo=TRUE>>=
discrimPwr(pdA = 0.5, pd0 = 0.1, sample.size = 20, alpha = 0.05,
           pGuess = 1/2)
@
Note the \texttt{pd0} argument which should match the value of $p_d$
under the null hypothesis.

\subsubsection{The power of similarity tests}
\label{sec:power-simil-tests}

The power of a similarity test is
\begin{equation}
  \label{eq:32b}
  \textup{power} = P(X \leq x_c) \quad \textup{where}
  \quad X \sim \textup{binom}(p_{cA}, n) ,
\end{equation}
and $p_{cA}$ is the probability of a correct answer under the
alternative hypothesis and $x_c$ is the critical value of the test,
which depends on the probability of a correct answer under the null
hypothesis and the significance level, $\alpha$.

\paragraph{Example:}

Assume that we want to calculate the power of a similarity test using
the duo-trio protocol with $n = 100$, and that we want to show that
the probability of discrimination is less than 1/3,
while we believe that there is actually no difference between the
objects, so the true probability of discrimination is zero.
The null hypothesis is
therefore $H_0:~ p_c \geq 1/2 + 1/2 \cdot 1/3 = 2/3$ and the alternative
hypothesis is $H_A:~ p_c < 2/3$.
The critical value of this test is
$x_c = 58$ since
$p = P(X \leq 58 | p_c = 2/3, n = 100) = 0.042 \leq 0.05$
while $P(X \leq 59) = 0.064 > 0.05$. The power of this test is
therefore
\begin{equation}
  \label{eq:26}
  \textup{power} = P(X \leq 58 | p_c = 0.5, n = 100) = 0.956
\end{equation}
We would compute this power in \textsf{R} with
<<similarityPowerExample1, results=verb, echo=TRUE>>=
discrimPwr(pdA = 0, pd0 = 1/3, sample.size = 100, alpha = 0.05,
           pGuess = 1/2, test = "similarity")
@
If in fact there is a small difference between the objects, so that
there is a positive probability of discrimination, say $p_d = 1/5$,
then the power is (the critical value remains the same):
\begin{equation}
  \label{eq:27}
  \textup{power} = P(X \leq 58 | p_c = 0.5(1 + 1/5), n = 100) = 0.377
\end{equation}
We would compute this power in \textsf{R} with
<<similarityPowerExample2, results=verb, echo=TRUE>>=
discrimPwr(pdA = 1/5, pd0 = 1/3, sample.size = 100, alpha = 0.05,
           pGuess = 1/2, test = "similarity")
@
Observe how the power of the similarity test is quite good if there is
absolutely no observable difference between the objects, while if
there is in fact a small probability that a difference can be
observed, the power is horrible and the sample size far from
sufficient.

\subsubsection{Power calculation based on simulations}
\label{sec:power-calc-simul}

In more complicated models it is not possible to derive an explicit
expression for the power of a test, and calculation of power based on
simulations can be an attractive approach. Sometimes it may also just
be easier to let the computer do the job by running simulations rather
than to get bogged down in derivations of explicit expressions for
power, even when these are in fact possible to derive.

Recall that power is the probability of getting a significant result
when there is in fact a difference, thus in the long run it is the
proportion of significant results to the total number of tests:
\begin{equation}
  \label{eq:28}
  \textup{power} = \frac{\textup{no.~} p\textup{-values} < \alpha}
  {\textup{no.~tests}}
\end{equation}

We can let the computer generate random data from the model under the
alternative hypothesis and then
perform the significance test. Repeating this many times and recording
the $p$-values allows us to calculate the power via
eq.~\eqref{eq:28}. In the following we will do exactly that for a
binomial test for which we know the right answer.

<<powerSimul>>=
## Simulating power of a duo-trio no-difference test:
set.seed(12345)
xx <- rbinom(10000, 20, 3/4)
sum(xx == 0) ## 0
pp <- 1 - pbinom(xx-1, 20, 1/2)
sum(pp <= 0.05)
(power <- mean(pp <= 0.05)) # 0.6184

## Uncertainty (standard error of power):
(se.power <- sqrt(power * (1 - power) / 1e4))
## confidence interval:
power + c(-1,1) * qnorm(.975) * se.power
@

Consider the no-difference example above in
section~\ref{sec:power-diff-tests} where $n = 20$ and the power of a
no-difference test was 0.617 when $p_d = 1/2$, so $p_c = 3/4$. We will
estimate the power via simulation by generating 10,000 (pseudo) random
draws, $X_i$, $i = 1,\ldots, 10,000$ from
$X_i \sim \textup{binom}(p_c = 3/4, n = 20)$. For each of
these draws we calculate the $p$-value as
$p_i = P(X \geq x_i \,|\, p_c = 1/2, n = 20)$. Among these $p$-values,
6184 were at or below 0.05, so the power estimated by simulation is
0.6184. Observe that this is close to, but not exactly, the power that
we obtained analytically (0.617). If we did the power calculation over
again, we would most likely get a slightly different power estimate
although probably also close to 0.617 because we would obtain a
slightly different set of random draws.
This illustrates that although power calculation via simulation is
simple, the result varies a little from one run to another.

Fortunately we can estimate the uncertainty in the estimated power from
standard binomial principles. The standard error of the estimated
power is
$\textup{se}(\hat{\textup{power}}) = \sqrt{\textup{power} ( 1 -
  \textup{power}) / n_{\textup{sim}}} = \sqrt{0.6184(1 -
  0.6184)/10{,}000} = 0.0049$ and an approximate Wald 95\% CI for the
estimated power is $[0.609; 0.628]$, which covers the true value
(0.617) as one would expect.

\subsubsection{Power calculation based on the normal approximation}
\label{sec:power-calc-normal-approx}

An often used approximation for power and sample size calculations is
the normal approximation; the idea is to use a statistic that
asymptotically follows a standard normal distribution. For a binomial
parameter, power and sample size calculations may be based on the Wald
statistic~\eqref{eq:20}, as for example described by \citet{lachin81}
and advocated by \citet{bi06} in a sensometric context. We are not
aware of any numerical assessments of the accuracy of the normal
approximation for power and sample size calculations, but we may
expect that for small $n$ or $p$ (under the null or alternative) close
to one, the approximation may be rather inaccurate. Since power and
sample size determinations are readily available for the exact
binomial test, we see no reason to use approximate statistics with
doubtful properties for these purposes.

Consider the following hypotheses for a binomial parameter:
\begin{equation}
  \label{eq:46}
  H_0:~ p = p_0 \quad \textup{versus} \quad H_A:~ p > p_0,
\end{equation}
then under the null hypothesis approximately
\begin{equation}
  \label{eq:52}
  \frac{\hat p - p_0}{\sigma_0} \sim N(0, 1)
\end{equation}
and under the alternative hypothesis approximately
\begin{equation}
  \label{eq:53}
  \frac{\hat p - p_A}{\sigma_A} \sim N(0, 1) ,
\end{equation}
where $p_A$ is the probability under the alternative hypothesis,
$\sigma_0 = \sqrt{p_0(1 - p_0)/n}$,
$\sigma_A = \sqrt{p_A(1 - p_A)/n}$ and $\hat p = X/n$ is the estimator
of a binomial parameter. The critical point above which the
null hypothesis is rejected is then
\begin{equation}
  \label{eq:54}
  \frac{\hat p - p_0}{\sigma_0} > \Phi^{-1}(1 - \alpha) = z_{1 - \alpha}
\end{equation}
i.e. when
\begin{equation}
  \label{eq:55}
  \hat p > z_{1 - \alpha} \sigma_0 + p_0 .
\end{equation}
Under $H_A$ the null hypothesis is rejected if
\begin{equation}
  \label{eq:56}
  \frac{\hat p - p_A}{\sigma_A} > \frac{z_{1 - \alpha} \sigma_0 + p_0
    - p_A}{\sigma_A}
\end{equation}
and the power is
\begin{equation}
  \label{eq:57}
  \textup{power} = P\left( Z > \frac{z_{1 - \alpha} \sigma_0 + p_0 -
      p_A}{\sigma_A}\right ) =
  1 - \Phi \left( \frac{z_{1 - \alpha} \sigma_0 + p_0 - p_A}{\sigma_A}
  \right )
\end{equation}
Analogous considerations for the similarity (equivalence) hypothesis lead to
\begin{equation}
  \label{eq:59}
  \textup{power} = P\left( Z < \frac{z_{\alpha} \sigma_0 + p_0 -
      p_A}{\sigma_A}\right ) =
  \Phi \left( \frac{z_{\alpha} \sigma_0 + p_0 - p_A}{\sigma_A}
  \right )
\end{equation}
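As a numerical check of eq.~\eqref{eq:59}, consider the duo-trio
similarity example of section~\ref{sec:power-simil-tests} with
$n = 100$, $p_0 = 2/3$ and $p_A = 1/2$, so
$\sigma_0 = \sqrt{(2/3)(1/3)/100} \approx 0.0471$,
$\sigma_A = \sqrt{(1/2)(1/2)/100} = 0.05$ and
$z_{0.05} \approx -1.645$:
\begin{equation*}
  \textup{power} \approx \Phi \left( \frac{-1.645 \cdot 0.0471 + 2/3 -
      1/2}{0.05} \right) = \Phi(1.78) \approx 0.96 ,
\end{equation*}
which is close to the power of 0.956 obtained from the exact binomial
test.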


Isolating $n$ in eq.~\eqref{eq:57} leads to the following expression
for the sample size of difference tests:
\begin{equation}
  \label{eq:58}
  \textup{sample size} = \left ( \frac{z_\beta \sqrt{p_A(1 - p_A)} - z_{1 -
        \alpha}\sqrt{p_0(1 - p_0)}}{p_0 - p_A} \right )^2 ,
\end{equation}
where $z_\beta = \Phi^{-1}(1 - \textup{power})$.
Equivalently for similarity tests:
\begin{equation}
  \label{eq:60}
  \textup{sample size} = \left ( \frac{z_{1 - \beta} \sqrt{p_A(1 -
        p_A)} - z_{\alpha} \sqrt{p_0(1 - p_0)}}{p_0 - p_A} \right )^2 ,
\end{equation}
where $z_{1 - \beta} = \Phi^{-1}(\textup{power})$. The sample sizes
given by \eqref{eq:58} and \eqref{eq:60} should be rounded up to the
nearest integer.
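Eq.~\eqref{eq:58} is easy to evaluate directly. The following
\textsf{R} sketch implements it; the function name \texttt{ss\_normal}
is made up for illustration and is not part of \textsf{sensR}:

```r
## Normal-approximation sample size for a difference test, eq. (58).
## p0, pA: probabilities of a correct answer under H0 and HA.
ss_normal <- function(p0, pA, alpha = 0.05, power = 0.80) {
  z_beta <- qnorm(1 - power)   ## z_beta = Phi^{-1}(1 - power)
  z_alpha <- qnorm(1 - alpha)  ## z_{1 - alpha}
  n <- ((z_beta * sqrt(pA * (1 - pA)) -
         z_alpha * sqrt(p0 * (1 - p0))) / (p0 - pA))^2
  ceiling(n)  ## round up to the nearest integer
}
## e.g. a duo-trio difference test with p0 = 1/2 and pA = 3/4:
ss_normal(p0 = 1/2, pA = 3/4)
```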


\subsubsection{Sample size determination}
\label{sec:sample-size-determ}

In principle sample size determination is simple;
find the sample size such that the power is sufficiently high
for a particular test at some significance level given some true
difference. Computationally, however, it can be a challenge.

Formally, the required sample size for a sensory difference test
is the smallest integer, $n^*$, that satisfies
\begin{equation}
  \label{eq:33}
  P(X \geq x_c) \geq \textup{target-power} \quad \textup{where}
  \quad X \sim \textup{binom}(p_c, n^*) ,
\end{equation}
and $P(X \geq x_c)$ is the \emph{actual power} of the test. Power
for a difference test only increases with increasing sample size if
the true difference, $p_d$ is larger than the null difference, $p_{d0}$,
so it is a requirement that the value of $p_d$ specified as the true
difference is actually covered by the alternative hypothesis.

Similarly, the required sample size for a similarity test is the
smallest integer, $n^*$, that satisfies
\begin{equation}
  \label{eq:34}
  P(X \leq x_c) \geq \textup{target-power} \quad \textup{where}
  \quad X \sim \textup{binom}(p_c, n^*) ,
\end{equation}
and $P(X \leq x_c)$ is the \emph{actual power} of the test. Power
only increases with increasing sample size if the true difference,
$p_d$ is less than the null difference, $p_{d0}$, so as for difference
tests, the value specified as the true difference has to be covered by
the alternative hypothesis.

The sample size depends on the particulars of the null and alternative
hypotheses as well as the significance level of the test,
i.e. $\alpha$ and the desired minimum power; the \emph{target-power}.

So much for the formal definitions: practical sample size
determination is in fact not as simple as the definitions may lead one
to believe. Consider a situation in which we want to know which sample
size to choose in a difference test using the triangle protocol where
the null hypothesis is no difference, target power is 0.80, and we
believe the actual difference is $d' = 0.9$ under the alternative
hypothesis. Standard sample size calculations under the
definition~\eqref{eq:33} tell us that 297 tests are enough; this
leads to an actual power of 0.802. However, had we decided to use,
say, 300 tests---for convenience and just to be on the safe side---the
power of the test is only 0.774; much less than the power with 297
tests and below our target power. This is truly worrying; how many
samples do we need to be sure that all larger sample sizes also lead
to a power above 0.80? It is natural to expect power to
increase with every increase in sample size (a monotonic increase in
power with sample size), but this is not the case as is illustrated in
Fig.~\ref{fig:powerSampleSize}.

Power generally increases with the sample size, but it does so in a
zig-zag way due to the discreteness of the binomial distribution. As
is seen
in Fig.~\ref{fig:powerSampleSize}, the smallest sample size for which
power is higher than 0.80 is 297 (actual power = 0.802). The next
sample size that gives a power above 0.80 is 302, but the actual power
is now less than 0.801. We would need 305 samples (actual power =
0.806) to obtain a power that is
higher than the power with 297, and no less than 318 samples (actual
power = 0.802) if no larger sample size should lead to a power less
than 0.80.

<<sampleSizeFigure>>=
ss <- 275:325
(pd <- coef(rescale(d.prime = .9, method = "triangle"))$pd)
pwr <- sapply(ss, function(x)
              discrimPwr(pdA = pd, sample.size = x, pGuess = 1/3))
pwrN <- sapply(ss, function(x)
               discrimPwr(pdA = pd, sample.size = x, pGuess = 1/3,
                          statistic = "normal"))

@

\setkeys{Gin}{width=.6\textwidth}
\begin{figure}
  \centering
<<sampleSizeFigurePlot, fig=TRUE>>=

plot(range(ss), c(0.74, max(pwrN)), type = "n", xlab = "sample size",
     ylab = "power", axes = FALSE)
lines(ss, pwr, type = "l")
points(ss, pwr, pch = 20, cex = .8)
abline(h = 0.8, lty = 2)
lines(ss, pwrN, col = "blue")
legend("topleft", legend = c("exact binomial", "normal approx.",
                    "target power"), lty = c(1, 1, 2),
       pch = c(20, NA, NA), col = c(1, "blue", 1), pt.cex = .8)
## axis(2, at = c(0.72, 0.75, 0.80, 0.85))
axis(2, las = 1)
axis(1, at = c(275, 297, 297+21, 325))
segments(297, 0.6, 297,
         discrimPwr(pd, sample.size = 297, pGuess = 1/3), lty = 3)
segments(297+21, 0.6, 297+21, discrimPwr(pd, sample.size = 297+21,
                                         pGuess = 1/3), lty = 3)

@
  \caption{The relation between sample size and power for a difference
test with the triangle protocol. The null hypothesis is that of no
difference and $d' = 0.9$ is assumed under the alternative model. A
power of 0.80 is desired. }
\label{fig:powerSampleSize}
\end{figure}

Even though an increase in sample size may lead to a decrease in
power, it will then lead to a decrease in the actual $\alpha$-level
instead. This occurs because the critical value of the test is
piece-wise constant as a function of the sample size,
cf. Fig.~\ref{fig:xcrAlpha}.

\setkeys{Gin}{width=.49\textwidth}
\begin{figure}
  \centering
<<xcrFig, fig=TRUE, height=3.5, width=5.5>>=
xcr <- sapply(ss, function(ss) findcr(ss, p0 = 1/3))
plot(ss, xcr, type = "l", axes = FALSE,
     ylab = "Critical value", xlab = "Sample size")
points(ss, xcr, pch = 20)
axis(1); axis(2)

@
<<alphaFig, fig=TRUE, height=3.5, width=5.5>>=
aa <- sapply(ss, function(n)
             1 - pbinom(findcr(n, p0 = 1/3) - 1, n, 1/3))
plot(ss, aa, pch = 20, ylim = c(0.03, .05), axes = FALSE,
     ylab = "Actual alpha-level", xlab = "Sample size")
lines(ss, aa, lty = 1)
axis(1); axis(2)

@
  \caption{Left: critical value of the test as a function of the
    sample size. Right: actual $\alpha$-level of the test as a
    function of the sample size.}
\label{fig:xcrAlpha}
\end{figure}

The sample size for the exact binomial test may be computed with much
the same while loop that could also be used to find the critical value
(cf. section~\ref{sec:critical-value}):
\begin{algorithmic}
  \STATE $i = 1$
  \WHILE{$\textup{actual power}(i) < \textup{target power}$}
  \STATE $i = i + 1$
  \ENDWHILE
  \RETURN $i$
\end{algorithmic}
where the actual power depends on the hypothesis, cf.~\eqref{eq:33}
and \eqref{eq:34}. The problem with this approach is that if the
required sample size is large, it may take some time to reach it;
recall that at every evaluation of the actual power, the critical
value has to be determined. Due to the non-monotonic relationship
between power and sample size (cf. Fig.~\ref{fig:powerSampleSize}),
it is not possible simply to solve numerically for the required
sample size.
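As a minimal illustration, this while loop may be sketched directly
in R. The sketch below uses \texttt{qbinom} in place of
\texttt{findcr} to obtain the critical value, so it is a simplified
stand-in for the actual implementation rather than the \textsf{sensR}
code itself; \texttt{pcA} denotes the probability of a correct answer
under the alternative hypothesis.

<<ssWhileSketch, eval=FALSE>>=
## Sketch of the sample-size while loop for a difference test
## (H0: pc = pg); a simplified stand-in for the actual
## implementation in sensR.
ssWhile <- function(pcA, pg = 1/3, target.power = 0.8,
                    alpha = 0.05, start = 1) {
  n <- start
  repeat {
    ## critical value: smallest x with P(X >= x | pg) <= alpha
    xcr <- qbinom(1 - alpha, size = n, prob = pg) + 1
    ## actual power under the alternative probability pcA
    power <- 1 - pbinom(xcr - 1, size = n, prob = pcA)
    if (power >= target.power) return(n)
    n <- n + 1
  }
}
@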

An improvement over the simple while loop is suggested by the normal
approximation to the required sample size, shown in blue in
Fig.~\ref{fig:powerSampleSize}. This approximation appears to
consistently under-estimate the required sample size. For the example
considered here, the normal approximation estimates that 291 samples
are enough to obtain a power of 0.80 (actual power = 0.8001). The
while loop could therefore be started at $i = 291$ rather than at $i
= 1$. A problem with this approach is that the normal approximation
is not always strictly liberal. The function \texttt{discrimSS} in
package \textsf{sensR} therefore uses a compromise: the while loop is
started at one if the sample size estimated by the normal
approximation is less than 50; otherwise it is started at 90\% of the
normal approximation estimate, and sometimes even lower if
necessary. If the normal approximation estimate is larger than
10,000, the function reports this and does not attempt to estimate
the sample size, due to the expected long computation time. In
addition to the sample size for the `exact' binomial test, the sample
size based on the normal approximation may also be requested.
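The normal approximation referred to above is, in essence, the
standard sample-size formula for a one-sample binomial test. A
sketch is shown below; the exact computation in \texttt{discrimSS}
may differ in details.

<<normalSSsketch, eval=FALSE>>=
## Standard one-sample binomial sample-size formula (a sketch;
## the exact computation in discrimSS may differ in details).
## p0: probability under H0 (here the guessing probability);
## pcA: probability of a correct answer under the alternative.
nNormal <- function(pcA, p0 = 1/3, target.power = 0.8,
                    alpha = 0.05) {
  za <- qnorm(1 - alpha)       # one-sided test
  zb <- qnorm(target.power)
  ceiling(((za * sqrt(p0 * (1 - p0)) +
            zb * sqrt(pcA * (1 - pcA))) / (pcA - p0))^2)
}
@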

\paragraph{Example:}

Consider the example illustrated in Fig.~\ref{fig:powerSampleSize}:
we want to know the sample size for a difference test with the
triangle protocol where the null hypothesis is that of no
difference. We want a power of 0.80, take $\alpha = 0.05$ and assume
that the actual difference under the alternative hypothesis is $d' =
0.9$. Using package \textsf{sensR} we may obtain the sample size for
the exact binomial test with
<<exactSSexample, results=verb, echo=TRUE>>=
(pd <- coef(rescale(d.prime = .9, method = "triangle"))$pd)
discrimSS(pdA = pd, pd0 = 0, target.power = 0.8, alpha = 0.05, pGuess
          = 1/3, test = "difference", statistic = "exact")
@
We could also obtain the normal approximation with
<<normalSSexample, results=verb, echo=TRUE>>=
discrimSS(pdA = pd, pd0 = 0, target.power = 0.8, alpha = 0.05, pGuess
          = 1/3, test = "difference", statistic = "normal")
@


%% \subsection{Related literature}

%% \section{The 2-AC discrimination and preference protocol}
%%
%% The statistical methodology for the 2-AC protocol is treated in detail
%% in (refto 2-AC paper), and we refer to this source for more
%% information. $\square \bs \s$ and $\blacksquare$ and \ldots
%%
%% \begin{table}
%%   \centering
%%   \caption{Possible outcomes in the 2-AC protocol.
%%     \label{tab:twoACoutcomes}}
%%   \begin{tabular}{lccccccc}
%%     \hline
%%     & \multicolumn{7}{c}{cases} \\
%%     \cline{2-8}
%%     & 0 & 1 & 2 & 3 & 4 & 5 & 6 \\
%%     & $\bs \bs \bs$ & $\bs \s \s$ & $\s \bs \s$ & $\s \s \bs$ & $\bs
%%     \bs \s$ & $\s \bs \bs$ & $\bs \s \bs$ \\
%%     \hline
%%     $\ell(\tau, d'_0)$ & --- & --- & 0 & --- & --- & --- & --- \\
%%     $\ell(\hat\tau, \hat d')$ & --- & 0 & 0 & 0 & --- & --- & --- \\
%%     $\hat\tau$ & --- & 0 & NA & 0 & NA & NA &  0 \\
%%     $\tau_0$ & --- & 0 & NA & 0 & --- & --- & 0 \\
%%     $\hat d'$ & --- & $-\infty$ & NA & $\infty$ & $-\infty$ & $\infty$
%%     & --- \\
%%     var-cov & --- & NA & NA & NA & NA & NA & NA \\
%%     \hline
%%   \end{tabular}
%% \end{table}
%%
%% Write about
%% \begin{itemize}
%% \item Power computations and why the error is at most $\varepsilon$
%% \item How the likelihood root statistic when $d'_0 = 0$ is identical
%%   to the likelihood root statistic from the 2-AFC test. This implies
%%   that the outcome in the second cell is irrelevant for judging
%%   deviations from $d' = 0$ and routines for the 2-AFC protocol can be
%%   used. However, whenever $d'_0 \neq 0$ the test statistics differ and
%%   the 2-AFC routines cannot be used.
%% \item In which of the seven cases is tau constant no matter the value
%%   of $d'$? - for $d'_0$ and $\hat d'$ separately? What is the value of
%%   $\tau$ in those cases?
%% \end{itemize}
%%

%% \section{A-not A and same-different protocols}
%% \label{sec:not-same-different}
%%
%%
%% \section{Replicated simple discrimination protocols}
%% \label{sec:repl-simple-discr}
%%
%% \subsection{The beta-binomial model}
%% \label{sec:beta-binomial-model}
%%
%% \marginpar{introduce data and model here}
%%
%% The beta-binomial model is derived from the marginal distribution of
%% the binomial response, $X_i$, when the probabilities, $\pi_i$ are
%% allowed to have a beta-distribution on the interval $(0; 1)$.
%%
%% The likelihood function for the beta-binomial model is
%% \begin{equation}
%%   \label{eq:17}
%%   \ell(\alpha, \beta; x, n) = - N \log Beta(\alpha, \beta) +
%%   \sum_{i=1}^N \log Beta(\alpha + x_i, \beta - x_i + n_i)
%% \end{equation}
%% where $i = 1, \ldots, N$ is the number of independent binomial
%% observations each with $x_i$ successes out of $n_i$ trials, $Beta$ is
%% the beta function with parameters $\alpha$ and
%% $\beta$.
%% The beta-binomial model can be parameterized in terms of the mean
%% binomial parameter, $\mu$ and the degree of over-dispersion, $\gamma$
%% both living in the interval $(0; 1)$. The relation to the $(\alpha,
%% \beta)$ parameters is:
%% \begin{equation}
%%   \label{eq:18}
%%   \mu = \alpha/(\alpha + \beta) \quad
%%   \gamma = 1/(\alpha + \beta + 1)
%% \end{equation}
%% $\gamma$ can also be interpreted as a correlation\ldots
%%
%% In a sensory experiment $\mu$ can be interpreted as the probability of
%% a correct answer, $p_c$ averaged over subjects. The corresponding
%% values of $p_d$ and $d'$ can be found by eq.~\eqref{eq:2} and
%% \eqref{eq:6}.
%%
%% The standard error of the $\mu$ and $\gamma$ can be found from the
%% variance-covariance matrix of the parameters, which in turn can be
%% found from the Hessian of the likelihood function. The standard errors
%% of the population $p_d$ and population $d'$ can also be found from
%% eq.~\eqref{eq:9} and \eqref{eq:10}.
%%
%%
%% \subsection{The chance-corrected beta-binomial model}
%% \label{sec:chance-corr-beta}
%%
%% In the chance-corrected beta-binomial model the probability of a
%% correct answer is only allowed in the interval $(p_g; 1)$. This means
%% that the probability of a correct answer for any one individual is not
%% allowed to be less than the guessing probability. This is sensible
%% since it is impossible for the underlying probability of a correct
%% answer for any individual to be less than the guessing probability for
%% any of the simple sensory discrimination protocols.
%%
%% The likelihood function for the chance-corrected beta-binomial model
%% is
%% \begin{align}
%%   \ell(\alpha, \beta; x, n) =&~ - N \log Beta(\alpha, \beta) +
%%   \sum_{j=1}^N \log \nonumber \\
%%   &~\left\{ \sum_{i=1}^{x_j} {{x_j} \choose i}
%%     (1-p_g)^{n_j-x_j+i} p_g^{x_j-i}
%%     Beta(\alpha + i, n_j - x_j + \beta) \right\}
%%   \label{eq:23}
%% \end{align}
%% The parameters, $\mu$ and $\gamma$ still live in the interval $(0;
%% 1)$, but $\mu$ now has to be interpreted on the $p_d$ scale rather
%% than the $p_c$ scale. Transformation to the other scales and
%% computation of standard errors follow the approach in
%% section~\ref{sec:simple-discr-meth}.
%%
%% \subsection{A mixture of discriminators and non-discriminators}
%% \label{sec:mixt-discr-non}
%%
%% It does not have to be assumed that assessors are either
%% discriminators or non-discriminators. These assessor types can be
%% regarded as the extreme endpoints in-between which the population of
%% assessors distribute.
%%
%%
%% \subsection{Difference testing in replicated experiments}
%% \label{sec:diff-test-repl}
%%
%% \subsubsection{Model based likelihood ratio tests}
%% \label{sec:model-based-likel}
%%
%% A likelihood ratio test of over-dispersion can be calculated by
%% comparing the likelihood of the beta-binomial model with the
%% likelihood of the standard binomial model.
%% Suppose 20 panelists each conduct 10 triangle tests, with $x_i$
%% number of correct responses out of $n_i = 10$ trials for all $i$,
%% where $i = 1, \ldots, N$, $N = 20$.
%%
%% The likelihood ratio test statistic for the test of over-dispersion is
%% \begin{equation}
%%   \label{eq:35}
%%   LR_{over-disp} = 2 \{\ell_{beta-bin}(\hat\mu, \hat\gamma; x; n) -
%%   \ell_{binom}(\hat\mu; x, n)\} ,
%% \end{equation}
%% where $\ell_{beta-bin}(\hat\mu, \hat\gamma; x; n)$ is the log-likelihood of
%% the (chance-corrected) beta-binomial model given by (\ref{eq:17})
%% or (\ref{eq:23}) evaluated at the ML estimates and
%% $\ell_{binom}(\hat\mu; x, n)$ is the value of
%% \begin{equation}
%%   \label{eq:36}
%%   \ell_{binom}(\mu; x, n) = \sum_{i = 1}^N \left\{ \log {n_i \choose x_i} + x_i
%%   \log \mu + (n_i - x_i) \log (1 - \mu) \right\}
%% \end{equation}
%% evaluated at the ML estimate $\hat\mu = \sum_i x_i / \sum_i n_i$. Here
%% the alternative hypothesis is that $ X_i \sim \textup{binom}(\hat\mu,
%% n_i)$; that observations from different individuals are independent
%% and binomially distributed with the same probability of a correct
%% answer; $p_c = \mu$.
%%
%% The likelihood ratio statistic asymptotically follows a
%% $\chi_1^2$-distribution, so the $p$-value is given by:
%% \begin{equation}
%%   \label{eq:37}
%%   p\textup{-value} = P(\chi^2_1 \geq LR_{over-disp})
%% \end{equation}
%% We assume a 1 degree of freedom reference
%% distribution, but this may not be appropriate since this is in fact a
%% test of $\gamma = 0$ versus $\gamma > 0$ and hence a test on the
%% boundary of the parameter space for $\gamma$. The appropriate df may
%% be closer to one half, but this has not been empirically
%% justified. One df corresponds to the two-sided test and the test is in
%% fact one-sided, which motivates halving the degrees of freedom.
%% Choosing df = 1 is on the safe side since this leads to
%% $p$-value which may be a little too large.
%%
%%
%% The test of ``any difference'', is a test of $H_0:~ \mu_i = p_g$ for
%% all $i$, i.e. for all individuals, versus the general alternative that
%% for some individuals at least the probability of a correct answer is
%% different from the guessing probability, $p_g$ and \emph{possibly}
%% varies among individuals. The likelihood under the null hypothesis,
%% $\ell_{binom}(p_g; x, n)$ is given by eq. (\ref{eq:36}) where $\mu =
%% p_g$, and the likelihood ratio statistic is given by
%% \begin{equation}
%%   \label{eq:38}
%%   LR_{any-diff} = 2 \{\ell_{beta-bin}(\hat\mu, \hat\gamma; x; n) -
%%   \ell_{binom}(p_g; x, n)\} .
%% \end{equation}
%%
%% We assume a $\chi_2^2$ reference distribution for this statistic, so
%% the $p$-value is
%% \begin{equation}
%%   \label{eq:39}
%%   p\textup{-value} = P(\chi^2_2 \geq LR_{any-diff})
%% \end{equation}
%% This is still partly a test at the boundary of the parameter
%% space, so the stated $p$-value may be a little too large.
%%
%% Yet another test of ``any difference'' is possible. First realize that
%% if there really is no sensory difference between the objects/products,
%% then the probability of a correct answer will for all assessors be
%% $p_g$ and there is no room for over-dispersion. We may therefore test
%% $H_0:~ \mu = p_g$ versus $H_1:~ \mu > p_g$ where $\mu$ is the average
%% probability of a correct answer and
%% $\hat\mu = \sum_i x_i / \sum_i n_i$.
%% The likelihood ratio statistic for this test is
%% \begin{equation}
%%   \label{eq:40}
%%   LR_{mean-diff} = 2 \{\ell_{binom}(\hat\mu; x, n) -
%%   \ell_{binom}(p_g; x, n)\}
%% \end{equation}
%% which asymptotically follows a $\chi_1^2$-distribution, so the
%% $p$-value is given by:
%% \begin{equation}
%%   \label{eq:41}
%%   p\textup{-value} = P(\chi^2_1 \geq LR_{mean-diff}) .
%% \end{equation}
%%
%% These three models are nested, so the $LR$-statistics are additive in
%% the following way
%% \begin{equation}
%%   \label{eq:42}
%%   LR_{any-diff} = LR_{mean-diff} + LR_{over-disp}
%% \end{equation}
%% and the degrees of freedom add up similarly. This can be used to
%% provide some insight into the power of these tests under various
%% scenarios. If there is only little or perhaps no over-dispersion at
%% all, then the mean difference test will provide the most powerful
%% test. If most of the structure in the data is due to over-dispersion,
%% then the test directly targeted at over-dispersion will be the most
%% powerful. It is, however, not possible to observe over-dispersion
%% without a difference in mean, so the joint test of mean and
%% dispersion, i.e. the test of any difference, may be almost as
%% powerful. We expect that when the beta-binomial model is appropriate,
%% i.e. when there is appreciable over-dispersion, the test of any
%% difference will be superior to, or at least as good as, the other
%% two tests.
%%

<<Initialize, echo=FALSE, results=hide>>=

options(op)

@

\bibliography{meth}
%% \newpage
%% \appendix
%% \input{Appendix.tex}


\end{document}





