%% LyX 2.4.4 created this file.  For more info, see https://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[american,noae]{scrartcl}
\usepackage{lmodern}
\renewcommand{\sfdefault}{lmss}
\renewcommand{\ttdefault}{lmtt}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{float}
\usepackage{url}
\usepackage{amsmath}
\usepackage{geometry}
\geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in}
\usepackage[authoryear]{natbib}

\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
<<echo=F>>=
  if(exists(".orig.enc")) options(encoding = .orig.enc)
@
\providecommand*{\code}[1]{\texttt{#1}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\VignetteIndexEntry{Using rockchalk}

\usepackage{Sweavel}
\usepackage{graphicx}
\usepackage{color}

\usepackage[samesize]{cancel}



\usepackage{ifthen}

\makeatletter

\renewenvironment{figure}[1][]{%
 \ifthenelse{\equal{#1}{}}{%
   \@float{figure}
 }{%
   \@float{figure}[#1]%
 }%
 \centering
}{%
 \end@float
}

\renewenvironment{table}[1][]{%
 \ifthenelse{\equal{#1}{}}{%
   \@float{table}
 }{%
   \@float{table}[#1]%
 }%
 \centering
%  \setlength{\@tempdima}{\abovecaptionskip}%
%  \setlength{\abovecaptionskip}{\belowcaptionskip}%
% \setlength{\belowcaptionskip}{\@tempdima}%
}{%
 \end@float
}


%\usepackage{listings}
% Make ordinary listings look as if they come from Sweave
\lstset{tabsize=2, breaklines=true, %,style=Rstyle}
                        fancyvrb=false,escapechar=`,language=R,%
                        %%basicstyle={\Rcolor\Sweavesize},%
                        backgroundcolor=\Rbackground,%
                        showstringspaces=false,%
                        keywordstyle=\Rcolor,%
                        commentstyle={\Rcommentcolor\ttfamily\itshape},%
                        literate={<<-}{{$\twoheadleftarrow$}}2{~}{{$\sim$}}1{<=}{{$\leq$}}2{>=}{{$\geq$}}2{^}{{$^{\scriptstyle\wedge}$}}1{==}{{=\,=}}1,%
                        alsoother={$},%
                        alsoletter={.<-},%
                        otherkeywords={!,!=,~,$,*,\&,\%/\%,\%*\%,\%\%,<-,<<-,/},%
                        escapeinside={(*}{*)}}%


% In document Latex options:
\fvset{listparameters={\setlength{\topsep}{0em}}}
\def\Sweavesize{\scriptsize} 
\def\Rcolor{\color{black}} 
\def\Rbackground{\color[gray]{0.90}}

\usepackage{dcolumn}
\usepackage{siunitx}

\makeatother

\usepackage{babel}
\usepackage{listings}
\renewcommand{\lstlistingname}{\protect\inputencoding{latin9}Listing}

\begin{document}
\title{Using rockchalk for Regression Analysis\thanks{In case you want to be involved in development, the source code is
available on GitHub. Please browse \protect\url{http://github.com/pauljohn32}. The
read-only git archive address is \protect\url{git://github.com/pauljohn32/rockchalk.git}.
Once you learn how to ``clone'' the repository with a git client,
then we can discuss ways for you to contribute.}}
\author{Paul E. Johnson}
\maketitle
\begin{abstract}
The rockchalk package includes functions for estimating regressions,
extracting diagnostic information from them, preparing presentable
tables, and creating helpful two and three dimensional plots. It is
primarily intended to facilitate teachers and students who are conducting
ordinary least squares and elementary generalized models. Compared
to other packages that help with regression analysis, this one offers
more-than-the-usual amount of emphasis on the analysis of regression
models that include interaction terms. It includes functions that
can receive a fitted model and then return standardized, mean-centered,
and residual-centered regression results. The plotting functions address
some limitations of R's \code{termplot} function. Version 1.8 introduces
a consistent framework for the creation of ``\code{newdata}'' objects
and the calculation of predicted values so that the marginal effects
of separate predictors can be easily explored.
\end{abstract}
% In document Latex options:
\fvset{listparameters={\setlength{\topsep}{0em}}}
\SweaveOpts{ae=F,nogin=T}

<<Roptions, echo=F>>=
options(device = pdf)
options(width=80, prompt=" ", continue="  ")
options(useFancyQuotes = FALSE) 
@

\section{Introduction}

This is the Spring, 2019 update of the rockchalk package. I offer
a course in regression analysis for social and behavioral scientists
every year. As the course goes on, I keep track of the difficulties
that the students experience with R and I craft functions that facilitate
their work on particular assignments. The new features in this edition
are the functions \code{descriptiveTable()} and \code{waldt()}
(for regression reports), tools for regression simulation, including
\code{genCorrelatedData3()} and \code{cutFancy()}, and some support
for common recoding tasks in multi-level modeling. Finally, the
regression table-writing function \code{outreg()} can now generate
decimal-aligned columns. A new vignette demonstrating the \code{outreg()}
function is introduced with the package.

As in previous versions, a core of the effort is development of an
internally coherent framework that allows students to estimate regressions
(usually with lm and glm) and then create useful summary tables and
diagnostic reports. It is easy to create ``newdata'' objects and
interact with fitted regressions. The \code{newdata()} function will
be handy for just about any kind of regression. For the common kinds
of regression in base R and lme4, there is also a convenience function
called \code{predictOMatic()} that allows one to specify a regression
model and then receive predicted values for a range of input values.
The default output will be similar to other regression software tools
that provide ``marginal effect'' estimates for regression analysis. 

I have exerted a great deal of effort to improve the consistency of
predicted value plots for regression analysis. The \code{plotSlopes()}
function is a plotter for linear models with interaction terms. It
draws a plot on the screen, but it also creates an output object that
can be subjected to a follow-up hypothesis test (by the \code{testSlopes()}
function). There is a plot method for testSlopes objects to help with
analysis of interactions. 

For people who want to plot nonlinear regressions, the function \code{plotCurves()}
is a replacement for \code{plotSlopes()}. It can handle any sort
of nonlinearity and interaction on the right hand side. Where some
regression formula can cause R's \code{termplot()} function to fail,
\code{plotCurves()} will succeed.

The core functionality of many functions has not been altered. Functions
to provide standardized regression coefficients, mean-centered, or
residual-centered regression coefficients have not changed much, although
the plots of them are improved. 

\section{Facilitating Collection of Summary Information}

\subsection{summarize: A replacement for summary}

When an R function provides output that is not suitable for a term
paper, the student must cobble together some code and assemble a customized
table. In my opinion, R's summary() function for data frames is not
adequate. It lacks summaries of the diversity of observations. In addition,
the output is not formatted in a way that is conducive to the creation
of plots or tables. As a result, I offer the function summarize(),
which separates the numeric from factor variables and provides an
alphabetized summary. 

<<echo=F>>=
library(rockchalk)
@

Consider the results of applying summarize() to Fox's Chile data set
from the carData package:

<<echo=T>>=
library(carData)
data(Chile)
(summChile <- summarize(Chile))
@

The result object is a list that includes a data frame for the numeric
variables, with named rows and (optionally) alphabetized columns,
as well as a separate report for each factor variable. Users who wish
to summarize only the numeric variables can run summarizeNumerics()
instead, while others who want to summarize only factors can run summarizeFactors().
The output from summarizeFactors() is a list of factor summaries. 

A companion function is centralValues(), which will provide only one
number for each variable in a data frame. For numerics, it returns
the mean, while for factor variables, it returns the mode. 

<<echo=T>>=
centralValues(Chile)
@

\subsection{Easier predictions and newdata objects}

Students struggle with R's predict() methods. The primary challenge
is in the creation of a newdata object of interesting values of the
predictors. If the newdata argument, here myNewDF in \bgroup\inputencoding{latin9}\lstinline!predict(m1, newdata = myNewDF)!\egroup
 is not exactly right, the command will fail. If the regression formula
uses functions like ``as.numeric()'' or ``as.factor()'', it is almost
impossible for first-time R users to get this right. In version 1.8,
I believe I have solved the problem entirely with the functions newdata()
and predictOMatic(). 

Let's fit an example regression with the Chile data set from the carData
package.

<<echo=T>>=
m1 <- lm(statusquo ~ age + income + population + region + sex, data = Chile)
@

\noindent The default output of predictOMatic() will cycle through
the predictors, one by one.

<<echo=T>>=
m1pred <- predictOMatic(m1)
m1pred
@

The newdata() and predictOMatic() functions handle the details and
let the user adjust their requests for very fine-grained control.
Here are the key arguments for these functions.
\begin{enumerate}
\item divider. The name of an algorithm to select among observed values
for which predictions are to be calculated. These are the possibilities.

\begin{description}
\item [{``seq''}] an evenly spaced sequence of values from low to high
across the variable.
\item [{``quantile''}] quantile values that emanate from the center of
the variable ``outward''. 
\item [{``std.dev.''}] the mean plus or minus the standard deviation,
or 2 standard deviations, and so forth.
\item [{``table''}] when a variable has only a small set of possible
values, this selects the most frequently observed values.
\end{description}
\item n. The number of values to be selected.
\item predVals. Where the divider argument sets the default algorithm to
be used, the predVals argument can choose variables for focus and
select divider algorithms separately for the variables. It is also
allowed to declare particular values.
\end{enumerate}
The user can request that particular values of the predictors are
used, or can declare one of several algorithms for selection of focal
values. All of these details are managed by the argument called predVals,
which is described in the help pages (with plenty of examples). 

<<echo=T>>=
mypred2 <- predictOMatic(m1, predVals = c("age", "region"), n = 3)
mypred2
mypred3 <- predictOMatic(m1, predVals = c(age = "std.dev.", region = "table"), n = 3)
mypred3
mypred4 <- predictOMatic(m1, predVals = list(age = c(18, 30, 45), region = c("SA", "C","N")), n = 3)
mypred4
@

I've invested quite a bit of effort to make sure this works dependably
with complicated regression formulae. Formulae such as y \textasciitilde{}
x1 + log(x2 + alpha) + poly(x3, d) will just work. (In case one is
interested to know \emph{why} this works, the secret recipe is a new
function called model.data(). Under the hood, this required some hard
work, a frustrating chain of trial and error that is discussed in
the vignette \emph{Rchaeology}, which is distributed with this package).

The \code{predictOMatic()} function will work when the regression
model provides a predict method that can handle the newdata argument.
If the regression package does not provide such a method, the user
should step back and use the newdata() function to create exemplar
predictor data frames and then calculate predicted values manually.
The newdata() function will set all variables to center values and
then it will create a ``mix and match'' combination of the ones
that the user asks for. This sequence shows ways to ask for various
``mix and match'' combinations of values from age and region. 

<<echo=T>>=
mynewdf <- newdata(m1, predVals = c("age","region"), n = 3)
mynewdf
@

<<echo=T>>=
mynewdf2 <- newdata(m1, predVals = list(age = "std.dev.", region = c("SA", "C","N")))
mynewdf2
@

<<echo=T>>=
mynewdf3 <- newdata(m1, predVals = list(age = c(20, 30, 40), region = c("SA", "C","N")))
mynewdf3
@\\
Of course, functions from rockchalk or any other R package can be
placed into the process for choosing focal values. 

<<echo=T>>=
mynewdf <- newdata(m1, predVals = list(age = getFocal(Chile$age, n = 3), region = getFocal(Chile$region, n = 3)))
mynewdf
@\\
The function getFocal() is a generic function; it will receive variables
of different types and ``do the right thing.'' By default, focal
values of numeric variables are quantile values. For factor values,
the most frequently observed values are selected. These are customizable,
as explained in the documentation. The \code{newdata()} output can
be used in a predict() function call as demonstrated above. 

It would be nice if every regression model's predicted values were
accompanied by 95\% confidence intervals. Models fit by \code{lm()}
can supply confidence intervals, but those fit by \code{glm()} cannot. At the current
time, there are many competing methods that might be used to calculate
those intervals; predict.glm() in R's stats package avoids the issue
entirely by not calculating intervals. In rockchalk-1.8, I crafted
some code to calculate confidence intervals for glm objects using
the (admittedly crude) Wald-based approximation. On the scale of the
linear predictor, we calculate a 95\% confidence interval, and then
use the inverse link function to transform that onto the space of
the observed response. In this example, I replicate an example that
is an R classic, from the help page of predict.glm. The reader will
note that the output includes a warning about the construction of
the confidence interval.

<<echo=T>>=
df <- data.frame(ldose = rep(0:5, 2), sex = factor(rep(c("M", "F"), c(6, 6))), 
	SF.numdead = c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16))      
df$SF.numalive <-  20 - df$SF.numdead
budworm.lg <- glm(cbind(SF.numdead, SF.numalive) ~ sex*ldose, data = df,  family = binomial)
predictOMatic(budworm.lg, predVals = c(ldose = "std.dev.", sex = "table"), interval = "confidence")  
@
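
For comparison (a sketch, not part of the package's interface), the
same Wald interval can be constructed manually with base R tools:
request the fit and standard error on the link scale from predict(),
build the interval there, and transform with the family's inverse
link function.

\bgroup\inputencoding{latin9}
\begin{lstlisting}
## Sketch with base R only: Wald interval on the link scale,
## then transformed by the inverse link
p <- predict(budworm.lg, newdata = df, type = "link", se.fit = TRUE)
lwr <- p$fit - 1.96 * p$se.fit
upr <- p$fit + 1.96 * p$se.fit
linkinv <- family(budworm.lg)$linkinv
cbind(fit = linkinv(p$fit), lwr = linkinv(lwr), upr = linkinv(upr))
\end{lstlisting}
\leavevmode\egroup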

\subsection{Descriptive Tables}

In regression-based reports, it is common for authors to present a
brief summary table to represent the data. The \code{descriptiveTable()}
function was created to make that process easier and more predictable.
The aim is to provide the usual summary statistics for the numeric
predictors and a table of proportions for the categorical predictors.
See Table \ref{tab:descriptiveTable}. The user can customize the
requested statistical summaries for numeric predictors.

\begin{table}[H]
\caption{The descriptiveTable function\label{tab:descriptiveTable}}

<<>>=
descriptiveTable(m1, digits=6)
@
\end{table}

\noindent In my opinion, the output is, well, simple and understandable.
More importantly, it will be sufficient for a term paper, once it
is converted into a presentable final format. It can be converted
into HTML or LaTeX with various R packages, such as xtable. Note that
because this descriptive table is derived from the fitted regression,
it summarizes \emph{only the cases} that were actually used in the
regression analysis. If cases were lost by listwise deletion, the
deleted cases are not taken into account here.
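
As a sketch of that conversion (assuming the xtable package is
installed; the exact coercion needed may depend on the class of the
returned object):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
## Sketch, assuming the xtable package is available
library(xtable)
dt <- descriptiveTable(m1, digits = 6)
## coerce to a plain data frame before handing off to xtable
print(xtable(as.data.frame(dt)), type = "latex")
\end{lstlisting}
\leavevmode\egroup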

\section{Better Regression Tables: Some outreg Examples.}

On May 8, 2006, Dave Armstrong, who was a political science PhD student
at the University of Maryland, posted a code snippet on r-help that demonstrated
one way to use the ``cat'' function from R to write \LaTeX{} markup.
That gave me the idea to write a \LaTeX{} output scheme that would
help create some nice looking term and research papers. I wanted ``just
enough'' information, but not too much.

Since 2006, many new R packages have been introduced for the creation
of regression tables, but I still prefer to maintain \code{outreg}.
I fight to keep this simple, but have added some features in response
to user requests. For me, the biggest question is whether the format
should be ``wide'' or ``tight''. A tight format stacks estimates
and standard errors into a single column, while a wide format has
them in separate columns. In response to user requests, stars for
statistical significance were made customizable. It is now allowed
to directly insert vectors of parameter estimates and standard errors.
In this version, a new parameter \code{centering} is introduced for
decimal-centered columns. 

In the following, I will demonstrate some tables for lm and glm fits
on a simulated data set. This new simulation function, genCorrelatedData3(),
is a convenient way to create multiple-predictor regression data sets
of arbitrary size and complexity, allowing for interactions and nonlinearity. 

<<createdata1, echo=T>>=
set.seed(1234)
dat <- genCorrelatedData3(N = 100, means = c(x1 = 0, x2 = 10, x3 = 0), sds = c(1, 2, 1), rho = c(0, 0, 0), stde = 10, beta = c(0, -3, 4, 0), verbose = FALSE)
@\\
That creates a new matrix with variables $x1,$ $x2$, $x3$, and
y. We run some linear regressions and then create a categorical output
variable for a logistic regression.

<<createdata2, echo=T>>=
m1 <- lm(y ~ x1 + x2, data = dat)
m2 <- lm(y ~ x2, data = dat)
m3 <- lm(y ~ x1 + x2 + x3, data = dat)
## Create categorical variant
myilogit <- function(x) exp(x)/(1 + exp(x))
dat$y3 <- rbinom(100, size = 1, prob = myilogit(scale(dat$y)))
gm1 <- glm(y3 ~ x1 + x2, data = dat, family = binomial)
@

The outreg examples are offered in Tables \ref{tab:Tab1} through
\ref{tab:Combined-OLSGLM}. Table \ref{tab:Tab1} is the ``tight''
format output for three models, obtained from \code{outreg(list(m1, m2, m3), centering=\textquotedbl siunitx\textquotedbl )}.
On the other hand, as illustrated in Table \ref{tab:Tab2}, the author
can request a wide table (tight = FALSE) and provide more elaborate
model labels and adjust the significance stars. In Table \ref{tab:Combined-OLSGLM},
observe that the linear and generalized linear model output peacefully
co-exist, side-by-side. This output is, in my opinion, acceptable
for inclusion in a professional presentation or conference paper,
although there are still some warts in it.

The default output from \code{outreg} will have left-aligned columns.
Because users requested decimal-centered columns, the argument centering
has been introduced. If \code{centering=\textquotedbl siunitx\textquotedbl}
or \code{centering=TRUE}, the \code{outreg} table will be adjusted
to use features in the LaTeX package ``\code{siunitx}''. The LaTeX
package dcolumn is probably more familiar to users, and \code{\textquotedbl dcolumn\textquotedbl}
is also a legal value for centering, but the results are not quite
as good. As a \textbf{caution}, I hasten to mention that if a user
asks for decimal-centered columns, the user has the duty to insert
into the document preamble either ``\code{\textbackslash usepackage\{siunitx\}}''
or ``\code{\textbackslash usepackage\{dcolumn\}}''.

\begin{table}
\caption{Decimal-centered columns in the ``tight'' style\label{tab:Tab1}}

<<outreg10, echo = F, results=tex>>=
or <- outreg(list(m1, m2, m3), centering = "siunitx")
@
\end{table}

\begin{table}
\caption{The wide format obtained with tight = FALSE\label{tab:Tab2}}

<<outreg20, results=tex, echo=F>>=
or1 <- outreg(list("The First Model with a Long Title" = m1, "Another Model" = m3), tight = FALSE, alpha = c(0.05, 0.01, 0.001), centering = "siunitx")
@
\end{table}

\begin{table}
\caption{Combined OLS and GLM Estimates\label{tab:Combined-OLSGLM}}

<<outreg70, results=tex, echo=F>>=
or2 <- outreg(list(m1,gm1), modelLabels = c("OLS:y","GLM: Categorized y"))
@
\end{table}

I understand that some authors need to include regression tables in
documents that are not prepared with LaTeX. The parameter type = ``HTML''
will change the output format to HTML. An HTML file can be displayed
in a web browser and it can be imported in traditional ``word processor''
programs.
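
A sketch of that usage (the exact argument spelling and the return
behavior are documented in \code{?outreg}):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
## Sketch: write an HTML version of the three-model table to a
## file that a word processor can import (see ?outreg for details)
or.html <- outreg(list(m1, m2, m3), type = "HTML")
cat(or.html, file = "outreg-table.html")
\end{lstlisting}
\leavevmode\egroup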

\section{Plotting Regressions with Interactions }

\subsection{Interaction in Linear Regression. }

One of the most fundamental skills in regression analysis is the interpretation
of interactive predictors. It is much easier for students to understand
the effect of an interaction if they can create a nice plot to show
how the predicted value depends on various values of a predictor.
The \code{plotSlopes()} function was introduced in 2010 when I was
teaching a large first-year graduate course (more than 50 students)
and it became apparent that about 20 percent of them would not be
able to manage the R coding required to draw several lines on a single
plot. Unfortunately, R's termplot() function will not draw regressions
involving interactions.

The rockchalk package has two functions to help with this, \code{plotSlopes()}
and \code{plotCurves()}. \code{plotCurves()} is more general, it
can handle any kind of formula that the user estimates. \code{plotSlopes()}
is more limited, it is only for lm objects. In return for that limitation,
\code{plotSlopes()} creates an output object which can be used to
conduct post-hoc hypothesis tests. 

\begin{figure}
<<ps05, fig=T, echo=F, include=F, height=5, width=6>>=
m1ps <- plotSlopes(m1, plotx = "x2", xlab = "x2 from model m1", interval = "confidence", opacity = 80, col = "red",
ylim = c(-70, 15), legendArgs = list(x="topright"))
@

\includegraphics[height=4in]{rockchalk-ps05}

\caption{plotSlopes: Linear Model with Confidence Interval\label{fig:ps05}}
\end{figure}

At its most elementary level, \code{plotSlopes()} is a ``one step''
regression line plotter. If the regression model includes more than
one predictor, then a single predictor is displayed on the horizontal
axis and the other predictors are set on their central values. A plot
for the model m1, which was illustrated above, is presented in Figure
\ref{fig:ps05}. In rockchalk-1.8, new arguments were added to allow
the ``see-through'' confidence region. The command to generate Figure
\ref{fig:ps05} was

<<echo=T, eval=F>>=
<<ps05>>
@

\noindent I've adjusted the color and opacity to illustrate the usage
of those arguments. The y range is adjusted to make a little extra
room for the legend. The \code{plotSlopes()} function is very flexible.
All of the label, color, and scale arguments of a plot function are
also available. The plotSlopes function also works well if the moderator
is a categorical variable. 

It is important to note that the output object, m1ps, has the information
necessary to re-create the plotted line in the form of a \code{newdata}
data frame. The first few lines in the \code{newdata} object are

<<>>=
m1ps$newdata[1:3, ]
@

<<ps09, fig=F, echo=F, height=9, width=6>>=
dat$y4 <- 1 + 0.1 * dat$x1 - 6.9 * dat$x2 + 0.5 * dat$x1*dat$x2 + 0.2 * dat$x3 + rnorm(100, m = 0, sd = 10)
m4 <- lm(y4 ~ x1*x2 + x3, data = dat)
@

\begin{figure}
<<ps10, fig=T, echo=F, height=9, width=6>>=
par(mfcol=c(2,1))
m4psa <- plotSlopes(m4, plotx = "x1", modx = "x2", xlab = "x1 is a fun plotx")
m4psb <- plotSlopes(m4, plotx = "x2", modx = "x1", modxVals = "std.dev.", xlab = "x2 is plotx", ylim = c(-120, 15),
legendArgs = list(x = "topright"))
par(mfcol=c(1,1))
@

\caption{plotSlopes Illustrated\label{fig:ps10}}
\end{figure}

This is more interesting if we fit a regression with an interaction
term, such as 

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m4 <- lm(y4 ~ x1*x2 + x3, data = dat)
\end{lstlisting}
\leavevmode\egroup

\noindent We then ask \code{plotSlopes} to draw the predicted values
using one numeric variable as the horizontal axis and values of another
variable (a moderator) are set at particular values. Either x1 or
x2 can be viewed as the ``moderator'' predictor, the one on which
the effect of the other depends. In rockchalk version 1.8, the selection
of values of the moderator was generalized, so that the user can specify
either a function that selects values, or a vector of values, or the
name of an algorithm. The default algorithm will choose quantile values,
but Figure \ref{fig:ps10} demonstrates also the ``std.dev.'' divider
algorithm. The code to produce that figure was

<<eval=F, echo=T>>=
<<ps10>>
@

When \code{modx} is a numeric variable, then some particular values
must be selected for calculation of predicted value lines. The \code{modxVals}
argument is used to either specify moderator values or an algorithm
to select focal values. By default, three hypothetical values of \code{modx}
are selected (the quantiles 25\%, 50\%, and 75\%). 

If \code{modx} is a factor variable, then the most frequently observed
scores will be selected for consideration. The default display will
include the regression line as well as color-coded points for the
subgroups represented by values of the moderator.

\begin{figure}
<<echo=F>>=
fourCat <- gl(4,25, labels=c("East","West","South", "Midwest"))
dat$x4 <- sample(fourCat, 100, replace = TRUE)
dat$y5 <- 1 + 0.1 * dat$x1 + contrasts(dat$x4)[dat$x4, ] %*% c(-1,1,2) + rnorm(100,0, sd=10)
m5 <- lm (y5 ~ x1*x4 + x3, data=dat)
@

<<ps20, fig=T, echo=F>>=
m5psa <- plotSlopes(m5, plotx = "x1", modx = "x4", xlab = "x1 is a Continuous Predictor", xlim = magRange(dat$x1, c(1.2,1)))
@

\caption{plotSlopes with a Categorical Moderator\label{fig:ps20}}
\end{figure}

\noindent{}
\begin{figure}
<<ps21, fig=T, echo=F>>=
m5psb <- plotSlopes(m5, plotx = "x1", modx = "x4", modxVals = c("West","East"), xlab = "x1 is a Continuous Predictor", xlim=magRange(dat$x1, c(1.2,1)), interval = "conf")
@

\caption{plotSlopes: the interval argument\label{fig:ps21}}
\end{figure}
Suppose we have a four-valued categorical variable, ``West'', ``Midwest'',
``South'', and ``East''. If that variable is used in an interaction
in the regression model, then the \code{plotSlopes} output will include
four lines, one for each region. For example, consider Figure \ref{fig:ps20},
which is created by

<<eval=F>>=
<<ps20>>
@

\noindent The categorical variable is x4. 

It is possible to superimpose confidence intervals for many subgroups,
but sometimes these plots start to look a little bit ``busy''. The
mixing of shades in overlapping intervals may help with that problem.
A plot that focuses on just two subgroups is presented in Figure \ref{fig:ps21},
which is produced by 

<<eval=F>>=
<<ps21>>
@ 

In rockchalk version 1.8, I've exerted quite a bit of effort to make
sure that colors are chosen consistently when users remove or insert
groups in these plots. The same value of the moderator should always
be plotted in the same way–the line, points, and interval colors should
not change. Note, for example, in Figures \ref{fig:ps20} and \ref{fig:ps21},
the line for East is black in both plots, while the line for West
is red in both. 

\subsection{testSlopes, a companion of plotSlopes}

The students in psychology and political science are usually interested
in conducting some diagnostic analysis of the interactive terms. \citet{Aiken1991}
(and later Cohen, Cohen, West, and Aiken, \citeyear{Cohen2002}) propose
using the t test to find out if the effect of the ``plotx'' variable
is statistically significantly different from zero for each particular
value of the moderator variable. The new version of rockchalk declares
a method plot.testSlopes that handles the work of plotting the interaction.

The usual case would be the following. First, carry out the regression
analysis. Then run plotSlopes, and run testSlopes, and then pass that
result object to the plot method.

\begin{figure}
<<ts10, fig=T, echo=F, height=6, width=6>>=
m4psats <- testSlopes(m4psa)
plot(m4psats)
@

\caption{testSlopes for an Interactive Model\label{fig:ts10}}
\end{figure}

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m4 <- lm(y4 ~ x1*x2 + x3, data = dat)
m4ps <- plotSlopes(m4, plotx = "x1", modx ="x2", xlab = "x1 is a Continuous Predictor")
m4psats <- testSlopes(m4ps)
plot(m4psats)
\end{lstlisting}
\leavevmode\egroup

\noindent The output from testSlopes will differ, depending on whether
modx is numeric or a factor. If it is a factor, then the slope of
the lines and the associated t-test for each will be presented. My
psychology students call these ``simple-slopes'', following the
terminology of Aiken and West. The general idea is that we want to
know if the ``combined'' effect of plotx is not zero. For a model
stated with predictors $plotx_{i}$ and $modx_{i}$ as
\begin{equation}
y_{i}=b_{0}+b_{plotx}plotx_{i}+b_{modx}modx_{i}+b_{plotx:modx}plotx_{i}\cdot modx_{i}+\ldots+e_{i}
\end{equation}

\noindent the null hypothesis would be

\noindent
\begin{equation}
H_{0}:0=\hat{b}_{simple\,slope}=\hat{b}_{plotx}+\hat{b}_{plotx:modx}modx
\end{equation}

\noindent If modx is a factor, then we simply calculate the slope
of each line and the test is straightforward. If modx is a numeric
variable, then we confront a problem that is a bit more interesting.
We don't really want to say that the simple slope is different from
0 for particular values of $modx$, but instead we want to answer
the question, ``for which values of the moderator would the effect
of the plotx variable be statistically significant?'' This necessitates
the calculation of the so-called Johnson-Neyman interval \citep{JohnsonNeyman1936},
a plot of which is presented in Figure \ref{fig:ts10}.

The method of calculation is outlined in \citet{PreacherCurren2006}.
The values of $modx$ associated with a statistically significant
effect of $plotx$ on the outcome are determined from the computation
of a T statistic for $\hat{b}_{simple\,slope}$. The J-N interval
is the set of values of $modx$ for which the following (quadratic
equation) holds:
\begin{align}
\hat{t} & =\frac{\hat{b}_{simple\,slope}}{std.err(\hat{b}_{simple\,slope})}\nonumber \\
 & =\frac{\hat{b}_{simple\,slope}}{\sqrt{\widehat{Var(\hat{b}_{plotx})}+modx^{2}\widehat{Var(\hat{b}_{plotx\cdot modx})}+2modx\widehat{Cov(\hat{b}_{plotx},\hat{b}_{plotx\cdot modx})}}}\geq T_{\frac{\alpha}{2},df}
\end{align}

\noindent Suppose there are two real roots, $root1$ and $root2$.
The values of $modx$ for which the slope is statistically significant
may be a compact interval, $[root1,root2]$, as demonstrated in Figure
\ref{fig:ts10}, or it may be two open intervals, $(-\infty,root1]$
and $[root2,\infty)$. I had expected almost all applications to result
in that latter case, but the somewhat surprising case illustrated
in Figure \ref{fig:ts10} is not too infrequently observed. 
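
The simple-slope quantities in these equations can be reproduced
directly from the fitted model's coefficients and covariance matrix.
A sketch for model m4, treating x1 as plotx and x2 as the moderator
(the value 10 is purely illustrative):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
## Sketch: simple slope of x1 at a chosen moderator value,
## with its Wald standard error, following the equations above
b <- coef(m4)
V <- vcov(m4)
modx <- 10   ## an illustrative value of x2
bsimple <- b["x1"] + modx * b["x1:x2"]
se <- sqrt(V["x1", "x1"] + modx^2 * V["x1:x2", "x1:x2"] +
           2 * modx * V["x1", "x1:x2"])
tstat <- bsimple/se
\end{lstlisting}
\leavevmode\egroup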

\subsection{plotCurves for nonlinear predictor formulae}

In the most recent revision of this package, I believe I have made
the plotCurves() function redundant. Before I delete the function
entirely, I'm waiting for feedback. There was a problem in the past.
Some students used plotSlopes() (which drew straight lines) when they
intended to draw curves (and should have used plotCurves()).

plotCurves() was developed to generalize the plotting capability for
regression formulas that include nonlinear transformations. It was
for models that have polynomials or terms that are logged (or otherwise
transformed). In that sense, plotCurves() is rather similar to R's
own termplot() function, but plotCurves() has two major advantages.
First, it allows interactions, and second, it handles some complicated
formulae that termplot() is not able to manage. 

Suppose a dependent variable y5 is created according to a nonlinear
process.
\begin{equation}
y5_{i}=-3x1_{i}+15\log(0.1+x2_{i}-\min(x2))+1.1x2_{i}+8.2x1_{i}\times x2_{i}+e_{i}
\end{equation}

<<echo=F,include=F>>=
dat$y5 <- with(dat, -3*x1 + 15*log(0.1 + x2 - min(x2)) + 1.1*x2 + 8.2 *x1 * x2 + 10*rnorm(100))
@

The estimated model is

<<>>=
m5 <- lm(y5 ~ log(x2) + x1 * x2, data = dat)
@

In the new version, the function calls for plotSlopes() and plotCurves()
are the same:

<<pcps20a, fig=T, eval=F, height=4, width=6>>=
m5pc <- plotCurves(m5, plotx = "x2", modx = "x1", main = "plotCurves output", ylim = c(-210, 330),
legendArgs = list(x = "topleft", title = "x1", cex = 0.8))
@

<<psps20b, fig=T, eval=F, echo = T, height = 4, width = 6>>=
m5ps <- plotSlopes(m5, plotx = "x2", modx = "x1", main = "plotSlopes output", ylim = c(-210, 330),
legendArgs = list(x = "topleft", title = "x1", cex = 0.8))
@

So far as I have been able to tell, the results are the same. See
Figure \ref{fig:pcps20}.

\begin{figure}

<<eval=T, fig=T,echo=F,height = 4, width = 6>>=
<<pcps20a>>
@

<<eval=T, fig=T, echo=F,height = 4, width = 6>>=
<<psps20b>>
@

\caption{Illustrations of m5 with plotSlopes and plotCurves\label{fig:pcps20}}

\end{figure}


\subsection{plotPlane}

The persp() function in R works well, but its interface is too complicated
for most elementary and intermediate R users. To facilitate its use
for regression users, the plotPlane() function is offered.

The plotPlane function offers a visualization of the mutual effect
of two predictors, whether or not the regression model is linear.
plotPlane() is designed to work like plotCurves(), tolerating nonlinear
components in the regression formula. plotPlane() allows the depiction
of a three-dimensional curving plane that ``sits'' in the cloud of data
points. The variables that are not explicitly pictured in the plotPlane()
figure are set to central reference values. Recall model m4, which
used the formula \bgroup\inputencoding{latin9}\lstinline!y4 ~ x1*x2 + x3!\egroup
. As illustrated in Figure \ref{fig:pp100}, plotPlane() presents
a reasonable view of the predicted values.

\begin{figure}
<<pp100, fig=T, echo=F>>=
p100 <- plotPlane(m4, plotx1 = "x1", plotx2 = "x2", phi = 10, theta = -80, lcol = gray(.70))
@

\caption{plotPlane for the Interactive Model m4\label{fig:pp100}}
\end{figure}

Because plotPlane() is a simple convenience wrapper for R's persp()
function, it responds to the same customizing arguments that persp()
would allow. The arguments phi and theta will rotate the figure, for
example. The output in Figure \ref{fig:pp100} is produced by the
following.

<<echo=T,eval=F>>=
<<pp100>>
@

\noindent One of the major educational benefits of the 3-D figure
is that students can easily see that a model with a simple interaction
effect is \emph{not a linear model any more}. We will return to that
point in the discussion of mean centering in regression analysis.

Recall that model m5 has the rather complicated nonlinear formula \bgroup\inputencoding{latin9}\lstinline!y5 ~ log(x2) + x1 * x2!\egroup
. The plotCurves() output for that model was already presented in Figure
\ref{fig:pcps20}. The three-dimensional view of the same is presented
in Figure \ref{fig:pp110}, but with an added twist. The twist is
that the predicted value lines from the 2-D plot functions can be
superimposed on the plane. The function addLines() does the work of
translating a 2-D plot object onto the regression plane. 

\begin{figure}
<<pp111, fig = T, echo = F, height = 5, results = hide>>=
ppm5 <- plotPlane(m5, plotx1 = "x2", plotx2 = "x1", phi = 0, npp = 15, lcol = gray(.80))
addLines(from = m5pc, to = ppm5, col = m5pc$col)
@

\caption{Making plotSlopes and plotPlane Work Together\label{fig:pp110}}
\end{figure}


\section{Standardized, Mean-Centered, and Residual-Centered Regressions }

\subsection{Standardized regression}

Many of us learned to conduct regression analysis with SPSS, which
reported both the ordinary (unstandardized) regression coefficients
as well as a column of ``beta weights'', the output of a ``standardized''
regression analysis. Each variable, for example $x1_{i}$, was replaced
by an estimated $Z$-score: $(x1_{i}-\overline{x1})/std.dev.(x1_{i})$.
Some people think these coefficients are easier to interpret (but,
for a strong cautionary argument against them, see \citealp{King1986}).
R offers no such thing as standardized regression, probably because
this practice is thought to be mistaken. The automatic standardization
of all predictors, no matter whether they are categorical, interaction
terms, or transformed values (such as logs) is dangerous. 

To illustrate that, the rockchalk package introduces a function called standardize().
Each column of the design matrix is scaled to a new variable with
mean 0 and standard deviation 1. The result from standardize() will
be an R lm object, which will respond to any follow-up analysis commands.
For example:

<<>>=
m4 <- lm (y4 ~ x1 * x2, data = dat)
m4s <- standardize(m4)
@

I doubt that a reasonable person would actually want a standardized
regression, and I have tried to warn users in the output.

<<>>=
summary(m4s)
@

\begin{table}
\caption{Comparing Ordinary and Standardized Regression\label{tab:stdreg10}}

<<stdreg10, results=tex, echo=F>>=
or10 <- outreg(list(m4, m4s), tight = F, modelLabels = c("Not Standardized","Standardized"))
@
\end{table}


\subsection{Mean-centered Interaction Models}

Sometimes people will fit a model like this
\begin{equation}
y_{i}=b_{o}+b_{1}x1_{i}+b_{2}x2_{i}+e_{i}
\end{equation}

\noindent and then wonder, ``is there an interaction between $x1_{i}$
and $x2_{i}$?'' The natural inclination is to run this model, 

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m1 <- lm(y ~ x1*x2)
\end{lstlisting}
\leavevmode\egroup

\noindent or its equivalent

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m2 <- lm(y ~ x1 + x2 + x1:x2)
\end{lstlisting}
\leavevmode\egroup

Researchers have been advised that they should not run the ordinary
interaction model without ``mean-centering'' the predictors \citep{Aiken1991}.
They are advised to replace $x1_{i}$ with $(x1_{i}-\overline{x1})$
and $x2_{i}$ with $(x2_{i}-\overline{x2})$, so that the fitted model
will be
\begin{equation}
y_{i}=b_{o}+b_{1}(x1_{i}-\overline{x1})+b_{2}(x2_{i}-\overline{x2})+b_{3}(x1_{i}-\overline{x1})(x2_{i}-\overline{x2})+e_{i}
\end{equation}

This is a little tedious to do in R, so I provide a function meanCenter()
that can handle the details. meanCenter() will receive a model, scan
it for interaction terms, and then center the variables that are involved
in interactions. We previously fit the model m4, and now we center
it.

<<>>=
m4mc <- meanCenter(m4)
summary(m4mc)
@

By default, meanCenter() will only center the variables involved in
an interaction, and it leaves the others unchanged. The user can request
a different treatment of the variables. Version 1.8 introduces the
argument ``terms'', which allows the user to list the names of the
predictors that should be centered. If the user wants all of the numeric
predictors to be mean-centered, the argument centerOnlyInteractors
should be set to FALSE:

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m4mc <- meanCenter(m4, centerOnlyInteractors = FALSE)
\end{lstlisting}
\leavevmode\egroup

By default, it does not standardize while centering (but the user
can request standardization with the argument standardize = TRUE).
The option centerDV causes the dependent variable to be centered as
well.
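For example, to center every numeric predictor, standardize them, and center the dependent variable as well, the arguments described above can be combined (a sketch):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m4mcs <- meanCenter(m4, centerOnlyInteractors = FALSE,
                    standardize = TRUE, centerDV = TRUE)
\end{lstlisting}
\leavevmode\egroup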

\subsection{Residual-centered Models}

Residual-centering \citep{LittleBovaird2006} is another adjustment
that has been recommended for models that include interactions or
squared terms. Like mean-centering, it is often recommended as a way
to obtain smaller standard errors or to make estimates more numerically
stable. Like mean centering, it causes a superficial change in the
estimated coefficients, but the predicted values and the regression
relationship are not actually changed. Nothing of substance is altered.

The residualCenter() function is used in the same manner as meanCenter().
The user fits an interactive model and the result object is passed
to residualCenter() like so:

<<>>=
m4rc <- residualCenter(m4)
summary(m4rc)
@

I would explain residual-centering as follows. Suppose we fit the
linear model, with no interaction (note, I'm calling the coefficients
$c_{j}$, not $b_{j}$ as usual):

\begin{equation}
y=c_{0}+c_{1}x1+c_{2}x2+e{}_{i}.\label{eq:rc20}
\end{equation}

\noindent Let's proceed as if those parameter estimates, $\hat{c}_{1}$,
$\hat{c}_{2}$, are the ``right ones'' for our analytical purpose.
We'd like to fit an interactive model, but protect the linear parts
from fluctuation. In R, when we run the model \bgroup\inputencoding{latin9}\lstinline!lm(y ~ x1 * x2)!\egroup
, we are allowing all of the coefficients of all variables to fluctuate
\begin{equation}
y_{i}=b_{o}+b_{1}x1_{i}+b_{2}x2_{i}+b_{3}x1_{i}\times x2_{i}+e_{i}\label{eq:rcwant}
\end{equation}
Residual centering is one way to stabilize the estimation by assuring
that $\hat{b}_{1}=\hat{c}_{1}$ and $\hat{b}_{2}=\hat{c}_{2}$. Only
the coefficient $\hat{b}_{3}$ floats freely.

One of the reasons that residual-centering is so appealing is that
its stabilizing benefit is obtained almost by accident. Here is the
gist of the calculations. First, estimate a regression in which the
left hand side is the interaction product term:
\begin{equation}
(x1_{i}\times x2_{i})=d_{0}+d_{1}x1_{i}+d_{2}x2_{i}+u_{i}\label{eq:residCentered}
\end{equation}

\noindent The residuals from that regression are, by definition, orthogonal
to both $x1$ and $x2$. Call those fitted residuals $\widehat{u_{i}}$.
Then we run the interactive regression, replacing the column of the
predictor $x1_{i}\times x2_{i}$ with $\widehat{u}_{i}$. That is
to say, the model we want, equation (\ref{eq:rcwant}), is estimated
as: 
\begin{equation}
y_{i}=b_{0}+b_{1}x1_{i}+b_{2}x2_{i}+b_{3}\widehat{u_{i}}+e_{i},\label{eq:rc10-1}
\end{equation}

\noindent In essence, we have taken the interaction $(x1_{i}\times x2_{i})$,
and purged it of its parts that are linearly related to $x1_{i}$
and $x2_{i}$. 
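The same two-step computation can be carried out by hand, which may clarify what residualCenter() automates (a sketch, re-using the variables x1, x2, and y4 in dat):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
## Step 1: regress the product term on its components
step1 <- lm(I(x1 * x2) ~ x1 + x2, data = dat)
dat$u <- resid(step1)  ## orthogonal to x1 and x2 by construction
## Step 2: replace the product term with the residual
m4rcByHand <- lm(y4 ~ x1 + x2 + u, data = dat)
\end{lstlisting}
\leavevmode\egroup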

rockchalk 1.6 included summary, print, and predict methods for the
residual-centered regression objects. It is worth mentioning that
the code can handle interactions of arbitrarily many predictors. If
the formula has \bgroup\inputencoding{latin9}\lstinline!lm(y ~ x1 * x2 * x3 * x4)!\egroup
, for example, this implies many separate interactions must be calculated.
We need to calculate residual-centered residuals for the two-way products
$x1\cdot x2$, $x1\cdot x3$, $x1\cdot x4$, $x2\cdot x3$, and so forth,
then use them as predictors to get centered estimates of three-way terms
such as $x1\cdot x2\cdot x3$, and then those centered values become
predictors for the four-way interaction.
Aspiring R programmers who want to learn about programming with R
formula objects might benefit from the study of the function residualCenter.R
in the rockchalk source code. 

\section{A Brief Analysis of Mean-Centering}

We can put the tools together by making a little detour into the question
that seems to plague every regression analyst at one time or another:
What does that interaction term \emph{really mean}? Along the way,
we will try to dispel the idea that centering somehow makes estimates
``better'' or more numerically precise. The primary advocates of
centering as a way to deal with numerical instability are \citet{Aiken1991},
who integrated that advice into the very widely used regression textbook,
\emph{Applied Multiple Regression/Correlation for the Behavioral Sciences}
\citep{Cohen2002}. They claim that the inclusion of interactions
causes ``inessential multicollinearity'' that is alleviated by centering.
The advice is widely followed. One statistics book intended for biologists
observed, for example, ``We support the recommendation of Aiken \&
West (1991) and others that multiple regression with interaction terms
should be fitted to data with centered predictor values'' (\citealp{QuinnKeough2002},
Chapter 6). 

Technical rebuttals have been published already (\citealp{Kromrey1998}),
but the matter still seems not widely understood. The argument is
not that mean-centering (or residual-centering) is wrong, but rather
that it is unnecessary. It is irrelevant and possibly misleading.

At the core of the matter is the fact that our uncertainty about regression
estimates depends on our point of view. Please review the confidence
interval in Figure \ref{fig:ps05}. The $y$ axis is not even ``in
the picture.'' Would one's appreciation of the regression's predictive
line be enhanced if the y axis were moved into the picture? 

We can move the $y$ axis by centering the predictor variable. Suppose
we replace $x_{i}$ with $x_{i}-5$ and then re-estimate the model.
That has the simple effect of moving the $y$ axis 5 units to the
right. The slope is unchanged, and the reported intercept is changed
in a completely superficial way. The predicted value line, the slope
estimates, the residual standard error, and so forth, are not changed
in any substantial way. This is simply a matter of user convenience.
I believe that no reasonable person can say the regression is ``better''
after centering $x_{i}$.
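A skeptical reader can verify this directly (a sketch, assuming a simple regression of y on x in a data frame dat):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
m1 <- lm(y ~ x, data = dat)
m1shift <- lm(y ~ I(x - 5), data = dat)  ## moves the y axis 5 units
## The slopes are identical; only the intercept changes,
## and it changes by exactly 5 * slope.
coef(m1)
coef(m1shift)
\end{lstlisting}
\leavevmode\egroup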

However, there is a superficial difference that has deceived many
authors. Notice that the confidence interval is hourglass shaped.
\emph{At the new $y$ axis}, our prediction is more precise. If we
move the y axis further to the right, into the center of the data,
say by mean-centering ($x_{i}-10$), we move to the position that
allows an even more precise prediction. The estimate of the intercept's
standard error will be smaller for the obvious reason. We are not
actually gaining certainty, we are simply reporting our most favorable
``snapshot'' of it. The predicted value, and the confidence interval
for any observed value of $x_{i}$ is completely unchanged by centering.

It should not surprise the reader to learn that mean-centering interactive
predictors improves the standard errors in the same illusory way.
Let's work through an example to see why this is so tempting. Suppose
the true data generating mechanism is an interaction like so
\begin{equation}
y_{i}=2+0.1x1_{i}+0.1x2_{i}+0.2\cdot(x1_{i}\times x2_{i})+e_{i},
\end{equation}

\noindent where $e_{i}\sim N(0,300^{2})$ and $\rho_{x1,x2}=0.4$. 

A regression analysis that ignores the interaction, 

\bgroup\inputencoding{latin9}
\begin{lstlisting}
lm(y ~ x1 + x2, data = dat2)
\end{lstlisting}
\leavevmode\egroup

\noindent is reported in the first column of Table \ref{tab:meancenter10-1}.
I've used outreg's new alpha argument to emphasize the ``really good''
estimates with more stars. Notice that everything is ``statistically
significant!''

Unable to leave well enough alone, the researcher wonders, ``is there
an interaction between $x1$ and $x2$?'' Run \bgroup\inputencoding{latin9}
\begin{lstlisting}
lm(y ~ x1 * x2, data = dat2)
\end{lstlisting}
\leavevmode\egroup

\noindent The second column in Table \ref{tab:meancenter10-1} summarizes
that regression. Be prepared for a shock when you scan the estimates.
Almost everybody I know has said ``what the heck?'' or ``Holy Cow!''
or ``Oh My God, my great result went to Hell, I'll never get tenure!''
Neither of the key variables is ``statistically significant'' any
more. 

<<echo=F, include=F>>=
dat2 <- genCorrelatedData3(N=400, rho=.4, stde=300, beta=c(2,0.1,0.1,0.2))
m6linear <- lm (y ~ x1 + x2, data=dat2)
m6int <- lm (y ~ x1 * x2, data=dat2)
m6mc <- meanCenter(m6int)
m6rc <- residualCenter(m6int)
@

\begin{table}
\caption{Comparing Regressions\label{tab:meancenter10-1}}

<<mcenter10, results=tex, echo=F>>=
or11 <- outreg(list(m6linear, m6int, m6mc, m6rc), tight=F, modelLabels=c("Linear", "Interaction","Mean-centered","Residual-centered"), alpha = c(0.05, 0.01))
@
\end{table}

Cohen et al.\ claim that the apparent instability of the coefficients
is a reflection of ``inessential collinearity,'' due to the fact
that $x1$ and $x2$ are correlated with the new term, $x1\times x2$.
They advised their readers to ``mean-center'' their predictors and
run the regression again. 

Remember the hourglass shape of the confidence interval. By mean-centering,
we are re-positioning ourselves for a much more favorable snapshot.
The welcome effect of the centered estimates is found in the third
column of Table \ref{tab:meancenter10-1}. The point estimates in
that snapshot are ``significant again.'' It appears we have ``solved''
the problem of inessential collinearity. 

The first hint of trouble is in the fact that the coefficients of
the interactive effects in columns 2 and 3 are identical. Those coefficients
are the same because they are estimates of the cross partial derivative
$\partial^{2}y/\partial x1\partial x2$. That particular value is
the same, no matter where we position the $y$ axis, as it should
be. Note as well that the root mean squared error and $R^{2}$ estimates
are identical. Everything that we expect to remain the same is the
same. Except for the fact that the slopes and their hypothesis tests
seem better in the centered model, we would think there is nothing
interesting here.

Here's a puzzle for you. Consider Figure \ref{fig:pscentering}, which
shows the predicted values and confidence intervals from the centered
and uncentered regressions. Is there any substantively important difference
between these two regressions?

\begin{figure}
<<pscenter, fig=T, echo=F, include = F, height = 10, width = 5, results = hide>>=
par(mfcol = c(2, 1))
plotSlopes(m6int, plotx = "x1", modx = "x2", modxVals = "std.dev.",  n = 2, interval = "confidence", main= "Not Centered")
plotSlopes(m6mc, plotx = "x1c", modx = "x2c", modxVals = "std.dev.", n = 2, interval = "confidence", main = "Mean Centered")
par(mfcol = c(1, 1))
@

\includegraphics[height=8in]{rockchalk-pscenter.pdf}

\caption{plotSlopes for the centered and non-centered regressions\label{fig:pscentering}}
\end{figure}

Perhaps the 2-D plots are not persuasive. Don't stop yet. The 3-D
plots will help quite a bit. We have not yet grasped the most fundamental
change caused by the insertion of the interaction term. When we insert
$x1_{i}\times x2_{i}$, we change the fundamental nature of the regression
surface. The surface of the fitted model is no longer a ``flat''
plane, but rather it is a curving surface. 

I've assembled 3-D plots in Figure \ref{fig:mcenter30}. We see that
the ordinary interaction, mean-centered, and residual-centered models
produce identical predicted values! The only difference in the figures
is that the axes of the predictors have been re-scaled, so that the
$y$ axis is (implicitly) re-positioned. Now that we understand the
situation, we could play around with the data and move the axis back
and forth until we arrive at a position that is most favorable to
our interpretation. 

\begin{figure}
<<mcenter50, fig=T,echo=FALSE, height=5, width=7>>=
op <- par(no.readonly = TRUE)
par(mfcol=c(2,2))
par(mar=c(2,2,2,1))
plotPlane(m6linear, plotx1="x1", plotx2="x2", plotPoints=FALSE, main="Linear", ticktype="detailed")
plotPlane(m6int, plotx1="x1", plotx2="x2", plotPoints=FALSE, main="Interaction: Not Centered", ticktype="detailed")
plotPlane(m6mc, plotx1="x1c", plotx2="x2c", plotPoints=FALSE, main="Mean-centered", ticktype="detailed")
plotPlane(m6rc, plotx1="x1", plotx2="x2", plotPoints=FALSE, main="Residual-centered", ticktype="detailed")
par(op)
@

\caption{Predicted Planes from Centered and Uncentered Fits Identical\label{fig:mcenter30}}
\end{figure}

The regression coefficient estimates are snapshots, each summarizing
the curvature at one particular point in a curving surface. It seems
quite apparent in Figure \ref{fig:mcenter30} that the models are
identical, and yet we receive different regression reports from different
spots. The non-centered data offers us the slope estimate from the
``front-left'' part of the graph. Mean-centered estimates report
on the slope in the middle of the graph. In the rockchalk examples
folder, one can find a file called ``centeredRegression.R'' that
walks through this argument step by step. 

What about residual-centering? Because the transformation that it
employs is more abstract, I initially thought it was actually a different
model. And yet it is not. The residual-centered model is completely
equivalent to the ordinary interaction model and the mean-centered
model. For a given combination of the input values, the predicted
values and confidence intervals are the same. The predicted values
of the ordinary interactive model, the mean-centered model, and the
residual-centered models are illustrated in Figure \ref{fig:rcenter40}. 

\begin{figure}
<<rcenter40, fig=T,echo=FALSE, include=F, height=4, width=8>>=
dat3 <- centerNumerics(dat2)
m6mcpred <- predict(m6mc, newdata=dat3)
m6rcpred <- predict(m6rc, newdata=dat3)
m6intpred <- predict(m6int, newdata=dat3)
op <- par(no.readonly = TRUE)
par(mfcol=c(1,2))
plot(m6intpred, m6rcpred, main="", xlab="Predictions of Uncentered Interaction", ylab="Residual-centered Predictions")
predcor <- round(cor(m6intpred, m6rcpred),3)
legend("topleft", legend=c(paste0("Correlation=", predcor)))
plot(m6mcpred, m6rcpred, main="", xlab="Mean-centered Predictions", ylab = "Residual-centered Predictions")
predcor <- round(cor(m6mcpred, m6rcpred),3)
legend("topleft", legend=c(paste0("Correlation=", predcor)))
par(op)
@

\includegraphics[width = 6.5in]{rockchalk-rcenter40}

\caption{Predicted Values of Mean and Residual-centered Models\label{fig:rcenter40}}
\end{figure}

The take-away point from this is that there is no free lunch in regression
analysis. If re-scaling a variable by adding or subtracting a constant
seems to change a result, one should be cautious and suspect an error. 

In order to drive the point home, I'd like to show that it is possible
to translate between the estimates of any one of these fitted models
and the estimates of the others. The ordinary model is
\begin{equation}
y_{i}=b_{0}+b_{1}x1_{i}+b_{2}x2_{i}+b_{3}(x1_{i}\times x2_{i})+e1_{i}\label{eq:int}
\end{equation}

The mean-centered model is 
\begin{equation}
y_{i}=c_{0}+c_{1}(x1_{i}-\overline{x1})+c_{2}(x2_{i}-\overline{x2})+c_{3}(x1_{i}-\overline{x1})\cdot(x2_{i}-\overline{x2})+e2_{i}\label{eq:mc1}
\end{equation}

In order to compare with equation \ref{eq:int}, we would re-arrange
like so

\begin{equation}
y_{i}=c_{0}+c_{1}(x1_{i})-c_{1}\overline{x1}+c_{2}(x2_{i})-c_{2}\overline{x2}+c_{3}(x1_{i}x2_{i}+\overline{x1}\overline{x2}-\overline{x1}x2_{i}-\overline{x2}x1_{i})+e2_{i}
\end{equation}

\begin{equation}
y_{i}=c_{0}+c_{1}(x1_{i})-c_{1}\overline{x1}+c_{2}(x2_{i})-c_{2}\overline{x2}+c_{3}(x1_{i}x2_{i})+c_{3}\overline{x1}\overline{x2}-c_{3}\overline{x1}x2_{i}-c_{3}\overline{x2}x1_{i}+e2_{i}
\end{equation}

\begin{equation}
y_{i}=\{c_{0}-c_{1}\overline{x1}-c_{2}\overline{x2}+c_{3}\overline{x1}\overline{x2}\}+\{c_{1}-c_{3}\overline{x2}\}x1_{i}+\{c_{2}-c_{3}\overline{x1}\}x2_{i}+c_{3}(x1_{i}x2_{i})+e2_{i}\label{eq:mc3}
\end{equation}

One can then compare the parameter estimates from equations \ref{eq:int}
and \ref{eq:mc3}. Both \ref{eq:int} and \ref{eq:mc3} include a
single parameter times $(x1_{i}x2_{i}),$ leading one to expect that
the estimate $\hat{b}_{3}$ should be equal to the estimate of $\hat{c}_{3}$,
and they are! Less obviously, one can use the fitted coefficients
from either model to deduce the fitted coefficients from the other.
The following equalities describe that relationship.
\begin{eqnarray}
\hat{b}_{0} & = & \hat{c}_{0}-\hat{c}_{1}\overline{x1}-\hat{c}_{2}\overline{x2}+\hat{c}_{3}\overline{x1}\overline{x2}\\
\hat{b}_{1} & = & \hat{c}_{1}-\hat{c}_{3}\overline{x2}\\
\hat{b}_{2} & = & \hat{c}_{2}-\hat{c}_{3}\overline{x1}\\
\hat{b}_{3} & = & \hat{c}_{3}
\end{eqnarray}
The estimated fit of equation \ref{eq:mc1} would provide estimated
coefficients $\hat{c}_{j}$, $j=0,...,3$, which would then be used
to calculate the estimates from the non-centered model.
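These identities are easy to check numerically. Here is a sketch, using the fitted objects m6int and m6mc from above (the names x1c and x2c follow meanCenter()'s convention of appending ``c'' to centered variables):

\bgroup\inputencoding{latin9}
\begin{lstlisting}
ch <- coef(m6mc)
xb1 <- mean(dat2$x1); xb2 <- mean(dat2$x2)
b0 <- ch[1] - ch["x1c"]*xb1 - ch["x2c"]*xb2 + ch["x1c:x2c"]*xb1*xb2
b1 <- ch["x1c"] - ch["x1c:x2c"]*xb2
b2 <- ch["x2c"] - ch["x1c:x2c"]*xb1
## compare b0, b1, b2 with coef(m6int)
\end{lstlisting}
\leavevmode\egroup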

The estimation of the residual-centered model requires two steps.
The residual from this regression model

\noindent
\begin{equation}
\widehat{(x1_{i}\times x2_{i})}=\hat{d}_{0}+\hat{d}_{1}x1_{i}+\hat{d}_{2}x2_{i}.
\end{equation}

\noindent is 
\begin{equation}
\widehat{u}_{i}=(x1_{i}\times x2_{i})-\widehat{(x1_{i}\times x2_{i})},
\end{equation}

\noindent which is inserted into equation \ref{eq:int}. 

\begin{equation}
y_{i}=h_{0}+h_{1}x1_{i}+h_{2}x2_{i}+h_{3}\{x1_{i}\times x2_{i}-\widehat{x1_{i}\times x2_{i}}\}+e3_{i}\label{eq:rc1}
\end{equation}

Replacing $\widehat{x1_{i}\times x2_{i}}$ with $\hat{d}_{0}+\hat{d}_{1}x1_{i}+\hat{d}_{2}x2_{i}$,
\ref{eq:rc1} becomes

\begin{eqnarray}
y_{i} & = & h_{0}+h_{1}x1_{i}+h_{2}x2_{i}+h_{3}\{x1_{i}\times x2_{i}-\hat{d}_{0}-\hat{d}_{1}x1_{i}-\hat{d}_{2}x2_{i}\}+e3_{i}\\
 & = & h_{0}+h_{1}x1_{i}+h_{2}x2_{i}+h_{3}\{x1_{i}\times x2_{i}\}-h_{3}\hat{d}_{0}-h_{3}\hat{d}_{1}x1_{i}-h_{3}\hat{d}_{2}x2_{i}+e3_{i}\\
 & = & \{h_{0}-h_{3}\hat{d}_{0}\}+\{h_{1}-h_{3}\hat{d}_{1}\}x1_{i}+\{h_{2}-h_{3}\hat{d}_{2}\}x2_{i}+h_{3}\{x1_{i}\times x2_{i}\}+e3_{i}
\end{eqnarray}

As in the previous comparison, we can translate coefficient estimates
between the ordinary specification and the residual-centered model.
The coefficient estimated for the product term, $\hat{h}_{3}$, should
be equal to $\hat{b}_{3}$ and $\hat{c}_{3}$ (and it is!). If we
fit the residual centered model, \ref{eq:rc1}, we can re-generate
the coefficients of the other models like so: 
\begin{eqnarray}
\hat{b}_{0} & =\hat{c}_{0}-\hat{c}_{1}\overline{x1}-\hat{c}_{2}\overline{x2}+\hat{c}_{3}\overline{x1}\overline{x2}= & \hat{h}_{0}-\hat{h}_{3}\hat{d}_{0}\\
\hat{b}_{1} & =\hat{c}_{1}-\hat{c}_{3}\overline{x2}= & \hat{h}_{1}-\hat{h}_{3}\hat{d}_{1}\\
\hat{b}_{2} & =\hat{c}_{2}-\hat{c}_{3}\overline{x1}= & \hat{h}_{2}-\hat{h}_{3}\hat{d}_{2}
\end{eqnarray}

From the preceding, it should be clear enough that the three models
are equivalent.

\section{Conclusion}

The rockchalk package is offered as a system of support for teachers
and students of regression analysis. It should help with the preparation
of plots, summary tables, and other diagnostics. 

A number of functions are currently offered in this package that have
not been emphasized in this writeup. I would direct the reader to the
help pages for these functions:
\begin{description}
\item [{combineLevels}] a recoder for factor variables
\item [{mcDiagnose}] a one stop shop for multicollinearity diagnostic information
\item [{getDeltaRsquare}] calculate the change in the $R^{2}$ that results
from the omission of each variable. This is the squared semi-partial
correlation coefficient.
\item [{getPartialCor}] calculates the partial correlation matrix of the
predictors in a regression model
\item [{lazyCor~and~lazyCov}] convenient ways to create correlation and
covariance matrices that are needed in many simulation exercises
\item [{mvrnorm}] a slightly improved version of the MASS package's multivariate
normal generator.
\end{description}
\bibliographystyle{chicago}
\bibliography{rockchalk}

\end{document}
