%% LyX 2.4.4 created this file. For more info, see https://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[american,noae]{scrartcl} \usepackage{lmodern} \renewcommand{\sfdefault}{lmss} \renewcommand{\ttdefault}{lmtt} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{babel} \usepackage{url} \usepackage{enumitem} \usepackage{geometry} \geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in} \usepackage[authoryear]{natbib} \usepackage[pdfusetitle, bookmarks=true,bookmarksnumbered=false,bookmarksopen=false, breaklinks=false,pdfborder={0 0 1},backref=section,colorlinks=false] {hyperref} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. <>= if(exists(".orig.enc")) options(encoding = .orig.enc) @ \newlength{\lyxlabelwidth} % auxiliary length \providecommand*{\code}[1]{\texttt{#1}} \providecommand*{\strong}[1]{\textbf{#1}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{Rchaeology} \usepackage{Sweavel} \usepackage{graphicx} \usepackage{color} \usepackage{babel} \usepackage[samesize]{cancel} \usepackage{ifthen} \makeatletter \renewenvironment{figure}[1][]{% \ifthenelse{\equal{#1}{}}{% \@float{figure} }{% \@float{figure}[#1]% }% \centering }{% \end@float } \renewenvironment{table}[1][]{% \ifthenelse{\equal{#1}{}}{% \@float{table} }{% \@float{table}[#1]% }% \centering % \setlength{\@tempdima}{\abovecaptionskip}% % \setlength{\abovecaptionskip}{\belowcaptionskip}% % \setlength{\belowcaptionskip}{\@tempdima}% }{% \end@float } %\usepackage{listings} % Make ordinary listings look as if they come from Sweave \lstset{tabsize=2, breaklines=true,style=Rstyle} % In document Latex options: \fvset{listparameters={\setlength{\topsep}{0em}}} \def\Sweavesize{\scriptsize} \def\Rcolor{\color{black}} \def\Rbackground{\color[gray]{0.90}} \usepackage{babel} \usepackage[samesize]{cancel} \usepackage{ifthen} \usepackage{listings}% Make ordinary listings look as if they come from Sweave \lstset{tabsize=2, breaklines=true, style=Rstyle} % Added by lyx2lyx \setlength{\parskip}{\smallskipamount} \setlength{\parindent}{0pt} \makeatother \usepackage{listings} \renewcommand{\lstlistingname}{\protect\inputencoding{latin9}Listing} \begin{document} \title{Rchaeology: Idioms of R Programming} \author{Paul E. Johnson } \maketitle This document was initiated on May 31, 2012. The newest copy will always be available at \url{https://pj.freefaculty.org/R/} and as a vignette in the R package ``rockchalk''. \begin{description} \item [{Rchaeology:}] The study of R programming by investigation of R source code. It is the effort to discern the programming strategies, idioms, and style of R programmers in order to better communicate with them. \item [{Rchaeologist:}] One who practices Rchaeology. \end{description} These are Rcheological observations about the style and mannerisms of R programmers in their native habitats. Almost all of the insights here are gathered from the r-help and r-devel emails lists, the stackoverflow website pages for R, and the R source code itself. These are lessons from the ``school of hard knocks.'' How is this different from Rtips(\url{https://pj.freefaculty.org/R/Rtips.pdf})? \begin{enumerate} \item This is oriented toward programming R, rather than using R. \item It is more synthetic, aimed more at finding ``what's right'' rather than ``what works.'' \item It is written with Sweave (using Harrell's Sweavel style) so that code examples work. \end{enumerate} Where did the ``R Style'' section go? It was removed to a separate vignette, RStyle, in the rockchalk package. \tableofcontents{} <>= dir.create("plots", showWarnings=F) @ % In document Latex options: \fvset{listparameters={\setlength{\topsep}{0em}}} \SweaveOpts{prefix.string=plots/plot,ae=F,height=4,width=6} <>= options(width=100, continue=" ") options(useFancyQuotes = FALSE) set.seed(12345) pdf.options(onefile=F,family="Times",pointsize=12) @ \section{Introduction: R Idioms.} This vignette is about the R idioms I have learned while working on the rockchalk package. At the current time, I don't understand all of the R idioms that are common in the advanced R programmers' conversation, but I am getting some traction. There is a language gap between an R user and an R programmer. Users write ``scripts'' that use functions from R packages. Users don't write (many) functions. Users don't make packages. And many users are happy to keep it that way. For users who want to become programmers, there is usually a harsh awakening. R for development is a different language. Well, that's wrong. It is a different dialect. A transitional R user will have to learn a lot of terminology and undergo a change of paradigm. There are resources available! Everybody should subscribe to the email lists for the R project (\url{https://www.r-project.org/mail.html}), especially r-help (for user questions) and r-devel. Rchaeologically speaking, that is the native habitat of the R programmers. There is also a burgeoning collection of blogs and Web forums, perhaps most notably the R section on StackExchange (\url{https://stackoverflow.com/questions/tagged/r}). Excellent books have been published. As an Rcheologist, I am drawn to the books that are written by the R insiders. \emph{S Programming}, by William Venables and Brian Ripley \citeyearpar{venables_s_2000}, is a classic. Books by pioneering developers Robert Gentleman, \emph{R Programming for Bioinformatics} (2009), and John Chambers, \emph{Software for Data Analysis} \citeyearpar{chambers_software_2008}, are, well, awesome. In 2012, I started teaching R programming using Norman Matloff's, \emph{The Art of R Programming} (2012), which I think is great and recommend strongly (even though Professor Matloff is not one of the R Core Team members, so far as I know). There are now three types of object-oriented programming in R (S3, S4, and reference classes) and the programmer is apparently free to select among them without prejudice. The best brief explanation of S3 that I've found is in Friedrich Leisch's brief note about R packaging, ``Creating R Packages: A Tutorial'' (2009). One should be mindful of the fact that R is provided with several manuals, one of which is the \emph{R Language Definition}. I find that one difficult to understand, and I usually can't understand it until I search through the R source code and packages for usage examples. I usually approach R coding in three phases. I write code that ``works'', however tedious and slow it might be. There will be a lot of ``copy and paste'' constructions with tedious, manual editing of commands. Then I read through the code and look for repetitious elements, and I re-organize to make functions that abstract that work. If there are any stanzas that look mostly the same, except that ``x1'' is repalced by ``x2'', I know the work is not done. Finally, I look through the code to find ``dumb'' constructions that can be cleaned up and completed with fewer lines. A dumb construction is one that I would be ashamed to show to an expert R programmer. Beginning R programmers, the ones who have never studied C or Java or Fortran, will often become ``stuck'' in stage one of that process. Experienced programmers with more formal training will usually have the ``little voice'' in the back of their minds saying ``there's a better way,'' and, unfortunately, R novices don't hear that voice. Lacking some of the discipline imposed by those other languages, the usual ``rambling'' R script will seem ``good enough.'' The first step for the transitional R programmer, the one who does not want to be a novice any more, is to take a harsh look at the code that has already been written. The rockchalk package has many functions that receive fitted regression models and re-arrange their components for re-fitting or plotting. A good deal of the advice I have to offer is about the process of managing R formula, but I have also accumulated some lessons in argument handling. \section{do.call(), eval(), substitute(), formula().} If the transitional programmer understands these four functions, she/he will be well on the way to escaping the category of novice R programmer. The other benefit is that the code written by the experts, including the R source code itself, will become much more understandable. \subsection{Rewriting Formulas. My Introductory Puzzle.} On May 29, 2012, I was working on a regression problem in the rockchalk package. I have a number of functions that receive elementary regressions and then change them. I need to receive a fitted model, extract the formula, change some variables, and then revise the formula to match the new variables. And then run the model over again. The functions meanCenter() and residualCenter() receive a fitted regression model, transform some variables, and then fit a new regression. Suppose a regression has been fitted as ``\code{y \textasciitilde{} x1{*}x2}''. The R result will estimate a predictive formula such as $\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x1_{i}+\hat{\beta}_{2}x2_{i}++\hat{\beta}_{3}x1_{i}\cdot x2_{i}$. meanCenter() will replace the non-centered variables ``x1'' and ``x2'' with mean-centered variables ``x1c'' and ``x2c''. The original formula y \textasciitilde{} x1{*}x2 must be replaced with y \textasciitilde{} x1c{*}x2c. I found this to be a very complicated problem with a very satisfying answer. My first effort used R's update function. That is the most obvious approach. I learned very quickly that update() is not sufficient. It is fairly easy to replace x1 with x1c in the formula, but not when x1 is logged or otherwise transformed. Here is the runable example code that demonstrates the problem. <>= dat <- data.frame(x1 = rnorm(100, m = 50), x2 = rnorm(100, m = 50), x3 = rnorm(100, m = 50), x4 = rnorm(100, m=50), y = rnorm(100)) m2 <- lm(y ~ log(x1) + x2*x3, data = dat) suffixX <- function(fmla, x, s){ upform <- as.formula(paste(". ~ .", "-", x, "+", paste(x, s, sep = ""), sep="", collapse=" ")) update.formula(fmla, upform) } newFmla <- formula(m2) newFmla suffixX(newFmla, "x2", "c") suffixX(newFmla, "x1", "c") @ Run that and check the last few lines of the output. See how the update misses x1 inside log(x1) or in the interaction? \bgroup\inputencoding{latin9} \begin{lstlisting} > newFmla <- formula(m2) > newFmla y ~ log(x1) + x2 * x3 > suffixX(newFmla, "x2", "c") y ~ log(x1) + x3 + x2c + x2:x3 > suffixX(newFmla, "x1", "c") y ~ log(x1) + x2 + x3 + x1c + x2:x3 \end{lstlisting} \leavevmode\egroup I asked the members of r-help for assistance. Lately I've had very good luck with r-help. Gabor Grothendieck wrote an answer to r-help on May 29, 2012. He said simply, ``Try substitute,'' with this example. \bgroup\inputencoding{latin9} \begin{lstlisting} > do.call("substitute", list(newFmla, setNames(list(as.name("x1c")), "x1"))) y ~ log(x1c) + x2 * x3 \end{lstlisting} \leavevmode\egroup Problem solved, in a single line. That's very clever, I think. It packs together a half-dozen very deep thoughts. It has most of the essential secrets of R's guts, laid out in a single line. It has do.call(), substitute(), it interprets a formula as a list, and it shows that every command in R is, when it comes down to brass tacks, a list. I would like to take up these separate pieces in order. \subsection{A Formula Object is a List.} While struggling with this, I noticed this really interesting pattern. The solution depends on it. The object ``newFmla'' is not just a text string. It prints out as if it were text, but it is actually an R list object. Its parts can be probed recursively, to eventually reveal all of the individual pieces: <>= newFmla newFmla[[1]] newFmla[[2]] newFmla[[3]] newFmla[[3]][[2]] newFmla[[3]][[2]][[2]] @ How could I put that information to use? Read on. \subsection{do.call() and eval()} In my early work as an Rchaeologist, I had noticed eval() and do.call(), but did not understand their significance. Coming to grips with these ideas is a critical step in the R programmer's growth, because they separate the ``script writer'' from the ``language programmer.'' Whenever difficult problems arise in r-help, the answer almost invariably involves do.call() or eval(). \subsubsection{do.call()} Let's concentrate on do.call first. The syntax is like this \bgroup\inputencoding{latin9} \begin{lstlisting} do.call("someRFunction", aListOfArgumentsToGoInTheParentheses) \end{lstlisting} \leavevmode\egroup It is as if we were telling R to run this: \bgroup\inputencoding{latin9} \begin{lstlisting} someRFunction(aListOfArgumentsToGoInTheParentheses) \end{lstlisting} \leavevmode\egroup We use do.call() because it is much more flexible than calling someRFunction() directly. Let's consider an example that runs a regression the ordinary way, and then with do.call. In this example, the role of ``someRFunction'' will be played by lm() and the list of arguments will be the parameters of the regression. The regression m1 will be constructed the ordinary way, while m2 is constructed with do.call(). <<>>= m1 <- lm(y ~ x1*x2, data = dat) coef(m1) regargs <- list(formula = y ~ x1*x2, data = quote(dat)) m2 <- do.call("lm", regargs) coef(m2) all.equal(m1, m2) @ The object regargs is a list of arguments that R can understand when they are supplied to the lm function. do.call() is a powerful, mysterious symbol. It holds flexibility; we can calculate ``on the language'' to create commands and then run them. I first needed do.call() when we had a simulation project that ran very slowly. There's a writeup in the working examples distributed with rockchalk called stackListItems-01.R. I was using rbind() over and over to join the results of simulation runs. Basically, the code was like this \bgroup\inputencoding{latin9} \begin{lstlisting} for (i in 1:10000){ dat <- someHugeSimulation(i) result <- rbind(result, dat) } \end{lstlisting} \leavevmode\egroup That will call rbind() 10000 times. I had not realized that rbind() is time-consuming. It accesses a new chunk of memory each time it is run. On the other hand, we could collect those results in a list, then we can call rbind() one time to smash together all of the results. \bgroup\inputencoding{latin9} \begin{lstlisting} for (i in 1:10000){ mylist[[i]] <- someHugeSimulation(i) } result <- do.call("rbind", mylist) \end{lstlisting} \leavevmode\egroup It is much faster to run rbind() only once. It would be OK if we typed it all out like this: \bgroup\inputencoding{latin9} \begin{lstlisting} result <- rbind(mylist[[1]], mylist[[2]], mylist[[3]], ..., mylist[[10000]) \end{lstlisting} \leavevmode\egroup But who wants to do all of that typing? How tiresome! Thanks to Erik Iverson in r-help, I understand that \bgroup\inputencoding{latin9} \begin{lstlisting} result <- do.call("rbind", mylist) \end{lstlisting} \leavevmode\egroup is doing the EXACT same thing. ``mylist'' is a list of arguments. do.call is \emph{constructing} a function call from the list of arguments. It is \emph{as if} I had actually typed rbind with 10000 arguments. The beauty in this is that we could design a program that can assemble the list of arguments, and also choose the function to be run, on the fly. We are not required to literally write the function in quotes, as in ``rbind''. We could instead have a variable that is calculated to select one function among many, and then use do.call on that. In a very real sense, we could write a program that can write itself as it runs. From all of this (and a peek at ?call), I arrive at an Rchaeological eureka! A call object is a quoted command plus a list of arguments for that command. \subsubsection{eval()} Where does eval() fit into the picture? do.call() manufactures a call and executes it immediately. It is possible to arrest the process mid stream, to capture the call without evaluating it with eval(). The terminology in the R documentation on this is difficult, mainly because there are several different ways to create a piece of not-yet-evaluated code. By \strong{not-yet-evaluated code}, I mean an ``unevaluated expression'' or a ``call object'', anything that can be handled by the eval() function. A not-yet-evaluated object holds some syntax that can we need later. I think of do.call() as a contraction of ``eval'' and ``call''. As far as I can tell, do.call(\textquotedbl lm\textquotedbl , list(arguments)) is the same as eval(call(``lm'', arguments)). There are many ways to ask R to create the not-yet-evaluated object, that's one of the confusing things (in my opinion). Most obviously, call() function can be used for that. But functions like quote() and expression() can also create not-yet-evaluated code. Everybody has experienced an error like this: \bgroup\inputencoding{latin9} \begin{lstlisting} > x1 + x2 Error: object 'x1' not found \end{lstlisting} \leavevmode\egroup Whenever we use a variable, the R interpreter expects to find it. We know those variables are in the data frame dat, but the interpreter does not know that's what we mean. But we can ask R to trust us with the quote() function, which, basically, means ``here's some code we will evaluate later'': \bgroup\inputencoding{latin9} \begin{lstlisting} > mycall <- quote(x1 + x2) \end{lstlisting} \leavevmode\egroup The object ``mycall'' is an R call object, something that might be evaluated in the future. When we want to evaluate ``x1 + x2'', we ask R do to so: \bgroup\inputencoding{latin9} \begin{lstlisting} eval(mycall, dat) \end{lstlisting} \leavevmode\egroup The second argument supplies the place where x1 and x2 are to be found. R users may never have noticed the eval() function, but most will have used the with() function, which is simply a clever re-phrasing of the eval function's interface. It does the same thing:\bgroup\inputencoding{latin9} \begin{lstlisting} with(dat, x1 + x2) \end{lstlisting} \leavevmode\egroup and so does \bgroup\inputencoding{latin9} \begin{lstlisting} with(dat, eval(mycall)). \end{lstlisting} \leavevmode\egroup Now, back to the regression example we were working on before. We can create the call object using the call() function, which requires the name of a function and a comma-separated list of quoted arguments. I quote each of the bits of syntax in the arguments because I don't want R to replace ``dat'' with the actual data frame, I only want it to remember the name. <<>>= m3 <- lm(y ~ x1*x2, data = dat) coef(m3) mycall <- call("lm", quote(y ~ x1*x2), data = quote(dat)) m4 <- eval(mycall) coef(m4) @ The main reason for using eval is that we can ``piece together'' commands and then run them after we have assembled all the pieces. The nickname for this seems to be ``computing on the language.'' There is a discussion of it in the R Language Definition. It is also surveyed in Hadley Wickham's new book, \emph{Advanced R} \citeyearpar{Wickham2015}. Did you ever notice that you get an error if you run ``x1 + x2'', but if you run ``y $\sim$ x1 + x2'', then there is no error? It is as if R is wrapping your code in protective tissue, something like quote(). However, the protective wrapper in this case is actually the function called formula(). The R interpreter notices the symbol $\sim$ in the syntax, and so it assumes you meant to run \bgroup\inputencoding{latin9} \begin{lstlisting} formula(y ~ x1 + x2) \end{lstlisting} \leavevmode\egroup A formula is another kind of not-yet-evaluated thing. It is something that we can fiddle with before evaluating. Create a complicated formula. Note that the lm function receives that formula object without trouble. <<>>= f1 <- y ~ x1 + x2 + x3 + log(x4) class(f1) m5 <- lm(f1, data = dat) coef(m5) @ The object f1 is a formula object. Its not just a text string. Observe it has separate pieces, just like newFmla in the example problem that started this section. <<>>= f1[[1]] f1[[2]] f1[[3]] f1[[3]][[1]] f1[[3]][[2]] f1[[3]][[3]] @\\ Note that f1 created in this way must be a syntactically valid R formula; it cannot include any other regression options. \bgroup\inputencoding{latin9} \begin{lstlisting} > f1 <- y ~ x1 + x2 + x3 + log(x4), data=dat Error: unexpected ',' in "f1 <- y ~ x1 + x2 + x3 + log(x4)," \end{lstlisting} \leavevmode\egroup If I declare f1exp as an expression, then R does not re-interpret it as a formula (f1exp is an unevaluated expression, the R parser has not translated it yet). To use that as a formula in the regression, we have to evaluate it. <<>>= f1exp <- expression(y ~ x1 + x2 + x3 + log(x4)) class(f1exp) m6 <- lm(eval(f1exp), data = dat) @ When f1exp is evaluated, what do we have? Here's the answer. <<>>= f1expeval <- eval(f1exp) class(f1expeval) all.equal(f1expeval, f1) m7 <- lm(f1expeval, data=dat) all.equal(coef(m6), coef(m7)) @ The point here is that the pieces of an ordinary use command can be separated and put back together again before the work of doing calculations begins. We can edit the formula. Suppose we replace a part: <<>>= f1[[3]][[2]] <- quote(x1 + log(x2)) m8 <- lm(f1, data = dat) coef(m8) @ Now we turn back to the main theme. How is eval() used in functions? Some functions take a lot of arguments. We may want to do a lot of fine tuning on the arguments, before they are assembled and evaluated. Hence, it is very important that we can keep some bits of R that are symbolic, rather than calculated, until we need them. If you are looking for examples of eval() in the R source code, there is a nice one in the beginning of the lm() function. Suppose a user submits a command like ``lm(y \textasciitilde{} x, data = dat, x = TRUE, y = TRUE).'' Inside lm(), the import is re-organized. Here are the first lines of the lm() function \bgroup\inputencoding{latin9} \begin{lstlisting}[breaklines=true,numbers=left,numberstyle={\footnotesize},tabsize=4] lm <- function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...) { ret.x <- x ret.y <- y cl <- match.call() mf <- match.call(expand.dots = FALSE) m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"), names(mf), 0L) mf <- mf[c(1L, m)] mf$drop.unused.levels <- TRUE mf[[1L]] <- as.name("model.frame") mf <- eval(mf, parent.frame()) \end{lstlisting} \leavevmode\egroup Now suppose we turn on the debugger and run a regression in R with a command like this. \bgroup\inputencoding{latin9} \begin{lstlisting} debug(lm) m1 <- lm(y ~ x1 * x2, data = dat, x = TRUE, y = TRUE) \end{lstlisting} \leavevmode\egroup After that, we are ``in'' the function, stepping through line-by-line. In line 8, the match.call() function is used to grab a copy of the command that I typed. We can see that mf is exactly the same as my command, except R has named the arguments: \bgroup\inputencoding{latin9} \begin{lstlisting} > mf lm(formula = y ~ x1 * x2, data = dat, x = TRUE, y = TRUE) \end{lstlisting} \leavevmode\egroup That's not just a string of letters, however. It is a call object, a list with individual pieces that can be revised. Recall that the first element in a call object is the name of a function, and the following elements are the arguments. Lines 10 and 11 check the names of mf for the presence of certain arguments, and throw away the rest. It only wants the arguments we would be needed to run the function model.frame. Line 12 adds an argument to the list, drop.unused.levels. Up to that point, then, we can look at the individual pieces of mf: \bgroup\inputencoding{latin9} \begin{lstlisting} > names(mf) [1] "" "formula" "data" [4] "drop.unused.levels" > mf[[1]] lm > mf[[2]] y ~ x1 * x2 > mf[[3]] dat > mf[[4]] [1] TRUE \end{lstlisting} \leavevmode\egroup The object mf has separate pieces that can be revised and then evaluated. Line 13 replaces the element 1 in mf with the symbol ``model.frame''. That's the function that will be called. Line 14 is the coup de grâce, when the revised call ``mf'' is sent to eval. In the end, it is \emph{as if} lm had directly submitted the command \bgroup\inputencoding{latin9} \begin{lstlisting} mf <- model.frame(y ~ x1 * x2, data = dat, drop.unused.levels = TRUE) \end{lstlisting} \leavevmode\egroup It would not do to simply write that into the lm function, however, because some people use variables that have names different from y, x1, and x2, and their data objects may not be called dat. lm allows users to input whatever they want for a formula and data, and then lm takes what it needs to build a model frame. \subsection{substitute()} Most R users I know have not used substitute, except possibly if they try to use plotmath. In the context of plotmath, the problem is as follows. Plotmath causes the R plot functions to convert expressions into mathematical symbols in a way that is reminiscent of \LaTeX . For example, a command like this: \bgroup\inputencoding{latin9} \begin{lstlisting} text(4, 4, expression(gamma)) \end{lstlisting} \leavevmode\egroup will draw the gamma symbol at the position (4,4). We can use paste to combine symbolic commands and text like so: \bgroup\inputencoding{latin9} \begin{lstlisting} text(4, 4, expression(paste(gamma, " = 7"))) \end{lstlisting} \leavevmode\egroup The number 7 is a nice number, but what if we want to calculate something and insert it into the expression? Your first guess might be to insert a function that makes a calculation, such as the mean, but this fails: \bgroup\inputencoding{latin9} \begin{lstlisting} text(4, 4, expression(paste(gamma, mean(x)))) \end{lstlisting} \leavevmode\egroup In order to smuggle the result of a calculation into an expression, some fancy footwork is required. In the help page for plotmath, examples using the functions bquote and substitute are offered. I have found this to be quite frustrating and difficult to do, and if one examines my collection of examples on \url{https://pj.freefaculty.org/R/WorkingExamples/}, one will find quite a few plotmath exercises. For the particular purpose of blending expressions with calculation results, I find the bquote function to be more easily understandable. However, when going beyond plotmath, I expect most R programmers will need to use substitute() instead, so I will discuss that in this section. The plotmath help page points to syntax like this: <>= plot(1:10, seq(1,5, length.out=10), type = "n", main="Illustrating Substitute with plotmath", xlab="x", ylab="y") text(5, 4, substitute(gamma + x1mean, list(x1mean = mean(dat$x1)))) text(5, 2, expression(paste(gamma, " plus the mean of x1"))) @ Run ?substitute and one is brought to a famous piece of Rchaeological pottery: \begin{quote} ‘substitute’ returns the parse tree for the (unevaluated) expression ‘expr’, substituting any variables bound in ‘env’. \end{quote} Pardon me. parse tree? We've seen expressions already, that part is not so off putting. But ``parse tree''? Really? This is one of those points at which being an Rchaeologist has real benefits. We have to dig deeper, to try to understand not only what the R programmer says, but what she is actually trying to do. The manual page gives us some insights into the R programmer, and it is his or her view of his or her own actions, but it doesn't necessarily speak to how we should understand substitute(). For me, the only workable approach is to build up a sequence of increasingly complicated examples. I start by creating the list of replacements. This replacement list can have a format like this: <<>>= sublist <- list(x1 = "alphabet", x2 = "zoology") @ Suppose I have some other object in which x1 and x2 need to be replaced. Specifically, when an expression has ``x1'', I want ``alphabet'', and ``x2'' should become ``zoology.'' The quotes in the R code indicate that alphabet and zoology are character strings, not other objects that already exist. Consider replacing x1 and x2 in x1 + x2 + log(x1) + x3: <<>>= substitute(expression(x1 + x2 + log(x1) + x3), sublist) @ Note that the substitution 1) leaves other variables alone (since they are not named in sublist) and 2) it finds all valid uses of the symbols x1 and x2 and replaces them. This isn't quite what I wanted, however, because the strings have been inserted into the middle of my expression. I don't want text. I just want symbols. I've seen the conversion from character to symbol done with as.symbol() and with as.name(), recently I realized that they are synonymous. I usually use as.symbol() because that name has more intuition for me, but in Gabor's answer to my question, as.name() is used. Using as.name(), my example would become: <<>>= sublist <- list(x1 = as.name("alphabet"), x2 = as.name("zoology")) substitute(expression(x1 + x2 + log(x1) + x3), sublist) @ \subsection{setNames and names} Almost every R user has noticed that the elements of R lists can have names. In a data frame, the names of the list elements are thought of as variable names, or column names. If dat is a data frame, the names and colnames functions return the same thing, but that's not true for other types of objects. <<>>= dat <- data.frame(x1=1:10, x2=10:1, x3=rep(1:5,2), x4=gl(2,5)) colnames(dat) names(dat) @ After dat is created, we can change the names inside it with a very similar approach: <<>>= newnames <- c("whatever","sounds","good","tome") colnames(dat) <- newnames colnames(dat) @ While used interactively, this is convenient, but it is a bit tedious because we have to create dat first, and then set the names. The setNames() function allows us to do this in one shot. I'll paste the data frame creating commands and the name vector in for a first try: <<>>= dat2 <- setNames(data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10), x4 = gl(2,5)), c("good", "names", "tough", "find")) head(dat2, 2) @ In order to make this more generally useful, the first step is to take the data-frame-creating code and set it into an expression that is not immediately evaluated (that's datcommand in what follows). When I want the data frame to be created, I use eval(), and then the newnames vector is put to use. <<>>= newnames <- c("iVar", "uVar", "heVar", "sheVar") datcommand <- expression(data.frame(x1=1:10, x2=10:1, x3=rep(1:5,2), x4=gl(2,5))) eval(datcommand) dat3 <- setNames(eval(datcommand), newnames) @ The whole point of this exercise is that we can write code that creates the names, and creates the data frame, and then they all come together. What if we have just one element in a list? In Gabor's answer to my question, there is this idiom \bgroup\inputencoding{latin9} \begin{lstlisting} setNames(list(as.name("x1c")), "x1"))) \end{lstlisting} \leavevmode\egroup Consider this from the inside out. \begin{enumerate} \item as.name(``x1c'') is an R symbol object, \item list(as.name(``x1c'')) is a list with just one object, which is that symbol object. \item Use setNames(). The object has no name! We would like to name it ``x1''. \end{enumerate} It is as if we had run the command list(x1 = x1c). The big difference, of course, is that this way is much more flexible because we can calculate replacements. \subsection{The Big Finish} In the meanCenter() function in rockchalk, some predictors are mean-centered and their names are revised. A variable named ``age'' becomes ``agec'' or ``x1'' becomes ``x1c''. So the user's regression formula that uses variables age or x1 must be revised. This is a function that takes a formula ``fmla'' and replaces a symbol xname with newname. \bgroup\inputencoding{latin9} \begin{lstlisting} formulaReplace <- function(fmla, xname, newname){ do.call("substitute", list(fmla, setNames(list(as.name(newname)), xname))) } \end{lstlisting} \leavevmode\egroup This is put to use in meanCenter(). Suppose a vector of variable names called nc (stands for ``needs centering'') has already been calculated. The function std() creates a centered variable. \bgroup\inputencoding{latin9} \begin{lstlisting} newFmla <- mc$formula for (i in seq_along(nc)){ icenter <- std(stddat[, nc[i]]) newname <- paste(as.character(nc[i]), "c", sep = "") newFmla <- formulaReplace(newFmla, as.character(nc[i]), newname) nc[i] <- newname } \end{lstlisting} \leavevmode\egroup If one has a copy of rockchalk 1.6 or newer, the evidence of the success of this approach should be evident in the output of the command example(meanCenter). \section{Make Re-Usable Tools (Rather than Cutting and Pasting)} We seek concise solutions that are generalizable. It is almost never right to cut and paste and then make minor edits in each copy to achieve particular purposes. This advice goes against the grain of the graduate students that I work with. They will almost always use cut and paste solutions. Here is an example of the problem. A person needed ``dummy variables''. The code had several pages like this: \bgroup\inputencoding{latin9} \begin{lstlisting} if(setcorr == 1){ corr.10 <-1 corr.20 <-0 corr.30 <-0 corr.40 <-0 corr.50 <-0 corr.60 <-0 corr.70 <-0 corr.80 <-0 } if(setcorr == 2){ corr.10 <-1 corr.20 <-1 corr.30 <-0 corr.40 <-0 corr.50 <-0 corr.60 <-0 corr.70 <-0 corr.80 <-0 } if(setcorr == 3){ corr.10 <-1 corr.20 <-1 corr.30 <-1 corr.40 <-0 corr.50 <-0 corr.60 <-0 corr.70 <-0 corr.80 <-0 } \end{lstlisting} \leavevmode\egroup Well, that's understandable, but a little bit embarrassing. I asked ``why do you declare these separate variables, why not make a vector?'' and ``can't you see a more succinct way to declare those things?'' The answer is ``we do it that way in SAS'' or ``this runs''. One might suspect that the author was eager to say, ``my project is based on 10,000 lines of R code'' in order to impress a lay audience. If the mission is to create a vector with a certain number of 1's at the beginning, surely either of these functions would be better: <<>>= biVec1 <- function(n = 3, n1s = 1) { c(rep(1, n1s), rep(0, n - n1s)) } biVec2 <- function(n = 3, n1s = 1){ x <- numeric(length = n) x[1:n1s] <- 1 x } (corr <- biVec1(n = 8, n1s = 3)) (corr <- biVec2(n = 8, n1s = 3)) @ The cut and paste way is not wrong, exactly. Its tedious. I think any reasonable user should want a vector, rather than the separate variables. But suppose the user is determined, and really wants the individual variables named corr.10, corr.20, and so forth. We can re-design this so that those variables will be sitting out in the workspace after the function is run. Read help(``assign'') and try <<>>= biVec3 <- function(n = 3, n1s = 1) { X <- c(rep(1, n1s), rep(0, n - n1s)) xnam <- paste("corr.", 1:8, "0", sep = "") for(i in 1:n) assign(xnam[i], X[i], envir = .GlobalEnv) } ls() biVec3(n = 8, n1s = 3) ls() @ I don't think most people will actually want that kind of result, but if they do, there is a way to get it. If we want to turn biVec1() or biVec2() into general purpose functions, we need to start thinking about the problem of unexpected user input. Notice what happens if we ask for more 1's than the result vector is supposed to contain: \bgroup\inputencoding{latin9} \begin{lstlisting} > biVec1(n = 3, n1s = 7) Error in rep(0, n - n1s) (from #2) : invalid 'times' argument > biVec2(n = 3, n1s = 7) [1] 1 1 1 1 1 1 1 \end{lstlisting} \leavevmode\egroup Perhaps I prefer biVec1() because it throws an error when there is unreasonable input, while biVec2() returns a longer vector than expected with no error. If the user mistakenly enters non-integers, neither function generates an error, and they give us unexpected output. <<>>= biVec1(3.3, 1.8) biVec2(3.3, 1.8) @ To my surprise, it is not an easy thing to ask a number if it is a whole number (see help(``is.integer'')). For that, we introduce a new function, is.wholenumber(), and put it to use in our new and improved function <<>>= is.wholenumber <- function(x, tol = .Machine$double.eps^0.5){ abs(x - round(x)) < tol } biVec1 <- function(n = 3, n1s = 1) { if(!(is.wholenumber(n) & is.wholenumber(n1s))) stop("Both n and n1s must be whole numbers (integers)") if(n1s > n) stop("n must be greater than or equal to n1s") c(rep(1, n1s), rep(0, n - n1s)) } @ In all of the test cases I've tried, that works, although the real numbers 7.0 and 4.0 are able to masquerade as integers. \bgroup\inputencoding{latin9} \begin{lstlisting} > biVec1(3, 7) Error in biVec1(3, 7) (from #4) : n must be greater than or equal to n1s > biVec1(7, 3) [1] 1 1 1 0 0 0 0 > biVec1(3.3, 4.4) Error in biVec1(3.3, 4.4) (from #2) : Both n and n1s must be whole numbers (integers) > biVec1(7.0, 4.0) [1] 1 1 1 1 0 0 0 > biVec1(7.0, 4.0) [1] 1 1 1 1 0 0 0 \end{lstlisting} \leavevmode\egroup \section{Function Arguments.} While developing functions for the rockchalk package, one of the most important learning experiences I've had is in argument management. One of the reasons that it is difficult to design functions is that there is no ``mandatory manual'' on this. The R framework allows programmer creativity, which is a good thing. But it is also a bad thing, since every programmer has his/her own opinion about what ought to be done. To state the obvious, argument names should be clear, unambiguous, convenient, short, and minimally necessary. Beyond that, what can we say? I urge new R programmers to study the R source code. This is part of the Rchaeological method. Study the most frequently used functions in the R distribution. I've learned the most from the way arguments are handled in the functions \code{hist}(), \code{lm()}, \code{plot()} and \code{termplot()}. These functions have had the most eyes on them and, therefore, they are most likely to use the best approach. I wrote several functions that eventually went into rockchalk before I realized that important fact and I regret it (on a weekly basis). In this section, I offer this advice. \begin{enumerate} \item Protect your function's calculations from the user's workspace. \item Check argument values. \item Design the function so that it runs with a minimum number of arguments. \end{enumerate} \subsection{A few tidits about arguments} R includes several functions that are very helpful in dealing with arguments. \begin{description} \item [{missing()}] Inside a function, such as \code{plotme()} we often need to check if the user supplied a value for an argument, such as \code{col}. If the user did not supply a value, we might construct one from the other input \bgroup\inputencoding{latin9} \begin{lstlisting} if (missing(col)) col <- someTediousCalculation(y, z) \end{lstlisting} \leavevmode\egroup This is usually good enough, but there is danger that the user might have called the function like so: \bgroup\inputencoding{latin9} \begin{lstlisting} plotme(x, y, z, col = NULL) \end{lstlisting} \leavevmode\egroup This outsmarts missing(col), which will return FALSE. If our goal is to make sure col is something, some non-NULL thing, a more careful argument check would be \bgroup\inputencoding{latin9} \begin{lstlisting} if (missing(col) | is.null(col)) col <- someTediousCalculation(y, z) \end{lstlisting} \leavevmode\egroup \item [{stop()}] Halts a function and returns a message to the user. Usage example \bgroup\inputencoding{latin9} \begin{lstlisting} if (isTRUE(x)) stop(paste("I'm sorry, you failed to supply the right information."))) \end{lstlisting} \leavevmode\egroup \item [{stopifnot()}] It is so frequent to run ``\code{if(isTRUE(x)) stop()}'' that they created a wrapper, \code{stopifnot(x)}. \item [{warning()}] If you don't want the function to fail, but you need to warn the user about a condition that seems dubious, run \bgroup\inputencoding{latin9} \begin{lstlisting} if (isTRUE(x)) warning(paste("Are you sure the value of x ought to be", x, "?")) \end{lstlisting} \leavevmode\egroup \item [{deparse(substitute())}] Inside a function, it is often important to find out the symbolic value of the argument that the user supplied to the function. I don't mean the \emph{value}, but rather the symbolic name. To see what I mean, run this <>= dat <- data.frame(myx = c(1, 2, 3), myy = c(10, 5, 1)) debug(plot.default) plot(dat$myx, dat$myy) @ After you hit enter, the R debugger/browser will come open and the prompt will allow you to inspect the situation. First, type ``n'' and hit return. That takes the first step into the calculations. After that, run \code{match.call()}. The function is somewhat ``self aware''. It knows how it was called. You should see this \bgroup\inputencoding{latin9} \begin{lstlisting} Browse[2]> n debug: localAxiis <- function(..., col, bg, pch, cex, lty, lwd) Axis(...) Browse[2]> match.call() plot.default(x = dat$myx, y = dat$myy) \end{lstlisting} \leavevmode\egroup You asked for \code{plot()}, R sends the work to \code{plot.default()}. All is well. Still inside the debugger now, type ``x'' and then ``myx'': \bgroup\inputencoding{latin9} \begin{lstlisting} Browse[2]> x [1] 1 2 3 Browse[2]> myx Error: object 'myx' not found \end{lstlisting} \leavevmode\egroup Inside the function, the vector value that we passed in, which was referred to as dat\$myx, is now called x. The thing x is a data holder, it is used in the following calculations. \code{x} does not remember that its passed-in value was named \code{dat\$myx} in a previous life. It is as if \code{x} is the new identity of the same old data vector. However, we can ask \code{x} ``what name were you known by in your previous life?'' with this interesting idiom. \bgroup\inputencoding{latin9} \begin{lstlisting} Browse[2]> deparse(substitute(x)) [1] "dat$myx" \end{lstlisting} \leavevmode\egroup The substitute() function, as we have already seen, is a way to replace a letter x with a value from somewhere. Maybe it would be more obvious to run this in 2 steps. \bgroup\inputencoding{latin9} \begin{lstlisting} Browse[2]> xsubs <- substitute(x) Browse[2]> xsubs dat$myx Browse[2]> deparse(xsubs) [1] "dat$myx" \end{lstlisting} \leavevmode\egroup \code{substitute()} grabs the object name from whence \code{x} was calculated. We need to convert that back to a character string so that we can know the symbolic name of the thing that eventually became \code{x}. This is at the outer boundary of my R comprehension, so I better stop here. Incidentally, you can go exploring in the debugger. Run some standards like \code{ls()} to see what's available. If you keep hitting return, the function will run line by line. If you get tired of hitting return, type ``c'' and the function will run to its conclusion. To let R know that you don't want to debug that function any more, run: \bgroup\inputencoding{latin9} \begin{lstlisting} > undebug(plot.default) \end{lstlisting} \leavevmode\egroup If you don't do that, then the function browser will pop open every time you try to run plot. \item [{list(...)}] The three periods, ``...'', are a valid word. (A period is a valid character). R scans user syntax for function arguments and it uses whatever it can understand (formal named arguments, positional matching), and then the other arguments are thrown into a thing called ``...''. In a simple case, we could just pass those additional arguments to another function, as in \bgroup\inputencoding{latin9} \begin{lstlisting} plotme <- function(x, y, ...){ plot(x, y, ...) } \end{lstlisting} \leavevmode\egroup Often, it is not sufficient to simply pass on the ``...'' variables, and we instead need to scan them and sort them for various purposes. Inside the function, we access those variables by turning them into a list: \begin{Sinput} dots <- list(...) \end{Sinput} We can still pass on the dots to other functions, but the usual problem is that some of the arguments in dots may be intended for several different functions. Suppose we are writing a plotter for categorical variables. We might need to receive some arguments, then run \code{table()}, then run \code{layout()}, then run \code{mosaicplot()}. The user may have put in arguments intended for those separate functions. There's a chance of an error, or at least a warning, if irrelevant arguments are passed to those arguments. The rockchalk functions \code{plotSlopes()} has examples that do this work. They separate the dot arguments that are needed for the \code{predict()} function and the \code{plot()} function. Because I was learning about arguments when I wrote those functions, I did not take the cleanest or most direct route to do this, but they do stand as working examples that users can review. A function may set a number of defaults that we want to use when the users do not supply a lot of details. In a case like that, it is useful to remember a very handy function in R called \code{modifyList()}. The basic idea is that we can build a list of default arguments for a function, and then we can scan the \code{dots} list and it will replace the defaults in the first with changes in the second. There's an example in rockchalk's \code{pctable()} function. \begin{Sinput} dots <- list(...) dotnames <- names(dots) tableargs <- list(rv, cv, dnn = c(rvlab, cvlab)) newargs <- modifyList(tableargs, dots, keep.null = TRUE) t1 <- do.call("table", newargs) \end{Sinput}The dots list, by definition, is named arguments, and it is often handy to have a copy of their names for checking whether an argument is included in the list. In the plotSlopes() function, here's what I ended up with: \begin{Sinput} validForPredict <- c("se.fit", "dispersion", "terms", "na.action", "level", "pred.var", "weights") dotsForPredict <- dotnames[dotnames %in% validForPredict] if (length(dotsForPredict) > 0) { parms <- modifyList(parms, dotargs[dotsForPredict]) dotargs[[dotsForPredict]] <- NULL } np <- do.call("predictCI", parms) \end{Sinput}This scans dotnames for elements in the list of arguments that are valid for the predict method, and after they are put onto the parms list, then those arguments are set to NULL in the dotargs collection. Looking at that now, I see it would have been more clear to use \code{setdiff()} on dotargs to clean that up. The central challenge in that work is to find out the names of the arguments that are intended for one function, then remove them from \code{dots} so that they are not passed to a subsequent function that needs them. How can one know what arguments are needed? This requires some experience, but the first step is to read the interface documentation of the function being considered. A shortcut is to run the function \code{formals()}, which prints out the names of the function being considered. This can sometimes be deceptive for implementations of generic methods. Observe that the return from the generic is not so informative as the return from the method: <<>>= formals(plot) formals(plot.default) @ If a method is not exported, if it does not show as ``visible'' in the output from \code{methods(plot)}, then a little more fancy footwork is required to see all of the legal arguments. <<>>= formals(graphics:::plot.histogram) @ \end{description} \subsection{Protect your function's calculations from the user's workspace.} In the early days of using R, around 2000, it was possible for a user to obliterate base functions. I recall working on a model about Democratic and Republican voters. I created a variable named \code{rep} and, after that, the \code{lm()} and \code{plot()} functions would not work. I wrote to r-help about this and learned that my \code{rep} had obliterated the built-in function \code{rep()}, which is vital in many calculations. Shortly after that, R was improved to protect functions from obliteration in that way. Today I tried to cause mischief by re-defining rep in a number of ways and it appears to me the problem is completely solved. There has also been a problem that users sometime use the variable names \code{T} and \code{F}. Recall that R allows \code{T} as an abbreviation for \code{TRUE} and \code{F} for \code{FALSE}. R will allow the user to re-define T and F in ways that will break calculations. However, R will not allow the user to re-define TRUE or FALSE. Observe \bgroup\inputencoding{latin9} \begin{lstlisting} > TRUE <- 2342 Error in TRUE <- 2342 : invalid (do_set) left-hand side to assignment > FALSE <- function(x,y){x + y} Error in FALSE <- function(x, y) { : invalid (do_set) left-hand side to assignment \end{lstlisting} \leavevmode\egroup In order to protect our function's calculations, we are urged to fully write out the symbols TRUE and FALSE inside functions. This way, users cannot cause damage. Another danger we need to be concerned about is that our functions might have undefined variables in them. When R encounters these undefined variables, the lexical scoping process looks outward for values. It may find something we don't expect. Another danger is that a function seems to work right when we practice it in our session, but another user gets an error because something in our workspace is missing from theirs. The lexical scope can be convenient, but it is also dangerous. Here the example function relies on variables \code{y} and \code{z} that are not passed in as arguments. \bgroup\inputencoding{latin9} \begin{lstlisting} x <- c(1, 2, 3, 4) y <- c(8, 9, 10, 11) z <- c(1, 1, 2, 2) plotme <- function(x){ plot(x, y, col = z) } plotme(x) \end{lstlisting} \leavevmode\egroup That lets the function go and find \code{y} and \code{z}. This will run, but it is pretty risky. It depends on R's willingness to reach outside the boundaries of the function to find \code{y}, and \code{z}. If I have inadvertently altered \code{y} in the workspace, this will end badly. If I'm writing a function for somebody else to use, as in a package, I believe the function should to NOT rely on variables from the environment. I want to prevent R from filling in the gaps for users. We'd rather give the user an error ``variable not found'' than an erroneous calculation that ``seems'' adequate. The best case scenario is that \code{y} and \code{z} do not exist in the workspace, so the user will receive an error message: \begin{Sinput} > plotme(x) \end{Sinput} \begin{Schunk} \begin{Soutput} Error in xy.coords(x, y, xlabel, ylabel, log) : object 'y' not found \end{Soutput} \end{Schunk} The worst case scenario is that the user has some other variables \code{y} and z\code{} sitting about in the workspace. Those ``found'' variables are put to use with a completely unintentional result. As a first step toward fixing this, I propose: \begin{quote} Every piece of input upon which we rely should be named in the list of arguments. \end{quote} If we define the function as follows, \bgroup\inputencoding{latin9} \begin{lstlisting} plotme <- function(x, y, z){ plot(x, y, col = z) } \end{lstlisting} \leavevmode\egroup the user's use of \code{plotme(x)} will result in the error \begin{Schunk} \begin{Soutput} Error in xy.coords(x, y, xlabel, ylabel, log) : argument "y" is missing, with no default \end{Soutput} \end{Schunk}Naming all of the arguments effectively protects the function against the accidental importation of data. There's a function called \code{checkUsage()} in the \code{codetools} package that can help check functions for reliance on global variables. First, we'll make sure the workspace is clean by removing \code{x}, \code{y}, and \code{z}. \begin{Schunk} \begin{Soutput} > rm(x, y, z) > library(codetools) > checkUsage(plotme) : no visible binding for global variable 'y' (:2) : no visible binding for global variable 'z' (:2) \end{Soutput} \end{Schunk} Note that, if the variables \code{x}, \code{y}, an \code{z} exist in the user workspace, they are global and \code{checkUsage} will not offer a warning about them. Hence, it is very important to run \code{checkUsage()} after cleaning up the workspace. checkUsage() will also offer warnings about unused variables. This can be helpful in cleaning up a function. \subsection{Check argument values.} We want users to provide input arguments for all information that the calculations will rely upon. The next step is to check their inputs to make sure they are reasonable. In some cases this is very easy. If we want the user to supply a data.frame, we can stop when the user provides something else. <<>>= plotme <- function(x, y, data, ...){ if(missing(data) | !is.data.frame(data)) stop(paste0("plotme: the object you supplied as data: '", deparse(substitute(data)) , "' is not a data.frame")) plot(as.formula(paste(y, "~", x, collapse=" ")), data = data, ...) } myx <- rnorm(10) myy <- rnorm(10) dat <- data.frame(myx, myy) @ This works properly <>= plotme("myx", "myy", data = dat) @ However, if the user supplies something else where data ought to be, an error is returned. \begin{Sinput} plotme("myx", "myy", rnorm(10)) \end{Sinput} \begin{Soutput} Error in plotme("myx", "myy", rnorm(10)) : plotme: the object you supplied as data: 'rnorm(10)' is not a data.frame \end{Soutput} The main point here is that if we definitely know what we expect the user to supply, then we can add a filter that checks the input's type. That's what is.character, is.vector, is.integer, and so forth are intended for. (In an R session, type ``is'' and then hit the TAB key a couple of times. You'll see all the completions. The problem is not so easy to manage when we can't be sure what the argument type must be. Sometimes, it appears the best we can do is pass through what the users give us and hope R can make sense out of it. I've wrestled with R's flexibility for the naming of colors. Suppose our plotme() function displays x and y, and we allow the user to specify colors for the points. <<>>= plotme <- function(x, y, z){ plot(x, y, col = z) } myx <- rnorm(10) myy <- rnorm(10) @ R colors can be supplied in many formats, as integers, character strings, or more elaborate HTML-style color designators. Any of these will work. <>= mycol <- 1:10 plotme(myx, myy, z = mycol) @ <>= mycol <- rainbow(10) plotme(myx, myy, z = mycol) @ <>= mycol <- gray.colors(10) plotme(myx, myy, z = mycol) @ R does not throw an error if the user runs this, but the result is not what we wanted, probably. <>= mycol <- rnorm(10) plotme(myx, myy, z = mycol) @ Suppose the user's code ends up supplying a NULL value for \code{z}, as in \code{plotme(x, y, z = NULL)}. They would not do that on purpose, but it might be an accident. If \code{z} is calculated in their code and something has gone wrong. To protect against that, we find this idiom is used in the R source code quite often: <>= plotme <- function(x, y, z){ if (missing(z) | is.null(z)) z <- rep(2, length(x)) plot(x, y, col = z) } @ If \code{z} is either not supplied by the user's command, or if the user's command explicitly sets it to NULL, then we'll fabricate \code{z}. \subsection{Design the function so that it runs with a minimum number of arguments.} Take a look at the most commonly used functions in R, such as lm() or plot.default(). These functions allow a great many arguments, but they will deliver what the users minimally need if the they run <>= m1 <- lm(y ~ x, data = dat) plot(y ~ x, data = dat) @ Put the arguments that users will use most often at the front of the argument list, and then design the function so that it will work even if all of the other arguments are not included. There are two elements in making this work. \subsubsection{Specify defaults} A graph can use any color scheme, but we might put in something colorful for users who like a little zest in life. <>= plotme <- function(x, y, z = rainbow(length(x))){ plot(x, y, col = z) } plotme(x = rnorm(100), y = rnorm(100)) @ We could as well have put the rainbow call inside the function if we want to keep a cleaner function declaration. <>= plotme <- function(x, y, z){ if(missing(z)) mycol <- rainbow(length(x)) plot(x, y, col = mycol) } plotme(x = rnorm(100), y = rnorm(100)) @ It seems to me that the R source itself tends to have more elaborate specifications of defaults within the declaration, rather than inside the function. In the R source code, I noticed a peculiar-looking example in which an argument's default depends on a function that is within the function itself. Observe that this runs fine: <>= plotme <- function(x, y, z = getAColor(length(x))) { getAColor <- function(n){ gray.colors(n) } plot(x, y, col = z, cex = 5) } plotme(rnorm(20), rnorm(20)) @ That was startling to me because the nested function, \code{getAColor}, does not exist outside the function, and yet it can be used in the definition of a default. I merely promise that \code{getAColor} will exist when the value of \code{z} is needed and all is well. \subsubsection{Extract or construct what else is needed from the user input} We can deduce information from the arguments that are passed in, we can eliminate the need for the user to pass in other arguments. We would never need to require both arguments x and xlab in the plot function, for example. xlab is optional because, if the user does not supply it, then deparse(substitute()) can find something to use as a label. The 2010 version of plotSlopes in rockchalk required the user to pass through many parameters that (I've since learned) can be deduced from the other arguments. A key element in this is getting an understanding of the information that comes in with a standard R object. Quite a few--almost all--functions that generate interesting objects in R provide, as a part of their output, a copy of the call that ran the function. Inside our functions, we can inspect that call argument to see what the users ran, and from that we can often deduce what we need without requiring users to supply a lot of information. \subsection{Return values versus attributes} If you review the R regression functions, like lm() or glm(), you notice that they build a list of things and pass it back to the user. That is the most obvious way to return information. There is another way, however, which is a little interesting. One can mark an attribute on a return object. The R usage of attributes comes to mind at this point because, when we are scanning for inputs, we may need to inspect not only their values, but also their attributes. The most commonly inspected attribute, of course, is the class of the argument. So, if we are writing a package and we need to make sure that the objects being passed around among functions are correct for the purpose, then the object attributes are a very useful tool. I don't want to write a lot about this so as to display my ignorance, but I would point user to \code{pctable()} as an example of a rockchalk function that uses both a return list and attributes to get the job done. Maybe I'll fill in this example at a later time, once I figure out the most direct way to explain it. The only problem I find with attributes is that the printing functions for R objects generally ``splat out'' the attributes when we don't want them. If one puts attributes on an object, then the default printer will show them. I've not learned all of the ins-and-outs of preventing that, but in 2015 I became aware of it and learned that one way to deal with it can be found in the R source code for the method \code{print.table()}. \section{Do This, Not That (Stub) } R novices sometimes use Google to search for R advice and they find it, good or bad. They may find their way to the r-help email list, where advice is generally good, or to the StackOverflow pages for R, which may be better. A lot of advice is offered by people like me, who may have good intentions, but are simply not qualified to offer advice. One of the few bits of advice that seems to grab widespread support is that ``for loops are bad.'' One can write an lapply statement in one line, while a for loop can take 3 lines. The code is shorter, but it won't necessarily run more quickly. I recall being jarred by this revelation in John Chambers's book, \emph{Software for Data Analysis}. The members of the apply family (apply, lapply, sapply, etc) can make for more readable code, but they aren't always faster. ``However, none of the apply mechanisms changes the number of times the supplied function is called, so serious improvements will be limited to iterating simple calculations many times. Otherwise, the n evaluations of the function can be expected to be the dominant fraction of the computation''\citep[213]{chambers_software_2008}. Todo: insert discussion of stackListItems-001. Insert alternative methods of measuring executation time and measuring performance Balance time spent optimizing code versus time spent running program. \section{Suggested Chores} I suggest the would be programmer should take on some basic challenges. \begin{enumerate} \item Try to create a generic function and several methods to handle classes of various types. If the term ``generic function'' and ``method'' cause disorientation, that means the reader is a user, not a programmer yet. \item Consider developing a routine (or package) for statistical estimation or presentation. Find several R packages and compare the R code in them. Let me know if you agree with me about these points. \end{enumerate} \begin{itemize} \item Good code is compartmentalized. Separate pieces of work are handled by separated functions and results are returned in a well organized way. \item Functions should not be HUGE. The components of a project should be small enough so that we can comprehend them. Generally speaking, if we are reading code and we come to a line that uses a variable that we cannot find on the screen, that's a problem. I used to correspond with a programmer at the University of Michigan who said he began to feel uncomfortable when a function filled up the entire terminal screen. \item People who copy and paste sections over and over in order to handle slightly different cases are causing trouble for themselves and others. They should think harder on ways to separate that work into re-usable functions. \end{itemize} \bibliographystyle{chicago} \bibliography{rockchalk} \end{document}