How to use this document

These notes contain many sample calculations. It is important to do these yourself—type them in at your keyboard and see what happens on your screen—to get the feel of working in R.

Exercises in the middle of a section should be done immediately when you get to them, and make sure you have them right before moving on.

For your convenience, the R codes for this document are provided in a script which you can download, edit, and run.

Many other similar introductions are scattered around the web; a partial list is in the “contributed documentation” section on the R web site (https://cran.r-project.org/other-docs.html). This particular version is limited (it has similar coverage to the standard Introduction to R manual and targets biologists who are neither programmers nor statisticians (yet).

Throughout this document, Windows-specific items are marked with a Ⓦ and *nix (linux/unix/MacOSX)-specific items with a Ⓧ.

What is R ?

R is a computing environment that combines

  • a programming language called S, developed by John Chambers at Bell Labs, that implements the idea of programming with data (Chambers 1998),
  • an extensive set of functions for classical and modern statistical data analysis and modeling,
  • powerful numerical analysis tools for linear algebra, differential equations, and stochastics,
  • graphics functions for visualizing data and model output,
  • a modular and extensible structure that supports a vast array of optional add-on packages, and
  • extensive help and documentation facilities.

R is an open source software project, available for free download (R Core Team 2014b). Originally a research project in statistical computing (Ihaka and Gentleman 1996), it is now managed by a development team that includes a number of well-regarded statisticians, and is widely used by statistical researchers and working scientists as a platform for making new methods available to users.

There are a number of graphical front-ends for R. At present, the best of these appears to be RStudio. Before learning about these, however, you should learn a little about R itself.

Getting started with R

Installing R on your computer

The main source for R is the Comprehensive R Archive Network (CRAN). You can get the source code and compile it yourself, but you may prefer at this point to download and install a precompiled version. You can download precompiled binaries for most major platforms from any CRAN mirror. To do so, go to https://cran.r-project.org/mirrors.html and find a mirror site that is geographically somewhat near you. Find the appropriate page for your operating system, and read and follow the installation instructions. Be sure to install the latest version.

R should work well on any reasonably recent computer.

Ⓦ Under Windows, R is installed by launching the downloaded file and following the on-screen instructions. At the end you’ll have an R icon on your desktop that can be used to launch the program.

Ⓧ Under Linux, binary versions of R are available as packages for the most common Linux distributions. It is also not difficult to install R from source.

Ⓧ Under Mac OS X, R is available as a binary package.

The standard distributions of R include several packages: user-contributed suites of add-on functions. These notes use some additional packages which you will have to install before using. Ⓦ In the Windows version additional packages can be installed easily from within R using the Packages menu. Under all platforms, you can use the commands install.packages() and update.packages() to install and update packages, respectively. Most packages are available pre-compiled for MacOS X and Windows; under Linux, R will download and compile the source code for you.

Starting R

Ⓧ Execute R on the command line.

Ⓦ Click on the icon on your desktop, or in the Start menu (if you allowed the Setup program to make either or both of these).

Stopping R

When entering, always look for the exit. —Lebanese proverb

Stop R by typing q() at the command prompt. [Note the (): if you type q by itself, you will get some confusing output which is actually R trying to tell you the definition of the q function; more on this later.]

Ⓦ  You can also stop R from the File menu.

When you quit, R will ask you if you want to save the workspace (that is, all of the variables you have defined in this session); in general, you should say “no” to avoid clutter and unintentional confusion of results from different sessions. Note: When you say “yes” to saving your workspace, it is saved in a hidden file named .RData. By default, when you open a new R session in the same directory, this workspace is loaded and a message informing you so is printed:

[Previously saved workspace restored]

Should an R command seem to be stuck or take longer than you’re willing to wait, click on the stop sign on the menu bar or hit the Esc key (Ⓦ), or Ctrl-c (Ⓧ).

Interactive calculations

Ⓦ When you start R, a console window is opened. The console has a few basic menus at the top; check them out on your own.

The console is where you enter commands for R to execute interactively, meaning that the command is executed and the result is displayed as soon as you hit the Enter key. For example, at the command prompt >, type in 2+2 and hit Enter; you will see

2+2 

output

[1] 4

The results from calculations can be stored in (assigned to) variables. For example:

a <- 2+2

R automatically creates the variable a and stores the result (4) in it, but by default doesn’t print anything. To ask R to print the value, just type the variable name by itself

a

output

[1] 4

The [1] at the beginning of the line is just R printing an index of element numbers; if you print a result that displays on multiple lines, R will put an index at the beginning of each line. print(a) also works to print the value of a variable. By default, a variable created this way is a vector (an ordered list), and it is numeric because we gave R a number rather than (e.g.) a character string like "pxqr"; in this case a is a numeric vector of length 1;

You could also type

a <- 2+2; a

using a semicolon to put two or more commands on a single line. Conversely, you can break lines anywhere that R can tell you haven’t finished your command and R will give you a “continuation” prompt (+) to let you know that it doesn’t think you’re finished yet: try typing

a <- 3*(4+
5)

to see what happens (this often happens e.g. if you forget to close parentheses or quotes). If you get stuck continuing a command you don’t want—e.g., you opened the wrong parentheses—just hit Ctrl-c (Ⓧ), the Esc key or the stop icon in the menu bar (Ⓦ) to get out.

You can assign values to variables in R using the <- operator. There are several alternative forms for assignment. Each of these three commands accomplishes the same thing; the first is the preferred form, however.

a <- 2+2
2+2 -> a
a = 2+2

Variable names in R must begin with a letter, followed by alphanumeric characters. You can break up long names with a period, as in long.variable.number.3, an underscore (very_very_long_variable_name), or by using camel case (quiteLongVariableName); you cannot use blank spaces in variable names. R is case sensitive: Abc and abc are different variables. Good practice is to make variable names long enough to be evocative but short enough to type easily. N.per.ha or pop.density are better than x and y (too generic) or available.nitrogen.per.hectare (too long).

Note that c, q, t, C, D, F, I, and T, are built-in R functions. Using these are variable names may cause confusion or actual errors.

R uses +, -, *, /, and ^ for addition, subtraction, multiplication, division and exponentiation, respectively. For example:

x <- 5
y <- 2
z1 <- x*y
z2 <- x/y
z3 <- x^y
z1; z2; z3

output

[1] 10

output

[1] 2.5

output

[1] 25

You can retrieve and edit previous commands. The up-arrow (\(\uparrow\) ) key or Ctrl-p recalls previous commands to the prompt. For example, you can bring back the second-to-last command and edit it to

z3 <- 2*x^y

Experiment with the \(\downarrow\), \(\rightarrow\), \(\leftarrow\), Home and End keys too (also Ctrl-n, Ctrl-b, Ctrl-f, Ctrl-a, Ctrl-e).

You can combine several operations in one calculation:

A <- 3
C <- (A+2*sqrt(A))/(A+5*sqrt(A)); C

output

[1] 0.5543706

Parentheses specify the order of operations. The command

C <- A+2*sqrt(A)/A+5*sqrt(A)

is not the same as the one above; rather, it is equivalent to

C <- A + 2*(sqrt(A)/A) + 5*sqrt(A)

The default order of operations is: (1) parentheses, (2) exponentiation, (3) multiplication and division, (4) addition and subtraction. Thus:

> b <- 12-4/2^3 gives 12 - 4/8 = 12 - 0.5 = 11.5
> b <- (12-4)/2^3 gives 8/8 = 1
> b <- -1^2 gives -(1^2) = -1
> b <- (-1)^2 gives 1

In complicated expressions it’s best to use parentheses to specify explicitly what you want, such as > b <- 12 - (4/(2^3)) or at least > b <- 12 - 4/(2^3); a few extra sets of parentheses never hurt anything.

R also has many built-in mathematical functions that operate on variables (see below).


Exercise

Using keyboard shortcuts wherever you can, use R to compute:

  1. \(\frac{2^7}{2^7 - 1}\)
  2. \(({1 - \frac{1}{2^7}})^{-1}\)
  3. \(1+0.2\)
  4. \(1+0.2+0.2^2/2\)
  5. \(1+0.2+0.2^2/2+0.2^3/6\)
  6. \(e^{0.2}\)

Remembering that R knows exp() but not Euler’s number, \(e\), how would you get R to tell you the value of \(e\)?

Compute the standard normal probability density, \(\frac{1}{\sqrt{2 \pi}} e^{-x^2/2}\), for values of \(x=1\) and \(x=2\). [R knows \(\pi\) as pi.] Check your answers against those given by the built-in function for the normal distribution, dnorm(x=c(1,2)).


The help system

R has a help system, although it is generally better for reminding you about syntax or details, and for giving cross-references, than for answering basic “how do I …?” questions.

  • You can get help on an R function named foo by entering
?foo

or

help(foo)

into the console window (e.g., try ?sin).

  • Ⓦ The Help menu on the tool bar provides links to other documentation, including the manuals and FAQs, and a search facility (Apropos on the menu) which is useful if you half remember part of the name of what it is you need help on.
  • Typing help.start() opens a web browser with help information.
  • example(cmd) will run any examples that are included in the help page for command cmd.
  • demo(topic) runs demonstration code on topic topic: type demo() by itself to list all available demos

By default, R ’s help system only provides information about functions that are in the base system and packages that you have already loaded (see below).

  • ??topic or help.search("topic") will list information related to topic available in the base system or in any extra installed packages: then use ?topic to see the information, perhaps using library(pkg) to load the appropriate package first.
  • help.search uses fuzzy matching—for example, help.search("log") finds more than 800 entries (on my particular system) including lots of functions with “plot”, which includes the letters “lot”, which are almost like “log”. If you can’t stand it, you can turn this behavior off with the incantation help.search("log",agrep=FALSE) (81 results which still include matches for “logistic”, “myelogenous”, “phylogeny”, ).
  • help(package="pkg") will list all the help pages for a loaded package.
  • example(fun) will run the examples (if any) given in the help for a particular function fun: e.g., example(log)
  • RSiteSearch("topic") does a full-text search of all the R documentation and the mailing list archives for information on topic (you need an active internet connection).
  • the sos package is a web-aware help function that searches all of the packages on CRAN; its findFn function tries to find and organize functions in any package on CRAN that match a search string (again, you need a network connection for this).

Try out one or more of these aspects of the help system.

Finally, here’s a commentary on the help system in R from Graham Lawrence:

I hate to say this, but what really helped me the most, after the initial feet-wetting, was to abandon the help manuals. Searching manuals for the answer to a specific question is frustrating because one does not know the key term for the search engine to deliver the item needed.

So I stopped being serious and production oriented and simply played with R for a couple of months. Open the Base package, scan the list of contents for titles that pique my curiosity, paste their examples into R and see what happens. And those examples use other functions and their documentation has more examples and so on and on; and pretty soon after, whoops, there’s another afternoon gone down the tubes. But all this apparent time wasting had a most happy result. I now find, much more often than not, that I know what I need to look up to answer a specific question, and that does wonders for my disposition.

I think of R as a vast jungle, criss-crossed with myriad game trails (the documentation to each function in the packages). And I can explore each trail and see what animal it leads to in its native habitat, by pasting the examples into R and examining the result. So when I see an interesting spoor or paw print, I take a stroll down that trail and see where it leads. Not the most efficient way of learning the language, no doubt, but a pleasant and interesting entertainment rather than a chore.


Exercise

Do an Apropos on sin to see what it does. Now enter the command

??sin

and see what that does. [Answer: help.search pulls up all help pages that include sin anywhere in their title or text. Apropos just looks at the name of the function.] Note that ??sin is equivalent to help.search("sin"). If you have a net connection, try RSiteSearch("sin") from the command line or the equivalent from the menu and see what happens.


Some of the built-in mathematical functions in R.
R command function
abs() absolute value, |x|
cos(), sin(), tan() cosine, sine, tangent
acos(), asin(), atan() arc-cosine, arc-sine, arc-tangent
exp(x) exponential function, \(e^x\)
log(x) natural (base-\(e\)) logarithm, \(\log{x}=\ln{x}\)
log10(x) base-10 logarithm, \(\log_{10}{x}\)
sqrt(x) square root, \(\sqrt{x}\)
atan2(y,x) \(\arctan(y/x)\)
sum() sum of the entries in a vector
diff() first differences
cumsum() cumulative sum
gamma(x) the Gamma function, \(\Gamma(x)\)

You can get a more complete list of functions from the help system: ?Arithmetic for arithmetic operations, ?log for logarithmic functions, ?sin for trigonometric functions, and ?Special for special functions.


A first interactive session

Descriptive statistics

Below are some data on the maximum growth rate \(r_{\text{max}}\) of laboratory populations of the green alga Chlorella vulgaris as a function of light intensity (\(\mu\mathrm{E}~\mathrm{m}^{-2}~\mathrm{s}^{-1}\)). These experiments were run during the system-design phase of the study reported by Fussman, Ellner, Sherzer, and Nelson G. Hairston (2000).

Light: 20, 20, 20, 20, 21, 24, 44, 60, 90, 94, 101
\(r_{\text{max}}\): 1.73, 1.65, 2.02, 1.89, 2.61, 1.36, 2.37, 2.08, 2.69, 2.32, 3.67

To analyze these data in R , first enter them as numerical vectors:

Light <- c(20,20,20,20,21,24,44,60,90,94,101)
rmax <- c(1.73,1.65,2.02,1.89,2.61,1.36,2.37,2.08,2.69,2.32,3.67)

The function c() combines the individual numbers into a vector. Try recalling (with \(\uparrow\)) and modifying the above command to

Light <- 20,20,20,20,21,24,44,60,90,94,101

and see the error message you get: in order to create a vector of specified numbers, you must use the c() function.

To see a histogram of the growth rates enter hist(rmax), which opens a graphics window and displays the histogram. There are many other built-in statistics functions in R. Some of the most commonly used are shown in the table below. Play around with these functions, and any others you can think of.


Some simple descriptive statistics in R
mean(x) the arithmetic mean of the data in x
exp(mean(log(x))) the geometric mean of x
1/mean(1/x) the harmonic mean of x
median(x) the median of x
min(x), max(x) the minimum and maximum, respectively, of x
range(x) the range of x
sd(x) the standard deviation of x
var(x) the variance of x
quantile(x, p) the p-th quantiles of x
ecdf(x) the empirical cumulative distribution function of x

To see how the algal rate of increase varies with light intensity, type

plot(rmax~Light)

plot

to plot rmax (\(y\)) against Light (\(x\)). You can also do plot(Light,rmax). Based on what you see, does it seem reasonable to hypothesize a linear relationship between these variables?

Linear regression

To perform linear regression we create a linear model using the lm() function:

fit <- lm(rmax~Light)

In the formula rmax~Light, rmax is the response variable and Light is the predictor.

The lm command created an object we have named fit. In R , an object is a data structure consisting of multiple parts. In this case, fit holds the results of the linear regression analysis. Unlike other statistics packages, R rarely summarizes an analysis for you by default. Statistical analyses in R are done by fitting a model to data and then issuing additional commands to extract desired information about the model or display results graphically.

To get a summary of the results, enter the command summary(fit). R sets up model objects (more on this later) so that the function summary() “knows” that fit was created by lm(), and produces an appropriate summary of results for an lm() object:

summary(fit)

output


Call:
lm(formula = rmax ~ Light)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5478 -0.2607 -0.1166  0.1783  0.7431 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.580952   0.244519   6.466 0.000116 ***
Light       0.013618   0.004317   3.154 0.011654 *  
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4583 on 9 degrees of freedom
Multiple R-squared:  0.5251,    Adjusted R-squared:  0.4723 
F-statistic: 9.951 on 1 and 9 DF,  p-value: 0.01165

If you’ve had a statistics course, the output will make sense to you. The table of coefficients gives the estimated regression line as \[\text{rmax}=1.58+0.0136\times\text{Light}.\] Associated with each coefficient is the standard error of the estimate. In addition, for each coefficient, a \(t\)-test is performed testing the hypothesis that the coefficient differs from zero: the \(t\)-statistic and the corresponding \(p\)-values are reported. Finally, the summary reports \(R^2\) values and the results of an \(F\)-test comparing the model to a simple normal null hypothesis.

You can add the regression line to the plot of the data with a function taking fit as its input (if you closed the plot of the data, you will need to create it again in order to add the regression line):

abline(fit)

[abline is a general-purpose function for adding lines to a plot. You can specify horizontal or vertical lines, a slope and an intercept, or a regression model: ?abline.]

plot

You can extract the coefficients from the fitted model using coef:

coef(fit)

output

(Intercept)       Light 
 1.58095214  0.01361776 

There are many other results stored inside fit. Try the following:

residuals(fit)
fitted(fit)
effects(fit)
vcov(fit)
anova(fit)

Statistics in R

Some of the important functions and packages (collections of functions) for statistical modeling and data analysis are summarized in following table. Functions are named in fixed-width font and packages in bold face. The book Modern Applied Statistics with S (Venables and Ripley 2002) gives a good practical overview, and a list of available packages and their contents can be found at the main R website (https://cran.r-project.org, and click on Packages).

A few of the functions and packages in R for statistical modeling and data analysis are listed below. There are many more, but you will have to learn about them somewhere else.


Statistical modeling and data analysis functions and packages.
   
aov, anova Analysis of variance or deviance
lm Linear models (regression, ANOVA, ANCOVA)
glm Generalized linear models (e.g. logistic, Poisson regression)
gam Generalized additive models (in package mgcv)
nls Nonlinear models via least-squares
lme, nlme Linear and nonlinear mixed-effects models (repeated measures, block effects, spatial models): in package nlme
princomp, manova, lda, cancor Multivariate analysis (see also packages vegan, ade4)
boot Package: bootstrapping functions
splines Package: nonparametric regression (more in packages fields, KernSmooth, logspline, sm and others)
survival Package: survival analysis

The R package system

R has many extra packages that provide extra functions. You may be able to install new packages from a menu within R. You can always type, e.g.,

install.packages("ggplot2")

This installs the ggplot2 package. You can install more than one package at a time:

install.packages(c("dplyr","tidyr","lubridate"))

If the machine on which you use R is not connected to the Internet, you can download the packages to some other medium (such as a flash drive or CD) and install them later, using Install from local zip file in the menu (Ⓦ) or

install.packages("ggplot2",repos=NULL)

If you do not have permission to install packages in R ’s central directory, R will may ask whether you want to install the packages in a user-specific directory. It is safe to answer “yes” here.

You will frequently get a warning message like

Warning message: In file.create(f.tg) :
cannot create file '.../packages.html', reason 'Permission denied'.

Don’t worry about this; it means the package has been installed successfully, but the main help system index files couldn’t be updated because of file permissions problems.

Data structures in R

Vectors

The most basic data-type in R is the vector. A vector is just a 1-dimensional array of values. Several different kinds of vectors are available:

  • numerical vectors,
  • logical vectors,
  • character-string vectors,
  • factors,
  • ordered factors, and
  • lists.

Lists are a bit different from the other kinds, so we’ll postpone talking about them until later.

A vector’s defining attributes are its mode—which kind of vector it is— and its length. Vectors can also have a names attribute, which allows one to refer to elements by name.

We’ve already seen how to create vectors in R using the c function, e.g.,

x <- c(1,3,5,7,9,11)
y <- c(6.5,4.3,9.1,-8.5,0,3.6)
z <- c("dog","cat","dormouse","chinchilla")
w <- c(a=4,b=5.5,c=8.8)
length(x)

output

[1] 6
mode(y)

output

[1] "numeric"
mode(z)

output

[1] "character"
names(w)

output

[1] "a" "b" "c"

The nice thing about having vectors as a basic type is that many operations in R are efficiently vectorized. That is, the operation acts on the vector as a unit, saving you the trouble of treating each entry individually. For example:

x <- x+1
xx <- sqrt(x)
x; xx

output

[1]  2  4  6  8 10 12

output

[1] 1.414214 2.000000 2.449490 2.828427 3.162278 3.464102

Notice that the operations were applied to every entry in the vector. Similarly, commands like x-5, 2*x, x/10, and x^2 apply subtraction, multiplication, division, and square to each element of the vector. The same is true for operations involving multiple vectors:

x+y

output

[1]  8.5  8.3 15.1 -0.5 10.0 15.6

In R the default is to apply functions and operations to vectors in an element by element manner; anything else (e.g. matrix multiplication) is done using special notation (discussed below).


Exercise

What do the %% and %/% operators do?


Element recycling

R has a very useful, but unusual and perhaps unexpected, behavior when two vector operands in a vectorized operation are of unequal lengths. It will effectively extend the shorter vector using element “re-cycling”: re-using elements of the shorter vector. Thus

x <- c(1,2,3)
y <- c(10,20,30,40,50,60)
x+y

output

[1] 11 22 33 41 52 63
y-x

output

[1]  9 18 27 39 48 57

Exercise

What happens when the length of the longer vector is not a multiple of that of the shorter?


Functions for creating vectors

A set of regularly spaced values can be created with the seq function, whose syntax is x <- seq(from,to,by) or x <- seq(from,to) or x <- seq(from,to,length.out). The first form generates a vector (from,from+by,from+2*by,...) with the last entry not extending further than to; in the second form the value of by is assumed to be 1 or -1, depending on whether from or to is larger; and the third form creates a vector with the desired endpoints and length. There is also a shortcut for creating vectors with by=1:

1:8

output

[1] 1 2 3 4 5 6 7 8

Exercise

Use seq to create the vector v=(1 5 9 13), and to create a vector going from 1 to 5 in increments of 0.2 .


Exercise

What happens when to is less than from in seq? What result does 3:1 give? This is one of the first “gotchas” R newbies run into.


A constant vector such as (1 1 1 1) can be created with rep function, whose basic syntax is rep(values,lengths).
For example,

rep(3,5)

output

[1] 3 3 3 3 3

creates a vector in which the value 3 is repeated 5 times. rep() will repeat a whole vector multiple times

rep(1:3,3)

output

[1] 1 2 3 1 2 3 1 2 3

or will repeat each of the elements in a vector a given number of times:

rep(1:3,each=3)

output

[1] 1 1 1 2 2 2 3 3 3

Even more flexibly, you can repeat each element in the vector a different number of times:

rep(c(3,4),c(2,5))

output

[1] 3 3 4 4 4 4 4

The value 3 was repeated 2 times, followed by the value 4 repeated 5 times. rep() can be a little bit mind-blowing as you get started, but you’ll get used to it—and it will turn out to be useful.


Some important R functions for creating and working with vectors.
   
seq(from,to,by=1) Vector of evenly spaced values (default increment = 1)
seq(from,to,length.out) Vector of evenly spaced values, specified length
c(u,v,...) Combine a set of numbers and/or vectors into a single vector
rep(a,b) Create vector by repeating elements of a b times each
hist(v) Histogram plot of value in v
mean(v),var(v),sd(v) Estimate of population mean, variance, standard deviation based on data values in v
cov(v,w) Covariance between two vectors
cor(v,w) Correlation between two vectors

Many of these have other optional arguments; use the help system (e.g. ?cor) for more information. The statistical functions such as var regard the values as samples from a population and compute the unbiased estimate of the population statistic; for example sd(1:3)=1.


Vector indexing

It is often necessary to extract a specific entry or other part of a vector. This procedure is called vector indexing, and uses square brackets ([]):

z <- c(1,3,5,7,9,11); z[3]

output

[1] 5

z[3] extracts the third element of the vector z. You can also access a block of elements by giving a vector of indices:

v <- z[c(2,3,4,5)]

or

v <- z[2:5]; v

output

[1] 3 5 7 9

This has extracted the 2nd through 5th elements in the vector.


Exercise

If you enter v <- z[seq(1,5,2)], what will happen? Make sure you understand why.


Extracted parts of a vector don’t have to be regularly spaced. For example

v <- z[c(1,2,5)]; v

output

[1] 1 3 9

Indexing is also used to set specific values within a vector. For example,

z[1] <- 12

changes the value of the first entry in z while leaving all the rest alone, and

z[c(1,3,5)] <- c(22,33,44)

changes the 1st, 3rd, and 5th values.

Elements in a named vector can be accessed and modified by name as well as by position. Thus

w

output

  a   b   c 
4.0 5.5 8.8 
w["a"]

output

a 
4 
w[c("c","b")]

output

  c   b 
8.8 5.5 
w["b"] <- 0
w

output

  a   b   c 
4.0 0.0 8.8 

Exercise

Write a one-line command to extract a vector consisting of the second, first, and third elements of z in that order.


Exercise

What happens when I set the value of an element that doesn’t exist? For example, try

z[9] <- 11

You may be wondering if vectors in R are row vectors or column vectors (if you don’t know what those are, don’t worry). The answer is “both and neither”. Vectors are printed out as row vectors, but if you use a vector in an operation that succeeds or fails depending on the vector’s orientation, R will assume that you want the operation to succeed and will proceed as if the vector has the necessary orientation. For example, R will let you add a vector of length 5 to a \(5 \times 1\) matrix or to a \(1 \times 5\) matrix, in either case yielding a matrix of the same dimensions.


Exercise

Write code that computes values of \(y=\frac{(x-1)}{(x+1)}\) for \(x=1,2,\cdots,10\), and plots \(y\) versus \(x\) with the points plotted and connected by a line.


Warning: unavoidable imprecision

Note that comparing very similar numeric values can be tricky: rounding can happen, and some numbers cannot be represented exactly on a digital computer. Although R represents floating point numbers using double precison (64 bits), by default, it only displays 7 significant digits. This behavior can be modified with options("digits").

Consider the following three lines of R code.

x <- 1.999999; x; x-2

output

[1] 1.999999

output

[1] -1e-06
x <- 1.9999999999999; x; x-2

output

[1] 2

output

[1] -9.992007e-14
x <- 1.99999999999999999; x; x-2

output

[1] 2

output

[1] 0

In the second case, all the digits are still there, but they are not shown, since x rounded to 7 decimal places is 2. Also note that x-2 is not, as it should be, exactly \(-1 \times 10^{-13}\); this loss of precision is unavoidable.

What happened in the third case?


Exercise

The sum of the geometric series \(1 + r + r^2 + r^3 + ... + r^n\) approaches the limit \(1/(1-r)\) for \(r < 1\) as \(n \rightarrow \infty\). Take \(r=0.5\) and \(n=10\), and write a one-line command that creates the vector \(G = (r^0,r^1,r^2,...,r^n)\). Compare the sum (using sum()) of this vector to the limiting value \(1/(1-r)\). Repeat for \(n=50\).


Logical operations

Some operations return a logical value (i.e., TRUE or FALSE). For example, try:

a <- 1; b <- 3; 
c <- a < b; c

output

[1] TRUE
d <- (a > b); d

output

[1] FALSE

The parentheses around a > b above are optional but do make the code easier to read. Be careful when you make comparisons with negative values: a<-1 may surprise you by setting a to 1, because <- is the assignment operator in R. Use a< -1 or a<(-1) to make this comparison.


Some comparison operators in R. Use ?Comparison to learn more.
R code comparison
x < y \(x\) strictly less than \(y\)
x > y \(x\) strictly greater than \(y\)
x <= y \(x\) less than or equal to \(y\)
x >= y \(x\) greater than or equal to \(y\)
x == y \(x\) equal to \(y\)
x != y \(x\) not equal to \(y\)
identical(x,y) \(x\) completely identical to \(y\)
all.equal(x,y) \(x\) pretty much equal to \(y\)

When we compare two vectors or matrices, comparisons are done element-by-element (and the recycling rule applies). For example,

x <- 1:5
b <- (x<=3); b

output

[1]  TRUE  TRUE  TRUE FALSE FALSE

So if x and y are vectors, then (x==y) will return a vector of values giving the element-by-element comparisons. If you want to know whether x and y are identical vectors, use identical(x,y) or all.equal(x,y). You can use ?Logical to read more about logical operations. Note the difference between = and ==. Can you figure out what happened in the following cautionary tale?

a=1:3
b=2:4
a==b

output

[1] FALSE FALSE FALSE
a=b
a==b

output

[1] TRUE TRUE TRUE

R can also do arithmetic on logical values, treating TRUE as 1 and FALSE as 0. So sum(x<3) returns the value 2, telling us that two entries of x satisfied the condition (x<3). This is useful for counting the number of elements of a vector that satisfy a given condition.

More complicated conditions are built by using logical operators to combine comparisons. The most important of these are tabulated here.


Logical operators.
operator meaning
! logical NOT
& logical AND, elementwise
&& logical AND, first element only
&#124; logical OR, elementwise
&#124;&#124; logical OR, first element only
xor(x,y) exclusive OR, elementwise

For example, execute the following and make sure you understand what happens.

a <- c(1,2,3,4)
b <- c(1,1,5,5)
(a<b) | (a>3)

output

[1] FALSE FALSE  TRUE  TRUE
a<b) || (a>3)

The two forms of logical OR (|and ||) are inclusive, meaning that x|y is true if either x or y or both are true. Use xor when exclusive OR is needed. The two forms of AND and OR differ in how they handle vectors. The shorter one (|, &) does element-by-element comparisons; the longer one (||, &&) looks only at the first element in each vector.

More on vector indexing

We can also use logical vectors (lists of TRUE and FALSE values) to pick elements out of vectors. This is useful, for example, in subsetting data.

As a simple example, we might want to focus on just the low-light values of \(r_{\text{max}}\) in the Chlorella example:

lowLight <- Light[Light<50]
lowLightrmax <- rmax[Light<50]
lowLight

output

[1] 20 20 20 20 21 24 44
lowLightrmax

output

[1] 1.73 1.65 2.02 1.89 2.61 1.36 2.37

What is really happening here (think about it for a minute) is that Light<50 generates a logical vector the same length as Light (TRUE TRUE TRUE ...) which is then used to select the appropriate values.

If you want the positions at which Light is lower than 50, you can use which: which(Light<50). If you wanted the position at which the maximum value of Light occurs, you could say which(Light==max(Light)) or which.max(Light). Note that, if Light has several elements that are maximal, the first will return the positions of them all, while the second will return the position only of the first one.


Exercise

What would happen if instead of setting lowLight you replaced Light by saying Light <- Light[Light<50]? Why would that be the wrong thing to do in the above example?

We can also combine logical operators:

Light[Light<50 | rmax <= 2.0]

output

[1] 20 20 20 20 21 24 44
rmax[Light<50 | rmax <= 2.0]

output

[1] 1.73 1.65 2.02 1.89 2.61 1.36 2.37

Note that we have to be sure to use the elementwise versions of AND and OR (& and |).


Exercise

runif(n) is a function that generates a vector of n random, uniformly distributed numbers between 0 and 1. Create a vector of 20 numbers, then find the subset of those numbers that is less than the mean. More on runif and related functions soon.


Exercise

Find the positions of the elements that are less than the mean of the vector you just created (e.g. if your vector were (0.1 0.9 0.7 0.3) the answer would be (1 4)).


As mentioned above, vectors can have names associated with their elements: if they do, you can also extract elements by name (use names() to find out the names).

x <- c(first=7,second=5,third=2)
names(x)

output

[1] "first"  "second" "third" 
x["first"]

output

first 
    7 
x[c("third","first")]

output

third first 
    2     7 
x[c('second','first')] <- c(8,9); x

output

 first second  third 
     9      8      2 

Finally, it is sometimes handy to be able to drop a particular set of elements, rather than taking a particular set: you can do this with negative indices. For example, x[-1] extracts all but the first element of a vector.


Exercise

Specify two ways to take only the elements in the odd positions (first, third, …) of a vector of arbitrary length.


Matrices and arrays

Creating matrices

A matrix is a two-dimensional array of items. Most straightforwardly, we can create a matrix by specifying the number of rows and columns, and specifying the entries. For example

X <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3); X

output

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

takes the values 1 to 6 and reshapes them into a 2 by 3 matrix. Note that the values in the data vector are put into the matrix column-wise, by default. You can change this by using the optional parameter byrow:

A <- matrix(1:9,nrow=3,ncol=3,byrow=TRUE); A

output

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

R will re-cycle through entries in the data vector, if need be, to fill a matrix of the specified size. So for example

matrix(1,nrow=50,ncol=50)

creates a \(50{\times}50\) matrix, every entry of which is \(1\).


Exercise
Use a command of the form X <- matrix(v,nrow=2,ncol=4) where v is a data vector, to create the following matrix X:

output

     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2

R will also cause a matrix to behave like a vector whenever it makes sense: for example sum(X) above is 12.


Exercise

Use rnorm and matrix to create a \(5{\times}7\) matrix of Gaussian random numbers with mean 1 and standard deviation 2.


Another useful function for creating matrices is diag. diag(v,n) creates an \(n{\times}n\) matrix with data vector \(v\) on its diagonal. So for example diag(1,5) creates the \(5{\times}5\) identity matrix, which has 1s on the diagonal and 0 everywhere else.

Finally, one can use the data.entry function. This function can only edit existing matrices, but for example (try this now!)

A <- matrix(0,3,4)
data.entry(A)

will create A as a \(3{\times}4\) matrix, and then call up a spreadsheet-like interface in which the values can be edited directly. You can further modify A with the same primitive interface, using fix.


Some important functions for creating and working with matrices.
R code purpose
matrix(v,nrow=m,ncol=n) \(m \times n\) matrix using the values in v
t(A) transpose (exchange rows and columns) of matrix A
dim(X) dimensions of matrix X. dim(X)[1]=# rows, dim(X)[2]=# columns
data.entry(A) call up a spreadsheet-like interface to edit the values in A
diag(v,n) diagonal \(n \times n\) matrix with \(v\) on diagonal, 0 elsewhere (v is 1 by default, so diag(n) gives an \(n \times n\) identity matrix)
cbind(a,b,c,...) combine compatible objects by attaching them along columns
rbind(a,b,c,...) combine compatible objects by attaching them along rows
as.matrix(x) convert an object of some other type to a matrix, if possible
outer(v,w) “outer product” of vectors v, w: the matrix whose \((i,j)\)-th element is v[i]*w[j]

Many of these functions have additional optional arguments; use the help system for full details.


cbind and rbind

If their sizes match, vectors can be combined to form matrices, and matrices can be combined with vectors or matrices to form other matrices. The functions that do this are cbind and rbind.

cbind binds together columns of two objects. One thing it can do is put vectors together to form a matrix:

C <- cbind(1:3,4:6,5:7); C

output

     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    2    5    6
[3,]    3    6    7

Remember that R interprets vectors as row or column vectors according to what you’re doing with them. Here it treats them as column vectors so that columns exist to be bound together. On the other hand,

D <- rbind(1:3,4:6); D

output

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

treats them as rows. Now we have two matrices that can be combined.


Exercise

Verify that rbind(C,D) works, cbind(C,C) works, but cbind(C,D) doesn’t. Why not?


Matrix indexing

Matrix indexing is like vector indexing except that you have to specify both the row and column, or range of rows and columns. For example z <- A[2,3] sets z equal to 6, which is the (2nd row, 3rd column) entry of the matrix A that you recently created, and

A[2,2:3]; 

output

[1] 5 6
B <- A[2:3,1:2]; B

output

     [,1] [,2]
[1,]    4    5
[2,]    7    8

There is an easy shortcut to extract entire rows or columns: leave out the limits, leaving a blank before or after the comma.

first.row <- A[1,]; first.row

output

[1] 1 2 3
second.column <- A[,2]; second.column;

output

[1] 2 5 8

(What does A[,] do?)

As with vectors, indexing also works in reverse for assigning values to matrix entries. For example,

A[1,1] <- 12; A

output

     [,1] [,2] [,3]
[1,]   12    2    3
[2,]    4    5    6
[3,]    7    8    9

The same can be done with blocks, rows, or columns, for example

A[1,] <- c(2,4,5); A

output

     [,1] [,2] [,3]
[1,]    2    4    5
[2,]    4    5    6
[3,]    7    8    9

If you use which() on a matrix, R will normally treat the matrix as a vector—so for example which(A==8) will give the answer 6 (can you see why?). However, which() does have an option that will treat its argument as a matrix:

which(A>=8,arr.ind=TRUE)

output

     row col
[1,]   3   2
[2,]   3   3

Arrays

The generalization of the matrix to more (or less) than 2 dimensions is the array. In fact, in R , a matrix is nothing other than a 2-dimensional array. How does R store arrays? In the simplest possible way: an array is just a vector plus information on the dimensions of the array. Most straightforwardly, we can create an array from a vector:

X <- array(1:24,dim=c(3,4,2)); X

output

, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

Note, again, that the arrays are filled in a particular order: the first dimension first, then the second, and so on. A one-dimensional array is subtly different from a vector:

y <- 1:5; y

output

[1] 1 2 3 4 5
z <- array(1:5,dim=5); z

output

[1] 1 2 3 4 5
y==z

output

[1] TRUE TRUE TRUE TRUE TRUE
identical(y,z)

output

[1] FALSE
dim(y)

output

NULL
dim(z)

output

[1] 5

Exercise

What happens when we set the dimension attribute on a vector? For example:

x <- seq(1,27)
dim(x) <- c(3,9)
is.array(x)
is.matrix(x)

Make sure you understand the above.


Factors

For dealing with measurements on the nominal and ordinal scales (Stevens 1946), R provides vectors of type factor. A factor is a variable that can take one of a finite number of distinct levels. To construct a factor, we can apply the factor function to a vector of any class:

x <- rep(c(1,2),each=3); factor(x)

output

[1] 1 1 1 2 2 2
Levels: 1 2
trochee <- c("jetpack","ferret","pizza","lawyer")
trochee <- factor(trochee); trochee

output

[1] jetpack ferret  pizza   lawyer 
Levels: ferret jetpack lawyer pizza

By default, factor sets the levels to the unique set of values taken by the vector. To modify that behavior, there is the levels argument:

factor(trochee,levels=c("ferret","pizza","cowboy","scrapple"))

output

[1] <NA>   ferret pizza  <NA>  
Levels: ferret pizza cowboy scrapple

Note that the order of the levels is arbitrary, in keeping with the fact that the only operation permissible on the nominal scale is the test for equality. In particular, the factors created with the factor command are un-ordered: there is no sense in which we can ask whether, e.g., ferret < cowboy.

To represent variables measured on the ordinal scale, R provides ordered factors, constructed via the ordered function. An ordered factor is just like an un-ordered factor except that the order of the levels matters:

x <- ordered(sample(x=letters,size=22,replace=TRUE)); x

output

 [1] z m u p a g c d b n x d g n w l c t z p b x
14 Levels: a < b < c < d < g < l < m < n < p < t < ... < z

Here, we’ve relied on ordered’s default behavior, which is to put the levels in alphabetical order. It’s typically safer to specify explicitly what order we want:

x <- ordered(x,levels=rev(letters))
x[1:5] < x[18:22]

output

[1]  TRUE FALSE  TRUE  TRUE FALSE

Exercise

Look up the documentation on the sample function used above.


Exercise

Can I make a matrix or an array out of a factor variable?


Exercise

What is the internal representation of factors in R ? Try converting factors to integers using as.integer. Try converting an integer vector to a factor using factor.


Lists

While vectors and matrices may seem familiar, lists may be new to you. Vectors and matrices have to contain elements that are all the same type: lists in R can contain anything—vectors, matrices, other lists, arbitrary objects. Indexing is a little different too: use [[ ]] to extract an element of a list by number or name or $ to extract an element by name (only). Given a list like this:

L <- list(A=x,B=trochee,C=c("a","b","c"))

Then L$A, L[["A"]], and L[[1]] will each return the first element of the list. To extract a sublist, use the ordinary single square brackets: []:

L[c("B","C")]

output

$B
[1] jetpack ferret  pizza   lawyer 
Levels: ferret jetpack lawyer pizza

$C
[1] "a" "b" "c"

Data frames

Vectors, matrices, and lists of one sort or another are found in just about every programming language. The data frame structure is (or was last time I checked) unique to R , and is central to many of R ’s useful data-analysis features. It’s very natural to want to store data in vectors and matrices. Thus, in the example above, we stored measurements of two variables (\(r_\text{max}\) and light level) in vectors. This was done in such a way that the observations of the first replicate were stored in the first element of each vector, the second in the second, and so on. To explicitly bind together observations corresponding to the same replicate, we might join the two vectors into a matrix using cbind. In the resulting data structure, each row would correspond to an observation, each column to a variable. This is possible, however, only because both variables are of the same type: they’re both numerical. More commonly, a data set is made up of several different kinds of variables. The data frame is R ’s solution to this problem.

Data frames are a hybrid of lists and vectors. Internally, they are a list of vectors which can be of different types but must all be the same length. However, they behave somewhat like matrices, in that you can do most things to them that you can do with matrices. You can index them either the way you would index a list, using [[ ]] or $—where each variable is a different item in the list—or the way you would index a matrix.
You can turn a data frame into a matrix (using as.matrix(), but only if all variables are of the same class) and a matrix into a data frame (using as.data.frame()).

When data are read into R from an external file using one of the read.xxx commands (read.csv, read.table, read.xls, etc.), the object that is created is a data frame.

data.url <- "https://kingaa.github.io/R_Tutorial/data/ChlorellaGrowth.csv"
dat <- read.csv(data.url,comment.char='#')
dat

output

 light rmax
    20 1.73
    20 1.65
    20 2.02
    20 1.89
    21 2.61
    24 1.36
    44 2.37
    60 2.08
    90 2.69
    94 2.32
   101 3.67

Exercise

Download the hurricanes.csv file from the tutorial website: https://kingaa.github.io/R_Tutorial/hurricanes.csv. Examine the resulting data frame by printing it and using the str command. Note the class type of each variable.


Probability distributions in R

R contains a great deal of distilled knowledge about probability distributions. In particular, for each of a large class of important distributions, methods to compute probability distribution functions (p.d.f., i.e., density or mass functions), cumulative distribution functions (c.d.f.), and quantile functions are available, as are methods for simulating these distributions (i.e., drawing random deviates with the desired distribution). Conveniently, these are all named using the same scheme, as shown in the following table.


Functions for working with probability distributions.
   
dxxx(x, ...) probability distribution function
pxxx(q, ...) cumulative distribution function
qxxx(p, ...) quantile function (i.e., inverse of pxxx)
rxxx(n, ...) simulator

In the above xxx stands for the abbreviated name of the specific distribution (the “R name” as given in following table). In each case, the ... indicates that additional, distribution-specific, parameters are to be supplied. The following table lists some of the more common distributions built in to R. A complete list of distributions provided by the base stats package can be viewed by executing ?Distributions.


The most commonly used built-in probability distributions.
Distribution R name Parameters Range
Discrete distributions
Binomial binom size, prob \(0,1,\dots,\mathrm{size}\)
Bernoulli binom (size=1) prob \(0,1\)
Poisson pois lambda \(0,1,\dots,\infty\)
Negative binomial nbinom size, (prob or mu) \(0,1,\dots,\infty\)
Geometric geom rate \(0,1,\dots,\infty\)
Hypergeometric hyper m, n, k \(0,1,\dots,m\)
Multinomial multinom size, prob \(\{(x_1,\dots,x_m): \sum\!x_i=\mathrm{size}\}\)
Continuous distributions      
Uniform unif min, max \([\mathrm{min},\mathrm{max}]\)
Normal norm mean, sd \((-\infty,\infty)\)
Gamma gamma shape, (scale or rate) \([0,\infty)\)
Exponential exp mu \([0,\infty)\)
Beta beta shape1, shape2 \([0,1]\)
Lognormal lnorm meanlog, sdlog \([0,\infty)\)
Cauchy cauchy location, scale \((\infty,\infty)\)
\(\chi^2\) chisq df \([0,\infty)\)
Student’s T t df \([0,\infty)\)
Weibull weibull shape, scale \([0,\infty)\)

For complete documentation, execute e.g., ?dbinom in an R session.


Exercise

Write codes to plot the c.d.f. of the binomial distribution (pbinom). Verify the relationship between the p.d.f. (dbinom) and the c.d.f.


Exercise

Plot the p.d.f. and c.d.f. of the Poisson distribution for several values of the parameter \(\lambda\).


Exercise

Plot the p.d.f. and c.d.f. of the gamma distribution for several values of the shape and scale (or rate) parameters. Put in a vertical line to indicate the mean. Be sure to explore both large and small values of the shape parameter.


Exercise

Plot the p.d.f. and c.d.f. of the beta distribution for several values of the first and second shape parameters. For each plot, put in a vertical line to indicate the mean. Be sure to explore the full ranges of the parameters.


Scripts and data files

Modeling and complicated data analysis are accomplished more efficiently using scripts, which are a series of commands stored in a text file. The Windows and MacOS versions of R both have basic script editors, but any text editor can be used, e.g. emacs (with ESS, “emacs speaks statistics”). You should never use MS Word or other word processors!

Most programs for working with models or analyzing data follow a simple pattern of program parts:

  1. Setup statements.
  2. Data input.
  3. Computations on the data.
  4. Output text and/or graphics.

For example, a script file might

  1. Load some packages, or “source” another script file that creates some functions (more on functions later).
  2. Read in data from a text file.
  3. Fit several statistical models to the data and compare them.
  4. Plot the results and save them to disk for inclusion in a paper.

Even for relatively simple tasks, script files are useful for building up a calculation step-by-step, making sure that each part works before adding on to it. They are also essential for making research reproducible.

Tips for working with data and script files (sounding slightly scary but just trying to help you avoid common pitfalls):

  • To let R know where data and script files are located, you have a few choices:
    1. change your working directory to wherever the file(s) are located before running R.
    2. change your working directory within an R session to wherever the file(s) are located using the setwd() (set working directory) function, e.g. setwd("c:/temp").
    3. spell out the path to the file explicitly.

The first option is far preferable, since it makes your codes portable: you can copy the whole directory wherever you like and R will always find what it needs.

  • Use a single forward slash to separate folders (e.g., Ⓦ "c:/My~Documents/R/script.R" or Ⓧ "/home/kingaa/projects/hurricanes"): this works on all platforms.

  • It’s important that script files be preserved as plain text and data files also as plain text, using comma- or tab-separated format.

There are several common situations that lead to files saved in bad formats:

  1. if you use a web browser to download files, be careful that it doesn’t automatically append some weird suffix to the files or translate them into something other than plain text.

  2. if your web browser uses “file associations” (e.g., it thinks that all files ending in .dat are Excel files), make sure to save the file as plain text, and without any extra extensions;

  3. never use Microsoft Word or other word-processors to edit your data and script files; MS Word will try very hard to get you to save them as Word (rather than text) files, which will screw them up!

  4. If you send script files by e-mail, even if you are careful to send them as plain text, lines will occasionally get broken in different places, which can lead to confusion. Beware.

As a first example, the file Intro1.R has the commands from the interactive regression analysis. Download Intro1.R and save it to your computer:

course.url <- "https://kingaa.github.io/R_Tutorial/"
download.file(paste0(course.url,"Intro1.R"),destfile="Intro1.R",mode="w")

Open your copy of Intro1.R. In your editor, select and Copy the entire text of the file, and then Paste the text into the R console window. This has the same effect as entering the commands by hand into the console: they will be executed and so a graph is displayed with the results. Cut-and-Paste allows you to execute script files one piece at a time (which is useful for finding and fixing errors). The source function allows you to run an entire script file, e.g.,

source("Intro1.R")

Another important time-saver is loading data from a text file. Grab copies of Intro2.R and ChlorellaGrowth.csv from the dropsite to see how this is done.

download.file(paste0(course.url,"Intro2.R"),destfile="Intro2.R",mode="w")
download.file(paste0(course.url,"data/ChlorellaGrowth.csv"),destfile="ChlorellaGrowth.csv",mode="w")

In ChlorellaGrowth.csv the two variables are entered as columns of a data matrix. Then instead of typing these in by hand, the command

X <- read.csv("ChlorellaGrowth.csv",comment.char='#')

reads the file (from the current directory) and puts the data values into the variable X. Note that as specified above you need to make sure that R is looking for the data file in the right place: make sure to download the data file into your current working directory. Note also the comment.char option in the call to read.csv; this tells R to ignore lines that begin with a # and allows us to use self-documenting data files (a very good practice).

Extract the variables from X with the commands

Light <- X[,1]
rmax <- X[,2]

Think of these as shorthand for “Light = everything in column 1 of X”, and “rmax = everything in column 2 of X” (we’ll learn about working with data frames later). From there on out it’s the same as before, with some additions that set the axis labels and add a title.


Exercise

Make a copy of Intro2.R under a new name, and modify the copy so that it does linear regression of algal growth rate on the natural log of light intensity, LogLight=log(Light), and plots the data appropriately. You should end up with a graph that resembles the following.

plot


Exercise

Run Intro2.R, then enter the command plot(fit) in the console and follow the directions in the console. Figure out what just happened by entering ?plot.lm to bring up the help page for the function plot.lm() that carries out a plot() command for an object produced by lm(). (This is one example of how R uses the fact that statistical analyses are stored as model objects. fit “knows” what kind of object it is (in this case an object of type lm), and so plot(fit) invokes a function that produces plots suitable for an lm object.) Answer: R produced a series of diagnostic plots exploring whether or not the fitted linear model is a suitable fit to the data. In each of the plots, the 3 most extreme points (the most likely candidates for “outliers”) have been identified according to their sequence in the data set.


The axes in plots are scaled automatically, but the outcome is not always ideal (e.g. if you want several graphs with exactly the same axes limits). You can control scaling using the xlim and ylim arguments in plot: plot(x,y,xlim=c(x1,x2), [other stuff]) will draw the graph with the \(x\)-axis running from x1 to x2, and using ylim=c(y1,y2) within the plot() command will do the same for the \(y\)-axis.


Exercise

Create a plot of growth rate versus light intensity with the \(x\)-axis running from 0 to 120, and the \(y\)-axis running from 1 to 4.


Exercise

Several graphs can be placed within a single figure by using the par function (short for “parameter”) to adjust the layout of the plot. For example the command par(mfrow=c(m,n)) divides the plotting area into \(m\) rows and \(n\) columns. As a series of graphs is drawn, they are placed along the top row from left to right, then along the next row, and so on. mfcol=c(m,n) has the same effect except that successive graphs are drawn down the first column, then down the second column, and so on.

Save Intro2.R with a new name and modify the program as follows. Use mfcol=c(2,1) to create graphs of growth rate as a function of Light, and of log(growth rate) as a function of log(Light) in the same figure.
Do the same again, using mfcol=c(1,2).


Exercise

Use ?par to read about other plot control parameters that can be set using par(). Then draw a \(2 \times 2\) set of plots, each showing the line \(y=5x+3\) with x running from 3 to 8, but with 4 different line styles and 4 different line colors.


Exercise

Modify one of your scripts so that at the very end it saves the plot to disk. You can accomplish this using the dev.print() function. Do ?dev.print to read about this function. Note that you need to specify the file argument to write your graphics to a file; if you don’t, it will try and send it to a printer. Also note that many alternative formats are available via the dev argument.


Looping in R

Very frequently, a computation involves iterating some procedure across a range of cases, and every computer language I’ve ever come across has one or more facilities for producing such loops. R is no exception, though judging by their code, many R programmers look down their noses at loops. The fact remains that, at some point in life, one simply has to write a for loop. Here, we’ll look at the looping constructs available in R.

for loops

Execute the following code.

phi <- 1
for (k in 1:100) {
  phi <- 1+1/phi
  print(c(k,phi))
}

What does it do? Sequentially, for each value of k between 1 and 100, phi is modified. More specifically, at the beginning of the for loop, a vector containing all the integers from 1 to 100 in order is created. Then, k is set to the first element in that vector, i.e., 1. Then the R expression from the { to the } is evaluated. When that expression has been evaluated, k is set to the next value in the vector. The process is repeated until, at the last evaluation, k has value 100.

As an aside, note that the final value of phi is the Golden Ratio, 1.618034.

As an example of a situation where a loop of some sort is really needed, suppose we wish to iterate the Beverton-Holt map (one of the simplest discrete-time models of population growth), \[\begin{equation*} N_{t+1} = \frac{a\,N_t}{1+b\,N_t}. \end{equation*}\] We simply have no option but to do the calculation one step at a time. Here’s an R code that does this

a <- 1.1
b <- 0.001
T <- seq(from=1,to=200,by=1)
N <- numeric(length(T))
n <- 2
for (t in T) {
  n <- a*n/(1+b*n)
  N[t] <- n
}

Spend some time to make sure you understand what happens at each line of the above. We can plot the population sizes \(N_t\) through time via

plot(T,N)

plot

Gotcha:

An alternative way to do the above might be something like

N <- numeric(length(T))
for (t in 1:length(T)) {
  n <- a*n/(1+b*n)
  N[t] <- n
}

Exercise

Check that this works with different vectors T. What happens when T has length 1? What happens when T has length 0? Why? To avoid this trap, it’s preferable to use seq_len, seq_along, or seq.int instead of the 1:n construction used above. This and many other R gotchas are described in the useful R Inferno (Burns 2012).


while loops

A second looping construct is the while loop. Using while, we can compute the Golden Ratio as before:

phi <- 20
k <- 1
while (k <= 100) {
  phi <- 1+1/phi
  print(c(k,phi))
  k <- k+1
}

What’s going on here? First, phi and k are initialized. Then the while loop is started. At each iteration of the loop, phi is modified, and intermediate results printed, as before. In addition, k is incremented. The while loop continues to iterate until the condition k <= 100 is no longer TRUE, at which point, the while loop terminates.

Note that here we’ve chosen a large number (100) of iterations. Perhaps we could get by with fewer. If we wanted to terminate the iterations as soon as the value of phi stopped changing, we could do:

phi <- 20
conv <- FALSE
while (!conv) {
  phi.new <- 1+1/phi
  conv <- phi==phi.new
  phi <- phi.new
}

Exercise

Verify that the above works as intended. How many iterations are needed?


Another way to accomplish this would be to use break to stop the iteration when a condition is met. For example

phi <- 20
while (TRUE) {
  phi.new <- 1+1/phi
  if (phi==phi.new) break
  phi <- phi.new
}

While this while loop is equivalent to the one before, it does have the drawback that, if the break condition is never met, the loop will go on indefinitely. An alternative that avoids this is to use a for loop with a large (but finite) number of iterations, together with break:

phi <- 3
for (k in seq_len(1000)) {
  phi.new <- 1+1/phi
  if (phi==phi.new) break
  phi <- phi.new
}

Exercise

Recompute the trajectory of the Beverton-Holt model using a while loop. Verify that your answer is exactly equivalent to the one above.


repeat loops

A third looping construct in R involves the repeat keyword. For example,

phi <- 12
repeat {
  phi.new <- 1/(1+phi)
  if (phi==phi.new) break
  phi <- phi.new
}

In addition, R provides the next keyword, which, like break, is used in the body of a looping construct. Rather than terminating the iteration, however, it aborts the current iteration and leads to the immediate execution of the next iteration.

For more information on looping and other control-flow constructs, execute ?Control.

Functions and environments

Definitions and examples

An extremely useful feature in R is the ability to write arbitrary functions. A function, in this context, is an algorithm that performs a specific computation that depends on inputs (the function’s arguments) and produces some output (the function’s value) and/or has some side effects. Let’s see how this is done.

Here is a function that squares a number.

sq <- function (x) x^2

The syntax is function (arglist) expr. The one argument in this case is x. When a particular value of x is supplied, R performs the squaring operation. The function then returns the value x^2:

sq(3); sq(9); sq(-2);

output

[1] 9

output

[1] 81

output

[1] 4

Here is a function with two arguments and a more complex body, as we call the expression that is evaluated when the function is called.

f <- function (x, y = 3) {
  a <- sq(x)
  a+y
}

Here, the body is the R expression from { to }. Unless the return codeword is used elsewhere in the function body, the value returned is always the last expression evaluated. Thus:

f(3,0); f(2,2); f(3);

output

[1] 9

output

[1] 6

output

[1] 12

Note that in the last case, only one argument was supplied. In this case, y assumed its default value, 3.

Note that functions need not be assigned to symbols; they can be anonymous:

function (x) x^5

output

function (x) x^5
(function (x) x^5)(2)

output

[1] 32

A function can also have side effects, e.g.,

hat <- "hat"
hattrick <- function (y) {
  hat <<- "rabbit"
  2*y
}
hat; hattrick(5); hat

output

[1] "hat"

output

[1] 10

output

[1] "rabbit"

However, the very idea of a function insists that we should never experience unintentional side effects. We’ll see how R realizes this imperative below.

An aside

If we want the function not to automatically print, we can wrap the return value in invisible():

hattrick <- function (y) {
  hat <<- "rabbit"
  invisible(2*y)
}
hattrick(5)
print(hattrick(5))

output

[1] 10

A function in R is defined by three components:

  • its formal parameters, i.e., its argument list,
  • its body, and
  • its environment, i.e., the context in which the function was defined.

R provides simple functions to interrogate these function components:

formals(hattrick)

output

$y
body(hattrick)

output

{
    hat <<- "rabbit"
    invisible(2 * y)
}
environment(hattrick)

output

<environment: R_GlobalEnv>

Function scope

Note: This section draws heavily, and sometimes verbatim, on §10.7 of the Introduction to R manual (R Core Team 2014a).

As noted above, a paramount consideration in the implementation of functions in any programming language is that unintentional side effects should never occur. In particular, I should be free to write a function that creates temporary variables as an aid to its computations, and be able to rest assured that no variables I create temporarily will interfere with any other variables I’ve defined anywhere else. To accomplish this, R has a specific set of scoping rules.

Consider the function

f <- function (x) {
  y <- 2*x
  print(x)
  print(y)
  print(z)
}

In this function’s body, x is a formal parameter, y is a local variable, and z is a free, or unbound variable. When f is evaluated, each of these variables must be bound to some value. In R , the free variable bindings are resolved—each time the function is evaluated—by first looking in the environment where the function was created. This is called lexical scope. Thus if we execute

f(3)

we get an error, because no object named z can be found. If, however, we do

z <- 10
f(3)

output

[1] 3
[1] 6
[1] 10

we don’t get an error, because z is defined in the environment, <environment: R_GlobalEnv>, of f. Similarly, when we do

z <- 13
g <- function (x) {
  2*x+z
}
f <- function (x) {
  z <- -100
  g(x)
}
f(5)  

output

[1] 23

The relevant value of z is the one in the environment where g was defined, not the one in the environment wherein it is called.

Nested functions and environments

In each of the following examples, make sure you understand exactly what has happened.

Consider this:

y <- 11
f <- function (x) {
  y <- 2*x
  y+x
}
f(1); y

output

[1] 3

output

[1] 11

Exercise

Why hasn’t y changed its value?


What about this?

y <- 0
f <- function (x) {
  2*x+y
}
f(1); y

output

[1] 2

output

[1] 0

Exercise

Why is the return value of f different?


How about the following?

g <- function (y) y
f <- function (x) {
  g <- function (y) {
    2*y
  }
  11+g(x)
}
f(2); g(2)

output

[1] 15

output

[1] 2

Exercise

Be sure you understand what’s happening at each line and why you get the results you see.


Another example:

y <- 11
f <- function (x) {
  y <- 22
  g <- function (x) {
    x+y
  }
  g(x)
}
f(1); y

output

[1] 23

output

[1] 11

Exercise

Which value of y was used?


Compare that with this:

y <- 11
f <- function (x) {
  y <- 22
  g <- function (x) {
    y <<- 7
    x+y
  }
  c(g(x),y)
}
f(1); y

output

[1] 8 7

output

[1] 11

and this:

y <- 11
f <- function (x) {
  g <- function (x) {
    x+y
  }
  g(x)
}
f(1); y

output

[1] 12

output

[1] 11

and this:

f <- function (x) {
  y <- 37
  g <- function (x) {
    h <- function (x) {
      x+y
    }
    h(x)
  }
  g(x)
}
f(1); y

output

[1] 38

output

[1] 11

Finally, consider the following:

rm(y)
f <- function (x) {
  g <- function (x) {
    y <<- 2*x
    y+1
  }
  g(x)+1
}
f(11); y

output

[1] 24

output

[1] 22
f(-2); y

output

[1] -2

output

[1] -4

Exercise

In the above five code snippets, make sure you understand the differences.


As mentioned above, each function is associated with an environment: the environment within which it was defined. When a function is evaluated, a new temporary environment is created, within which the function’s calculations are performed. Every new environment has a parent, the environment wherein it was created. The parent of this new environment is the function’s environment. To see this, try

f <- function () {
  g <- function () {
    h <- function () {
      cat("inside function h:\n")
      cat("current env: ")
      print(environment())
      cat("parent env: ")
      print(parent.frame(1))
      cat("grandparent env: ")
      print(parent.frame(2))
      cat("great-grandparent env: ")
      print(parent.frame(3))
      invisible(NULL)
    }
    cat("inside function g:\n")
    cat("environment of h: ")
    print(environment(h))
    cat("current env: ")
    print(environment())
    cat("parent env: ")
    print(parent.frame(1))
    cat("grandparent env: ")
    print(parent.frame(2))
    h()
  }
  cat("inside function f:\n")
  cat("environment of g: ")
  print(environment(g))
  cat("current env: ")
  print(environment())
  cat("parent env: ")
  print(parent.frame(1))
  g()
}
cat("environment of f: "); print(environment(f))
cat("global env: "); print(environment())
f()

output

environment of f: <environment: R_GlobalEnv>
global env: <environment: R_GlobalEnv>
inside function f:
environment of g: <environment: 0x55cfef96ed08>
current env: <environment: 0x55cfef96ed08>
parent env: <environment: R_GlobalEnv>
inside function g:
environment of h: <environment: 0x55cfe9c56ac8>
current env: <environment: 0x55cfe9c56ac8>
parent env: <environment: 0x55cfef96ed08>
grandparent env: <environment: R_GlobalEnv>
inside function h:
current env: <environment: 0x55cfe9b71038>
parent env: <environment: 0x55cfe9c56ac8>
grandparent env: <environment: 0x55cfef96ed08>
great-grandparent env: <environment: R_GlobalEnv>

Each variable referenced in the function’s body is bound, first, to a formal argument if possible. If a local variable of that name has previously been created (via one of the assignment operators <-, ->, or =, this is the variable that is affected by any subsequent assignments. If the variable is neither a formal parameter nor a local variable, then the parent environment of the function is searched for that variable. If the variable has not been found in the parent environment, then the grand-parent environment is searched, and so on.

If the assignment operators <<- or ->> are used, a more extensive search for the referenced assignee is made. If the variable does not exist in the local environment, the parent environment is searched. If it does not exist in the parent environment, then the grand-parent environment is searched, and so on. Finally, if the variable cannot be found anywhere along the lineage of environments, a new global variable is created, with the assigned value.

The apply family of functions

As mentioned above, there are circumstances under which looping constructs are really necessary. Very often, however, we wish to perform some operation across all the elements of a vector, array, or dataset. In such cases, it is faster and more elegant (to the R afficiando’s eye) to use the apply family of functions.

List apply: lapply

lapply applies a function to each element of a list or vector, returning a list.

x <- list("teenage","mutant","ninja","turtle",
          "hamster","plumber","pickle","baby")
lapply(x,nchar)

output

[[1]]
[1] 7

[[2]]
[1] 6

[[3]]
[1] 5

[[4]]
[1] 6

[[5]]
[1] 7

[[6]]
[1] 7

[[7]]
[1] 6

[[8]]
[1] 4
y <- c("teenage","mutant","ninja","turtle",
       "hamster","plumber","pickle","baby")
lapply(y,nchar)

output

[[1]]
[1] 7

[[2]]
[1] 6

[[3]]
[1] 5

[[4]]
[1] 6

[[5]]
[1] 7

[[6]]
[1] 7

[[7]]
[1] 6

[[8]]
[1] 4

Sloppy list apply: sapply

sapply isn’t content to always return a list: it attempts to simplify the results into a non-list vector if possible.

x <- list("pizza","monster","jigsaw","puddle",
          "hamster","plumber","pickle","baby")
sapply(x,nchar)

output

[1] 5 7 6 6 7 7 6 4
y <- c("pizza","monster","jigsaw","puddle",
       "hamster","plumber","pickle","baby")
sapply(y,nchar)

output

  pizza monster  jigsaw  puddle hamster plumber  pickle 
      5       7       6       6       7       7       6 
   baby 
      4 

Multiple-list apply: mapply

mapply is a multiple-argument version of sapply:

x <- c("pizza","monster","jigsaw","puddle")
y <- c("cowboy","barbie","slumber","party")
mapply(paste,x,y,sep="/")

output

           pizza          monster           jigsaw 
  "pizza/cowboy" "monster/barbie" "jigsaw/slumber" 
          puddle 
  "puddle/party" 

As usual, the recycling rule applies:

mapply(paste,x,y[2:3])

output

            pizza           monster            jigsaw 
   "pizza barbie" "monster slumber"   "jigsaw barbie" 
           puddle 
 "puddle slumber" 
mapply(paste,x[c(1,3)],y)

output

          pizza          jigsaw            <NA> 
 "pizza cowboy" "jigsaw barbie" "pizza slumber" 
           <NA> 
 "jigsaw party" 

Array apply: apply

apply is very powerful and a bit more complex. It allows an arbitrary function to applied to each slice of an array, where the slices can be defined in all possible ways. Let’s create a matrix:

A <- array(data=seq_len(15),dim=c(3,5)); A

output

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

To apply an operation to each row, we marginalize over the first dimension (rows). For example, to sum the rows, we’d do

apply(A,1,sum)

output

[1] 35 40 45

To sum the columns (the second dimension), we’d do

apply(A,2,sum)

output

[1]  6 15 24 33 42

Now suppose we have a 3-dimensional array:

A <- array(data=seq_len(30),dim=c(3,5,2)); A

output

, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   16   19   22   25   28
[2,]   17   20   23   26   29
[3,]   18   21   24   27   30

To sum the rows within each slice, we’d do

apply(A,c(1,3),sum)

output

     [,1] [,2]
[1,]   35  110
[2,]   40  115
[3,]   45  120

while to sum the slices, we’d do

apply(A,3,sum)

output

[1] 120 345

For each of the above, make sure you understand exactly what has happened.

Of course, we can apply an anonymous function wherever we apply a named function:

apply(A,c(2,3),function (x) sd(x)/sqrt(length(x)))

output

          [,1]      [,2]
[1,] 0.5773503 0.5773503
[2,] 0.5773503 0.5773503
[3,] 0.5773503 0.5773503
[4,] 0.5773503 0.5773503
[5,] 0.5773503 0.5773503

If any additional arguments are given to apply, these are passed to the function. For example, make sure you understand what is happening in the lines:

apply(A,c(1,2),function (x, y) sum(x>y),y=8)

output

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    2    2
[2,]    1    1    1    2    2
[3,]    1    1    2    2    2
apply(A,c(1,2),function (x, y) sum(x>y),y=-1)

output

     [,1] [,2] [,3] [,4] [,5]
[1,]    2    2    2    2    2
[2,]    2    2    2    2    2
[3,]    2    2    2    2    2

Table apply: tapply

tapply is, in a way, an extension of table. The syntax is tapply(X,INDEX,FUN,...), where X is a vector, INDEX is a list of one or more factors, each the same length as X, and FUN is a function. The vector X will be split into subvectors according to INDEX, and FUN will be applied to each of the subvectors. By default, the result is simplified into an array if possible. Some examples:

x <- seq(1,30,by=1)
b <- rep(letters[1:10],times=3)
data.frame(x,b)

output

  x b
  1 a
  2 b
  3 c
  4 d
  5 e
  6 f
  7 g
  8 h
  9 i
 10 j
 11 a
 12 b
 13 c
 14 d
 15 e
 16 f
 17 g
 18 h
 19 i
 20 j
 21 a
 22 b
 23 c
 24 d
 25 e
 26 f
 27 g
 28 h
 29 i
 30 j
tapply(x,b,sum)

output

 a  b  c  d  e  f  g  h  i  j 
33 36 39 42 45 48 51 54 57 60 
b <- rep(letters[1:10],each=3)
data.frame(x,b)

output

  x b
  1 a
  2 a
  3 a
  4 b
  5 b
  6 b
  7 c
  8 c
  9 c
 10 d
 11 d
 12 d
 13 e
 14 e
 15 e
 16 f
 17 f
 18 f
 19 g
 20 g
 21 g
 22 h
 23 h
 24 h
 25 i
 26 i
 27 i
 28 j
 29 j
 30 j
tapply(x,b,sum)

output

 a  b  c  d  e  f  g  h  i  j 
 6 15 24 33 42 51 60 69 78 87 

The following uses the seed predation data described by Bolker (2008).

datafile <- "https://kingaa.github.io/R_Tutorial/data/seedpred.dat"
seeds <- read.table(datafile,header=TRUE,
                    colClasses=c(station='factor',dist='factor',date='Date'))
x <- subset(seeds,available>0)
with(x, tapply(tcum,list(dist,station),max,na.rm=TRUE))

output

    1  10 100 101 102 103 104 105 106 107 108 109 11 110
10 24 248 122 248  10  17  39  17  46  10   3  17 NA  10
25 18 123  11 249 249  11  11 249  81  11 123 249  4  25
   111 112 113 114 115 116 117 118 119  12 120 121 122 123
10  10  17 248 248  67 248 248  24  10 248  17 248 248   3
25 249  47 249 249  40  68  11  32  18 249  32  11 249  68
   124 125 126 127 128 129  13 130 131 132 133 134 135 136
10 248 248  10  67  17 248 248 248   3 248 248  10  10   3
25 249 249  18  NA  18 249 249  18  18  40  54  18   4  40
   137 138 139 14 140 141 142 143 144 145 146 147 148 149
10 248 248   3 24 248 248 248 248 248 248 248  24 248 248
25  61 249   4 61 249  68  11  NA   4 249 209   4 249  81
   15 150 151 152 153 154 155 156 157 158 159  16 160  17
10  3 248  39 248 248 248  NA  53 248 248  53 248  NA 248
25 11 249 249 249 249 249 249 123 249 249  32  25 249   4
    18  19   2  20  21 22 23  24  25  26 27  28  29  3 30
10 136  24 248 248 248 24 24 248  24 248 53  31  80 31 NA
25 249 123  47  54 249 61 40 249 249 249  4 249 249 11  4
   31  32 33 34  35 36 37 38  39   4  40  41  42 43 44  45
10 NA 248 17 24 248 39 17  3  10  31 248 122 248 NA  3  17
25 NA  18  4  4 159  4 11 NA 249 249   4  NA 123 NA  4 249
   46 47 48 49   5 50  51 52 53  54 55 56 57  58 59  6 60
10 NA NA NA NA 248  3 248  3 10 248  3  3 24   3 NA 10  3
25 11 18  4 11 249 40  18 11 11 249  4 18 11 159 25 11 11
    61 62 63 64  65  66 67 68 69  7 70 71 72  73  74 75  76
10   3  3  3  3 248 248 NA 10 17 10  3 NA 10 248 248 NA 248
25 209  4  4 11 249  11 40 11 61 25  4  4  4 249 159 32 249
   77 78 79   8 80  81 82 83  84  85 86 87 88 89   9 90 91
10  3 10 NA  NA  3 248 80 NA 241 122 24 10 10 24 115 17 NA
25 47 18  4 249 11  NA 11 11  11 249 11 NA 11 11 123 11 11
   92 93 94 95 96 97 98 99
10 17 10  3  3 10 10 10 17
25 25 11  4 11 18 18 11  4

sapply with expected result: vapply

When we could use sapply and we know exactly what the size and class of the value of the function will be, it is sometimes faster to use vapply. The syntax is like that of sapply: vapply(X,FUN,FUN.VALUE,...), where X and FUN are as in sapply, but we specify the size and class of the value of FUN via the FUN.VALUE argument. For example, suppose we define a function that, given a number between 1 and 26 will return the corresponding letter of the alphabet:

alph <- function (x) {
  stopifnot(x >= 1 && x <= 26)
  LETTERS[as.integer(x)]
}

This function will return a vector of length 1 and class character. To apply it to a randomly sampled set of integers, we might do

x <- sample(1:26,50,replace=TRUE)
y <- vapply(x,alph,character(1))
paste(y,collapse="")

output

[1] "ORIFWKBWAMEJBZUMLKRJVEOIKQXEYRNTGAXXXUVQTZIHQGANTI"

Vectorized functions vs loops

As Ligges and Fox (2008) point out, the idea that one should avoid loops wherever possible in R, using instead vectorized functions like those in the apply family, is quite widespread in some quarters. The belief, which probably dates back to infelicities in early versions of S and Splus but is remarkably persistent, is that loops are very slow in R. Let’s have a critical look at this.

Consider the following loop code that can be vectorized:

x <- runif(n=1e6,min=0,max=2*pi)
y <- numeric(length(x))
for (k in seq_along(x)) {
  y[k] <- sin(x[k])
}

To time this, we can wrap the vectorizable parts in a call to system.time:

x <- runif(n=1e6,min=0,max=2*pi)
system.time({
  y <- numeric(length(x))
  for (k in seq_along(x)) {
    y[k] <- sin(x[k])
  }
})

output

   user  system elapsed 
  0.090   0.001   0.092 

We can compare this with a simple call to sin (which is vectorized):

system.time(z <- sin(x))

output

   user  system elapsed 
  0.013   0.008   0.022 

Clearly, calling sin directly is much faster. What about using sapply?


Exercise

Compare the time spent on the equivalent calculation, using sapply. What does this result tell you about where the computational cost is incurred?


The above example is very simple in that there is a builtin function (sin in this case) which is capable of the fully vectorized computation. In such a case, it is clearly preferable to use it. Frequently, however, no such builtin function exists, i.e., we have a custom function of our own we want to apply to a set of data. Let’s compare the relative speeds of loops and sapply in this case.

x <- seq.int(from=20,to=1e6,by=10)
f <- function (x) {
  (((x+1)*x+1)*x+1)*x+1
}
system.time({
  res1 <- numeric(length(x))
  for (k in seq_along(x)) {
    res1[k] <- f(x[k])
  }
})

output

   user  system elapsed 
  0.066   0.000   0.067 
system.time(res2 <- sapply(x,f))

output

   user  system elapsed 
  0.065   0.000   0.065 

[Actually, in this case, f is vectorized automatically. Why is this?]

system.time(f(x))

output

   user  system elapsed 
  0.001   0.000   0.001 

Another example: in this case function g is not vectorized.

g <- function (x) {
  if ((x[1] > 30) && (x[1] < 5000)) 1 else 0
}

system.time({
  res1 <- numeric(length(x))
  for (k in seq_along(x)) {
    res1[k] <- g(x[k])
  }
})

output

   user  system elapsed 
  0.065   0.000   0.064 
system.time(res2 <- sapply(x,g))

output

   user  system elapsed 
  0.067   0.000   0.067 

Acknowledgments

These notes are based in part on materials used in the series of NSF-funded workshops taught in association with the Ecology and Evolution of Infectious Diseases conference. Instructors there included Ben Bolker, Ottar Bjørnstad, Steve Ellner, and Stu Field. Steve Ellner contributed the document that formed the early skeleton of these notes; he based them in turn on course materials by Colleen Webb, Jonathan Rowell, and Daniel Fink at Cornell, Professors Lou Gross (University of Tennessee) and Paul Fackler (NC State University), and on the book Getting Started with Matlab by Rudra Pratap (Oxford University Press). The document also draws on the documentation supplied with R, including the Introduction to R manual. Versions of this document have been used in courses taught by BB and AAK at Florida and Michigan, at workshops run under the auspices of the International Centre for Theoretical Physics in Trieste, and elsewhere.


References

Bolker BM (2008). Ecological Models and Data in R. Princeton University Press. Retrieved from http://ms.mcmaster.ca/~bolker/emdbook/

Burns P (2012). The R Inferno 2nd ed. http://www.burns-stat.com/. Retrieved from http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Chambers JM (1998). Programming with Data. Springer, New York. Retrieved from http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/

Fussman GF, Ellner SP, Sherzer KW, Nelson G. Hairston J (2000). “Crossing the Hopf Bifurcation in a Live Predator-Prey System.” Science, 290, 1358–60.

Ihaka R, Gentleman R (1996). “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics, 5, 299–314.

Ligges U, Fox J (2008). “R Help Desk : How Can I Avoid This Loop or Make It Faster?” R News, 8(1), 46–50. Retrieved from http://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf

R Core Team (2014a). An Introduction to R. Retrieved from http://cran.r-project.org/doc/manuals/r-release/R-intro.html

R Core Team (2014b). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http://www.R-project.org/

Stevens SS (1946). “On the Theory of Scales of Measurement.” Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677.

Venables WN, Ripley BD (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. Retrieved from http://www.stats.ox.ac.uk/pub/MASS4/


Top of this document
Previous page
Home page
Issues
Source code for these notes
CC-BY_NC

Licensed under the Creative Commons Attribution-NonCommercial license. Please share and remix noncommercially, mentioning its origin.