The goal of this document is to make sure you have the relevant R knowledge to follow the courses offered during the BEC master program. What is described below corresponds loosely to what is taught during the bachelor program at UNIL.

The book Getting started with R: an introduction for biologists, whose material overlaps substantially with what follows, can be consulted online once you have a UNIL affiliation.

The document is divided into 5 practicals, with exercises at the end of each. Make sure you try these exercises. If you have tried them and want to see the solutions, send me an email (jerome.goudet.[at].unil.ch) with your attempt at answering them. Good luck!

Setting things up

Prerequisites

What you must first install: R and RStudio

Introduction

Documentation and setting things up

What is R: http://stat.ethz.ch/CRAN/doc/FAQ/R-FAQ.html

R editors:

Why an editor? Syntax highlighting, and it saves you from re-typing commands…

For all systems: the integrated environment RStudio ( http://www.rstudio.com/ ) is highly recommended (i.e., install it if not done yet and use it for this course). We will use R Markdown to write notes and produce documents. A very brief video about RStudio

Installing and enhancing R

  1. Binaries for Windows, macOS, several brands of Linux, and source code available from CRAN
  2. extra packages to extend R: from the same URL. The CRAN package repository features \(> 18,000\) available packages
  3. Also many packages available from github (development versions)
  4. R Home page for further information etc… (bookmark this link!)
  5. Search Engine specific for R
  6. Much on CRAN in forms of tutorial, FAQ, etc…
  7. Books: a lot of R books are available from this web site

Practical 1: Data manipulation in R

Introduction

If you’ve never used R and R-studio before, you can watch this video TO BE UPDATED, create our own video

In this first part, we review data structures in R, and how to import and export data.

You should be able to answer the questions at the end of this section after having read and understood the text below. A piece of advice: type (rather than copy/paste) the examples in R, it helps!

To interact with R, you must enter commands at the prompt

R as a calculator

The most basic way to use R is as a super calculator. In the console window, enter the following commands:

2+3
## [1] 5
2*3
## [1] 6
2^3
## [1] 8
2^{-1}
## [1] 0.5
exp(2)
## [1] 7.389056
log(2)
## [1] 0.6931472
log10(2)
## [1] 0.30103
2e-03
## [1] 0.002
2e+03
## [1] 2000

The basic objects in R are vectors, matrices, lists and “data frames”.

Vectors

R operates on named data structures. The simplest structure is the vector, a single entity consisting of an ordered collection of numbers. For instance,

x.vector<-c(5.1,6.3,7.8,9.3,10.5)

is a vector made up of five numbers. The command above uses the assignment symbol ‘<-’ and the function ‘c’, which combines the five values into a vector. Note that the assignment can also be made in the other direction:

c(5.1,6.3,7.8,9.3,10.5)->x.vector

If an expression is used without being assigned to an object, it is printed to the screen and then lost. For example, the instruction

1/x.vector
## [1] 0.1960784 0.1587302 0.1282051 0.1075269 0.0952381

prints the inverse of the five values, but does not store them.

Vectors can be used in arithmetic expressions, the operations then being carried out on each element of the vector. For instance, look at the result of the 3 instructions below:

x.vector-1
## [1] 4.1 5.3 6.8 8.3 9.5
2*(x.vector-1)
## [1]  8.2 10.6 13.6 16.6 19.0
x.vector^2
## [1]  26.01  39.69  60.84  86.49 110.25

where ‘^’ stands for power

Let’s make a second vector:

y.vector<-c(1.3,3.4,8.1,3.6,5.7)

with the same length as x.vector. Look at the results of the following instructions:

x.vector+y.vector
## [1]  6.4  9.7 15.9 12.9 16.2
x.vector*y.vector
## [1]  6.63 21.42 63.18 33.48 59.85

Note that vectors of different lengths can be used in an operation. In this case, the shorter vector is recycled (its values repeated) until it reaches the length of the longer vector. If the length of the longer vector is not a multiple of that of the shorter one, a warning is emitted:

x2<-c(1,1)
y5<-c(1,2,3,4,5)
x2+y5
## [1] 2 3 4 5 6

Warning message: In x2 + y5 : longer object length is not a multiple of shorter object length

On top of the common operations, you can also use the functions ‘log’, ‘log10’, ‘exp’, ‘sin’, ‘cos’, ‘tan’, ‘sqrt’, etc. ‘range’ gives the minimum and maximum of the vector or matrix:

range(x.vector)
## [1]  5.1 10.5
c(min(x.vector),max(x.vector))
## [1]  5.1 10.5

The number of items in a vector can be obtained using ‘length’:

length(x.vector)
## [1] 5

thus, the valid indices for the elements of this vector are 1:length(x.vector) (: between 2 numbers is an instruction to generate the sequence of integers between the 2 numbers). Elements of a vector can be extracted by specifying the element number between square brackets ‘[]’. For instance, item 3 of x.vector is:

x.vector[3]
## [1] 7.8

and the instruction

x.vector[1:3]
## [1] 5.1 6.3 7.8

extracts the first 3 elements of the vector. Two useful functions are the sum and product of the elements of a vector:

sum(x.vector)
## [1] 39
prod(x.vector)
## [1] 24472.46

Logical vectors

Vectors can be made of logical elements:

c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
c(T,F,T,T,F,T)
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE

where T stands for TRUE and F for FALSE. Beware, it is dangerous to use these shortcuts…
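To see why these shortcuts are dangerous, here is a minimal sketch (the assignment to F below is purely illustrative):

```r
# T and F are ordinary variables that happen to default to TRUE and FALSE,
# so they can be silently overwritten:
F <- 2                 # perfectly legal assignment
c(T, F)                # no longer a logical vector: coerced to c(1, 2)
rm(F)                  # remove our variable; F again means FALSE
# TRUE and FALSE, in contrast, are reserved words and cannot be reassigned.
```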

Character vectors

We can also have character strings

c("This","is","my","first","R","course")
## [1] "This"   "is"     "my"     "first"  "R"      "course"

Sequences

For a sequence of integers, use ‘:’:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

If you want it in steps of 2, type:

2*1:10
##  [1]  2  4  6  8 10 12 14 16 18 20

This works because ‘:’ has higher precedence than “*“. To obtain the reverse sequence:

10:1
##  [1] 10  9  8  7  6  5  4  3  2  1

A more general function to obtain a sequence is the function seq (try ?seq):

seq(-2*pi,3*pi,pi/4)
##  [1] -6.2831853 -5.4977871 -4.7123890 -3.9269908 -3.1415927 -2.3561945
##  [7] -1.5707963 -0.7853982  0.0000000  0.7853982  1.5707963  2.3561945
## [13]  3.1415927  3.9269908  4.7123890  5.4977871  6.2831853  7.0685835
## [19]  7.8539816  8.6393798  9.4247780

And another useful function is rep:

rep(1,10)
##  [1] 1 1 1 1 1 1 1 1 1 1
rep(1:5,2)
##  [1] 1 2 3 4 5 1 2 3 4 5

Two other useful functions are ‘sort’ and ‘order’:

y<-c(3.1,6.3,2.1,4.4,-1.9)
sort(y)
## [1] -1.9  2.1  3.1  4.4  6.3
order(y)
## [1] 5 3 1 4 2
#compare next expression with sort(y)
y[order(y)]
## [1] -1.9  2.1  3.1  4.4  6.3

Matrices

Matrices are obtained through the instruction matrix:

m1<-matrix(1:12,nrow=3)
m1
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

By default (see ?matrix), matrices are filled in columns. If you want to fill them by rows, use:

m2<-matrix(1:12,nrow=3,byrow=T)
m2  
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
(m2<-matrix(1:12,nrow=3,byrow=T)) # adding () around an expression prints it
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

Matrix dimensions are obtained using dim:

dim(m1)
## [1] 3 4

Note that a vector does not have a dimension (although it has a length):

dim(x.vector)
## NULL

You can coerce a vector into a matrix:

m.x.vector<-as.matrix(x.vector)
dim(m.x.vector)
## [1] 5 1

You can stick matrices together by rows or columns:

rbind(m1,m2)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## [4,]    1    2    3    4
## [5,]    5    6    7    8
## [6,]    9   10   11   12
cbind(m1,m2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    4    7   10    1    2    3    4
## [2,]    2    5    8   11    5    6    7    8
## [3,]    3    6    9   12    9   10   11   12

Elements from a matrix are extracted using square brackets:

m1[3,2]
## [1] 6
m1[2,3]
## [1] 8

We can access a full row

m1[2,]
## [1]  2  5  8 11

or column:

m1[,2]
## [1] 4 5 6

We can also remove a row or a column from a matrix:

m1[-2,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    3    6    9   12
m1[,-2]
##      [,1] [,2] [,3]
## [1,]    1    7   10
## [2,]    2    8   11
## [3,]    3    9   12

Elementwise multiplication of two matrices can be carried out:

m1*m2
##      [,1] [,2] [,3] [,4]
## [1,]    1    8   21   40
## [2,]   10   30   56   88
## [3,]   27   60   99  144

And the classic matrix product is obtained using %*%:

m1 %*% t(m2)
##      [,1] [,2] [,3]
## [1,]   70  158  246
## [2,]   80  184  288
## [3,]   90  210  330
t(m1) %*% m2
##      [,1] [,2] [,3] [,4]
## [1,]   38   44   50   56
## [2,]   83   98  113  128
## [3,]  128  152  176  200
## [4,]  173  206  239  272

where t stands for transpose (by the way, why don’t we take the product m1 %*% m2?).
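The answer to that parenthetical question can be checked directly: %*% requires the number of columns of the left matrix to equal the number of rows of the right one. A small sketch, rebuilding m1 and m2 from above:

```r
m1 <- matrix(1:12, nrow = 3)                 # 3 x 4
m2 <- matrix(1:12, nrow = 3, byrow = TRUE)   # 3 x 4
# m1 %*% m2 fails: ncol(m1) is 4 but nrow(m2) is 3
res <- try(m1 %*% m2, silent = TRUE)
inherits(res, "try-error")                   # TRUE: non-conformable arguments
# transposing m2 makes the product conformable:
ncol(m1) == nrow(t(m2))                      # TRUE
```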

Lists

A list in R is an object containing an ordered collection of objects, its components. Objects in a list do not have to be of the same type. A new list is created using list:

(my.list<-list("first element"=T, "second element"="Coucou","third"=c(4,3),"fourth"=9.99))
## $`first element`
## [1] TRUE
## 
## $`second element`
## [1] "Coucou"
## 
## $third
## [1] 4 3
## 
## $fourth
## [1] 9.99

To get to a specific element of a list, double square brackets [[]] are used:

my.list[[2]]
## [1] "Coucou"

Note that to obtain several elements, we use simple square brackets:

my.list[1:3]
## $`first element`
## [1] TRUE
## 
## $`second element`
## [1] "Coucou"
## 
## $third
## [1] 4 3

Last, it is convenient to call an element of a list by its name rather than its position, using $ followed by the first discriminating characters of the element’s name:

my.list$fir
## [1] TRUE

(here, the first 2 letters would have been sufficient)

my.list$s #second element
## [1] "Coucou"

the names of the elements in a list can be obtained with names:

names(my.list)
## [1] "first element"  "second element" "third"          "fourth"

and the internal structure of a list can be obtained with str:

str(my.list)
## List of 4
##  $ first element : logi TRUE
##  $ second element: chr "Coucou"
##  $ third         : num [1:2] 4 3
##  $ fourth        : num 9.99

Data frames

Data frames can be considered as lists of data vectors of the same length, but not necessarily of the same type (contrary to matrices).
The main idea behind data frames is to group the data corresponding to a unit of observation in a single structure. It’s similar in concept to a familiar Excel spreadsheet. Elements of a data frame can be accessed in the same way as those of a list or of a matrix:

a<-1:12
b<-12:1
cc<-letters[1:12] #why cc rather than c?
d<-rep(c(TRUE,FALSE),each=6) #careful, T & F are dangerous, use TRUE and FALSE instead
dataf<-data.frame(A=a,B=b,C=cc,D=d)
dataf
##     A  B C     D
## 1   1 12 a  TRUE
## 2   2 11 b  TRUE
## 3   3 10 c  TRUE
## 4   4  9 d  TRUE
## 5   5  8 e  TRUE
## 6   6  7 f  TRUE
## 7   7  6 g FALSE
## 8   8  5 h FALSE
## 9   9  4 i FALSE
## 10 10  3 j FALSE
## 11 11  2 k FALSE
## 12 12  1 l FALSE
dataf[,2]
##  [1] 12 11 10  9  8  7  6  5  4  3  2  1
dataf$B
##  [1] 12 11 10  9  8  7  6  5  4  3  2  1
names(dataf)
## [1] "A" "B" "C" "D"
str(dataf)
## 'data.frame':    12 obs. of  4 variables:
##  $ A: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ B: int  12 11 10 9 8 7 6 5 4 3 ...
##  $ C: chr  "a" "b" "c" "d" ...
##  $ D: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

Square brackets allow selecting part of a data frame.
The double equal sign == selects elements equal to a criterion. For instance, in data frame dataf, to select the elements of D equal to TRUE:

dataf$D[dataf$D==TRUE]
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

To select the elements of A for which D is equal to TRUE:

dataf$A[dataf$D==TRUE]
## [1] 1 2 3 4 5 6

And to select all the columns of the data frame for which D is TRUE:

dataf[dataf$D==TRUE,] #why the comma? 
##   A  B C    D
## 1 1 12 a TRUE
## 2 2 11 b TRUE
## 3 3 10 c TRUE
## 4 4  9 d TRUE
## 5 5  8 e TRUE
## 6 6  7 f TRUE

Reading data into and saving data from R

Saving (parts of) the R environment

To save your entire environment (all the objects you see in the environment window), enter:

save.image("mydata.RData")

You can save a subset of your environment using save with the names of the objects to save as arguments:

save(obj1,obj2,obj3,file="mydata.RData")

To save a single object, use saveRDS:

saveRDS(myobj,file="myobj.RDS")

Reading data into R from saved R binary files

If you previously saved an R environment with the command save(), you can load it using:

load("mydata.RData")

If you saved an object using saveRDS, you can load it with:

myobj<-readRDS("myobj.RDS")

Reading data into R from text files

An extremely useful command is read.table, which loads a text file into the environment and associates it with a data frame.

If, for instance, you want to read the content of file ‘myfile.txt’ into R, you just need to type the command:

my.data<-read.table("myfile.txt",header=T)
my.data

For the command to work, of course, the file needs to be in the working directory. To set the working directory, use the command setwd("d:/mydir") (note the forward slash /, not a backslash \), and to locate the current working directory use ‘getwd()’.
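A quick sketch of these commands (the directory path is just an example, and the setwd call is commented out so it changes nothing):

```r
getwd()               # where R currently looks for files
# setwd("d:/mydir")   # change the working directory (example path)
list.files()          # the files R can see in the working directory
```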

If the file is on the internet, you can just enter the URL. For instance, if your file is at URL:

http://www.unil.ch/popgen/teaching/R11

You can load it with read.table using the full URL of the file:

my.data<-read.table("http://www.unil.ch/popgen/teaching/R11/myfile.txt",
header=T)
head(my.data)

Sometimes, for very large data sets, read.table is very slow because it attempts to guess the type of each column. If the data are all of the same type, you might want to use scan (see ?scan) instead of read.table.
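A minimal sketch of scan, using a temporary file so it runs anywhere:

```r
# write a small numbers-only file, then read it back with scan
tmp <- tempfile(fileext = ".txt")
writeLines(c("1.5 2.5 3.5", "4.5 5.5 6.5"), tmp)
v <- scan(tmp)                          # a single numeric vector, not a data frame
v
m <- matrix(v, ncol = 3, byrow = TRUE)  # reshape if the row/column layout matters
m
unlink(tmp)                             # remove the temporary file
```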

Writing Data to text files

Saving a data frame as a text file:

write.table(my.data,"mydata.txt",row.names=FALSE,quote=FALSE,sep="\t")

Questions

Stars preceding the questions indicate their difficulty: the more stars, the harder the question.


Answer questions 1 to 10 below:

First, clean your R environment by issuing the instruction rm(list=ls())

  1. (*) Why should I avoid calling objects c, T or F?

  2. (*) For the following numbers 7.3, 6.8, 0.005, 9, 12, 2.4, 18.9, 0.9

    1. Find their mean,
    2. subtract it from all the numbers,
    3. get their square root,
    4. print the numbers that are larger than their square root
  3. (*) use seq and rep to generate the following vectors:

    1. 1; 2; 3; 4; 1; 2; 3; 4; 1; 2; 3; 4; 1; 2; 3; 4;
    2. 10; 10; 10; 9; 9; 9; 8; 8; 8; 7; 7; 7;
    3. 1; 2; 2; 3; 3; 3; 4; 4; 4; 4; 5; 5; 5; 5; 5;
    4. 1; 1; 3; 3; 5; 5; 7; 7; 9; 9;
  4. (**) Create the following matrix, as simply as possible:

 3 1 1 1
 1 3 1 1
 1 1 3 1
 1 1 1 3

4a. (*) Multiply this matrix elementwise by itself. Then compute its matrix product with itself.

  1. (**) Solve the following system (have a look at ?solve)
 2x + 3y - z = 4
 5x - 10y + 2z = 0
 x + y - 4z = 5
  1. (*) create a list mylist with the following elements:

    • numbers 1 to 5. call it nb1
    • numbers 1 to 50 by 0.375 steps. call it nb3
    • the string “My mummy told me”. call it char1
    • A vector made of the strings “apples”, “bananas”, “pears”. call it char2
    • The boolean vector FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE . Call it bool
  2. (* / **) What’s the list’s length? calculate the mean of the first element; of the second; of the first and second

  3. (**) paste the third and fourth element of the list (use paste); Which instruction would give the sentence “My mummy told me: apples, bananas, pears” (be careful, punctuation is important!)

  4. (*) Create a data frame with the data contained in file parus1.txt located at http://www2.unil.ch/popgen/teaching/R11

  5. (*) Create a vector data.fly containing the observations from alldat for which treat is equal to “mouche”. Add to data the element: “puce”, 12.7300

  6. (**) Try loading the file alien.txt (from http://www2.unil.ch/popgen/teaching/R11). What is wrong with this file? Fix it.

  7. (**) Try loading the contents of the file class2005.xls. What needs to be done before you can load it into R?


Practical 2: Basic graphing functions, scatter plot and regression

Introduction

Here we will discover the different possibilities R offers for plotting data, and we will also refresh the notions of variance, covariance, correlation and regression.

Graphs

R can do scatter plots, histograms, pie charts and bar plots, boxplots and violin plots.

If you are not using RStudio, before anything else, you need to open a graphics window. On Windows, use windows(); on macOS, use quartz(); on Unix/Linux, use X11().

scatter plots

We will start with the versatile plot function.

Plots can also be sent directly to specific files, without being printed on screen. Available devices are pdf(), postscript(), png() and jpeg(), among others. Once you open such a device, you need to close it using the function dev.off(). pdf and postscript files can have multiple pages, but not the others.
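For instance, a sketch sending a plot straight to a PDF file (the file name is just an example):

```r
pdf("sine.pdf", width = 6, height = 4)  # open the pdf device
x <- seq(-pi, pi, 0.01)
plot(x, sin(x), type = "l")
dev.off()                               # close the device; the file is written now
file.exists("sine.pdf")                 # TRUE
file.remove("sine.pdf")                 # tidy up
```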

For drawing any function, use plot:

x<- -10:10
plot(x,sin(x))

What is the difference between the previous command and plot(sin(x))?

The plot instruction accepts numerous arguments (?plot for more). For instance, one can give labels to the axes, a title, the printing color and symbols, etc.:

x<-seq(-10,10,0.01) #more on seq soon
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),ylab="f(x)")

x<-seq(0,4,0.1)
plot(x,exp(-(x^3+2*x^2-3*x+1)),ylab="exp(f(x))")

x<-seq(0,4,0.01)
plot(x,exp(-(x^3+2*x^2-3*x+1)),ylab="exp(f(x))",col="red")

plot’s default is to use points. Using the type argument, we can specify whether we want points and lines (type="b"), a line (type="l"), or nothing (type="n"):

x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="b")

plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")

plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="n")

If you want to see the 2 graphs in one window with two panels, you can specify the number of panels to show, using the par function (see ?par):

par(mfrow=c(1,2)) #split graphic window in 1 row and 2 columns
x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12))
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")

par(mfrow=c(1,1))#to go back to one panel

We can also add several functions to the same graph. After first using plot, which defines the first thing to be printed, the axis lengths, labels, etc., you can use the functions lines, points, text or segments:

x<-seq(0,2,0.05)
ymax<-10
plot(x,x,ylim=c(0,ymax),ylab="f(x)")
lines(x,x^0.5,col="blue")
points(x,x^2,col="red")
text(x,x^3,labels="3",col="green")
segments(0.25,4,1.25,8,col="purple",lwd=2)

Note that several calls to plot will erase each other in turn.

plot is of course useful for showing a scatter plot of 2 variables, but it can also be used for a single vector:

a<-rnorm(100);
plot(a)
b<-which(abs(a)>1.96)
#now create a vector specifying a red color for values
#larger than 1.96
coul<-rep("black",100);
coul[b]<-"red"
points(a,col=coul) #print these points in red
lines(a,type="h",col=coul) #type="h" draws a vertical bar from the x axis

histograms and boxplots

Other useful plots are histograms (?hist) and boxplots (?boxplot):

hist(a)

hist(a,breaks=-40:40/10)

boxplot(a)

Boxplots are often used to compare groups of observations:

a1<-rnorm(100)
b1<-rnorm(100,mean=2)
c1<-rnorm(100,sd=3)
boxplot(a1,b1,c1)

Bar plots and pie charts

For categorical data, we use barplots or piecharts:

categ<-rep(c("white","black","red"),c(20,20,40))
#We first need to make a count of the number of each color, using table:
tcateg<-table(categ)
barplot(tcateg,col=c("black","red","white"))

pie(tcateg,col=c("black","red","white"))

The list of colors predefined in R is given by colors().

For a tour of R graphics ability, type:

demo(graphics)

Variance, covariances, correlations

Variances, covariances and correlations are obtained using var, cov and cor respectively. These functions can be used to obtain the correlation between 2 vectors, or among all variables (columns) of a data frame:

cor(x,y);
cor(d);

These functions don’t like missing data. If you have missing data, you need to specify how to handle them. You can either remove every row containing a missing observation (use="complete.obs"), or eliminate missing values only when they turn up in a calculation (use="pairwise.complete.obs"). Compare the results of the following instructions:

x<-c(1,2,3,4,NA); y<-c(1,2,3,4,-100); z<-c(4,3,2,1,10)
d<-data.frame(x,y,z)
cor(d, use="complete.obs")
##    x  y  z
## x  1  1 -1
## y  1  1 -1
## z -1 -1  1
cor(d, use="pairwise.complete.obs")
##    x          y          z
## x  1  1.0000000 -1.0000000
## y  1  1.0000000 -0.9561118
## z -1 -0.9561118  1.0000000

These functions return a single element (a number or a matrix). More complex functions, for instance lsfit (?lsfit) or lm (?lm), return a list of elements. To access each element individually, use $. For instance:

fit<-lsfit(x,y)
fit$coef
lm(y~x)$coef

To get the names of all the elements of an object, use names().

A word on lsfit. It estimates, by least squares, the slope and intercept of a regression line \(Y=a+bX\).
The intercept \(a\) is the value of \(Y\) for which \(X=0\). The slope \(b\) expresses by how many units \(Y\) varies when \(X\) changes by one unit.
We’ll see lm in more detail later
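A small worked example with simulated data (the numbers below are made up for illustration):

```r
set.seed(1)                             # make the simulation reproducible
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)  # true intercept 2, true slope 0.5
fit <- lsfit(x, y)
fit$coef                                # estimated intercept and slope
lm(y ~ x)$coef                          # the same estimates from lm
names(fit)                              # the other elements lsfit returns
```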

Also note that the package ggplot2:

#if package not yet installed (see practical 7)
install.packages("ggplot2")
library(ggplot2) #?ggplot for help 

has yet many more options and capabilities.

Last, a book specifically about R graphics: Paul Murrell (2006) R Graphics.


Questions

Stars preceding the questions indicate their difficulty: the more stars, the harder the question.

  1. (*) Load the file http://www.unil.ch/popgen/teaching/R11/data_etudiants_biol_0506.txt
    • Rename columns with more meaningful names (e.g. sex, height, weight, shoesize etc…)
    • Produce a bar chart of the proportion of each sex and the proportion of the different hair colors; produce a bar chart of the proportion of smoking men, non smoking men, smoking women, non-smoking women (use barplot; hint: you might need to use the table function)
    • Produce a piechart of the proportion of eye colors
    • Produce a boxplot of sizes as a function of sex
    • (**) Plot the scatterplot of weights against heights, using the sex of individuals as label, and labelling your axes. Use a different color for each sex. Explore the different arguments of the plot function: add a title, play with the size of characters. Try adding the regression lines for males and for females to the graph, using a different color for each sex. From this graph, would you say that the relation between weight and height differs between the two sexes? Which statistical test would be appropriate to test this hypothesis?
    • Produce a box plot of weight as a function of the smoking status and sex
    • (**) Produce a two panels graph (a panel for males and the other for females) of the scatterplot of weight versus height. Differentiate smokers from non smokers using a color argument
    • (**) If you have the time and the inclination, explore the package ggplot2 and find the commands necessary to produce the previous plots with it.
  2. (*) What is the correlation between weight and size in men? women? And the covariance?
  3. (*) load the file http://www.unil.ch/popgen/teaching/R11/scatt.txt in a four panels windows (2 times 2) draw the scatterplot of \(y_1\),\(y_2\),\(y_3\) and \(y_4\) as a function of \(x\). what relation between the ys and x do they show? What is the correlation between \(x\) and \(y_1\), \(y_2\), \(y_3\) and \(y_4\)?
  4. (**) Use abline to add a line that best predict the different ys as a function of \(x\) (you’ll have to redraw the preceding graphs). Use the lsfit or the lm function , and its element coeff to obtain the least square/linear model best fit.
  5. (*) what are the 10th and 90th percentile of \(y_5\)? plot the scatter diagram of \(y_5\) versus \(x\). What type of relation exists between the 2? use lsfit and abline to produce the best linear fit line.
  6. (*) Store the predicted value of \(y_5\) in a vector y5.pred. what are the residuals of the regression of \(y_5\) on \(x\)? store this in y5.res
  7. (*) plot the residuals as a function of x, and comment the graph.
  8. (*) do questions 6 & 7 for variable \(y_4\). Is showing residuals as a function of x useful? Why?

Practical 3: Functions in R

Introduction

The material described here is slightly more advanced, and may be skipped on a first read.

R has several built-in functions (mean, var, cor, plot, lm, …), and many more can be downloaded as packages. But you might (will) need to write your own functions to carry out repetitive tasks. The basic concepts are very similar to those you’ve learned for other programming languages, such as C++, Java or Fortran. This practical should give you a first brief feel for function writing in R.

General structure of a function:

    my.function<-function(x,y){
    instruction 1
    instruction 2
    .
    .
    .
    instruction n
    }

Once it has been typed in, it is used as my.function(a,b)

where a and b are the two arguments to pass to the function. An example will clarify things:

    my.mean<-function(a){
      #this function will calculate the mean of a
    return(sum(a)/length(a))
    }
#
my.mean(1:10) # my.mean will take argument a to be the sequence 1:10
## [1] 5.5
#    
    my.var<-function(a){
    n<-length(a)
    m<-sum(a)/n
    va<-sum((a-m)^2)/n
    return(va)
    }
#
    my.var(1:10)
## [1] 8.25

In this function, return(va) on the last line returns va. You will notice that this function does not give the same result as the built-in var function. Why? Looking at the help for var gives it away: the built-in var produces an unbiased variance, that is, the divisor is (n-1) rather than n. We could rewrite our function as:

    my.var.ub<-function(a){
    n<-length(a)
    m<-sum(a)/n
    va<-sum((a-m)^2)/n
    va<-n/(n-1)*va
    return(va)
    }
#
    my.var.ubb<-function(a){
    n<-length(a)
    va<-my.var(a)*n/(n-1)
    return(va)
    }

Conditional statement

Ideally, our function should let us choose whether we want the variance with or without bias. We can use an if instruction in the body of the function to tell it what to return:

  my.var.n<-function(a,bias=TRUE){
    n<-length(a)
    m<-sum(a)/n
    va<-sum((a-m)^2)/n 
    if (!bias) {
    va<-va*n/(n-1)
    }
    return(va)
    }
#    
    my.var.n(1:10)
## [1] 8.25
    my.var.n(1:10,bias=FALSE)
## [1] 9.166667
    var(1:10)
## [1] 9.166667

In this function, bias is a boolean. By default, the biased variance is calculated; the if statement is then evaluated, and the unbiased variance is calculated if bias is FALSE (! stands for negation). If we want the biased variance, we don’t need to tell R anything, since the default for bias is TRUE.

Control structures

The control structures for R are:

    if (cond) expr      (If cond is true, do expr)

    if (cond) expr1 else expr2 (if cond is TRUE, do expr1 else do expr2)

    if (cond) {cons.expr1; cons.expr2} else {alt.expr1; alt.expr2} (same as before
    but with several expression for each situation. Note the curly brakets 
    around the expressions)

    for(i in seq) expr (do expr as many times as there are elements in seq):
    x<-1:10
    for (i in 1:10) x[i]<-i+1 #2:11
    x #would have been easier and quicker to enter x<-x+1
##  [1]  2  3  4  5  6  7  8  9 10 11
while(cond) expr (do expr as long as cond is true)

repeat { expr; if (cond) break } (do expr as long as cond is FALSE, therefore as
long as we do not enter the if statement. Beware, break is a necessity
if you don’t want to end up in an infinite loop)

To print the square of the 20 first integers, we could write:

   for (i in 1:20) {
    cat(i,i^2,"\n")
    }
## 1 1 
## 2 4 
## 3 9 
## 4 16 
## 5 25 
## 6 36 
## 7 49 
## 8 64 
## 9 81 
## 10 100 
## 11 121 
## 12 144 
## 13 169 
## 14 196 
## 15 225 
## 16 256 
## 17 289 
## 18 324 
## 19 361 
## 20 400
#for details on cat, ?cat. "\n" is the "return" character.

To print sequentially the partial sums of the first 10 integers:

    i<-0;j<-0;
    repeat{
    i<-i+1; j<-j+i;
    cat(i,j,"\n");
    if (i>=10) {
    break}
    }
## 1 1 
## 2 3 
## 3 6 
## 4 10 
## 5 15 
## 6 21 
## 7 28 
## 8 36 
## 9 45 
## 10 55
#
i<-0;j<-0;
while(i<10){
i<-i+1;j<-j+i;cat(i,j,"\n");
}
## 1 1 
## 2 3 
## 3 6 
## 4 10 
## 5 15 
## 6 21 
## 7 28 
## 8 36 
## 9 45 
## 10 55

Compare the result with cumsum(1:10).

Note: the ; allows putting several expressions on the same line.

More information on writing functions can be found in several online tutorials:

http://cran.r-project.org/doc/manuals/R-intro.html#Writing-your-own-functions

http://zoonek2.free.fr/UNIX/48_R/02.html

http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf


Questions

Stars preceding the questions indicate their difficulty: the more stars, the harder the question.

  1. (*) Obtain the mean, median, variance and mad of the following observations: 50; 60; 10; 20; 10; 40; 30; 800; 30; 40; 10

  2. (*) Redo the calculations without the outlier. Conclusions?

  3. (**) The skewness of a distribution is defined as \(1/n \sum_{i=1}^n [(Y_i-\bar{Y})/s]^3\), where \(Y_i\) are the observations, \(\bar{Y}\) their mean, \(n\) the number of observations and \(s\) their standard deviation. Write a function my.skewness that calculates the skewness of any numeric vector, and estimate the skewness of the data in questions 1 and 2.

  4. (***) Write your own functions to calculate the median and the MAD, and apply them to the data from questions 1 and 2 [Reminder: the median is the value that splits the data into 2 equal parts. If the number of data points \(n\) is odd, the median is the \((n+1)/2\)-th largest datum; if \(n\) is even, it is the mean of the \(n/2\)-th and \((n/2+1)\)-th largest. Finally, MAD = med(abs(x(i)-med(x)))]

  5. (*) Load the file class2005.xls from http://www.unil.ch/popgen/teaching/R11/class2005.xls . How many non-smoking students are in this data set? How many different hair colors?

  6. (*) Draw a histogram of sizes. Are the different size classes equifrequent?

  7. (**) Plot the empirical cumulative distribution of sizes (use plot and sort). Using this plot, estimate the 30th and 70th percentile of the distribution of sizes.

  8. (*) Use a histogram, and qqnorm (?qqnorm) with qqline, to check whether sizes are normally distributed. Which is better? How would you test for normality?

  9. (*) Use quantile to validate your approximation for question 6

  10. (*) What’s the mean weight? estimate it in kg then in gr (you don’t need to use R to estimate it in gr).

  11. (*) What’s the weight variance in kg^2? in gr^2? (you don’t need to use R to estimate it in gr^2)

  12. (*) rescale the variable weight so that it has mean 0 and sd 1. Why is this transformation useful? what is the 2.5th and 97.5th percentile of the rescaled weight?


Practical 4:Statistical distributions and functions in R

Introduction

Some of the material covered in this practical is slightly more advanced (marked with a star), and can be skipped on a first read.

R has built-in functions to calculate the probability density, cumulative distribution, and inverse cumulative distribution (quantile) for several standard probability distributions. Furthermore, R can generate random values from these distributions. There are 4 functions for each distribution, starting with the following letters:

   1. d: for density, for instance 'dnorm' for the probability density of the normal distribution
   2. p: for probability, e.g. 'pnorm';
   3. q: for quantile, e.g 'qnorm';
   4. r: for random, e.g. 'rnorm'; 
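For instance, the four functions for the normal distribution (with the default mean 0 and sd 1) fit together like this:

```r
dnorm(0)    # density (height of the bell curve) at 0: 1/sqrt(2*pi), about 0.399
pnorm(0)    # probability of drawing a value below 0: 0.5
qnorm(0.5)  # quantile: the value below which half of the mass lies: 0
rnorm(3)    # three random draws from this distribution
```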

Common statistical distributions

The classical distributions are:

  • normal (see ?pnorm ): central to all stats. Takes the mean and standard deviation as parameters.
  • binomial (?pbinom): takes 2 parameters, n the number of trials and p, the probability of a success.
  • Poisson (?ppois): determined by a single parameter lambda.
  • geometric (?pgeom)
  • uniform (?punif)
  • chi-square (?pchisq)
  • F (?pf)
  • t (?pt) …

These are the most common, but several others are also available:

?Distributions

and even more here:

http://cran.r-project.org/web/views/Distributions.html

Let’s take 1000 random numbers from a normal with mean 0 and sd 1 (the default parameters):

a<-rnorm(1000)

and plot a histogram of this vector:

hist(a)

qqnorm(a);qqline(a) #best way to verify normality!

Do the same, but with mean 4 and sd 5:

b<-rnorm(1000, mean=4, sd=5)
hist(b)

qqnorm(b);qqline(b)

pnorm gives the probability that a normal deviate is less than a given number x, for instance -1.96:

pnorm(-1.96)
## [1] 0.0249979
pnorm(1.96) #why?
## [1] 0.9750021

If a random variable follows a standard normal distribution, it has a 2.5% chance of being smaller than -1.96 (and, since this distribution is symmetrical, it also has a 2.5% chance of being larger than 1.96).

With qnorm, you do the reverse: what is the value of the normal deviate for which I have a 2.5% chance of finding a lower deviate:

qnorm(0.025)
## [1] -1.959964
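Note that pnorm and qnorm are inverses of each other, and the symmetry of the normal links the two tails:

```r
qnorm(pnorm(-1.96))  # recovers -1.96
pnorm(qnorm(0.025))  # recovers 0.025
1 - pnorm(1.96)      # upper tail: same as pnorm(-1.96), by symmetry
```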

Now try:

mean(a)
## [1] -0.01518839
sd(a)
## [1] 1.010412
b<-rnorm(1000,20,2)
mean(b)
## [1] 20.0048
sd(b)
## [1] 2.111253

Some examples with other distributions:

rbinom(30,50,0.5) #30 random deviates from a binomial with n=50 and p=0.5
##  [1] 27 16 23 25 29 21 25 25 29 26 26 20 25 23 22 30 26 23 24 24 28 24 27 20 22
## [26] 27 18 16 27 26
rpois(30,2.4) # 30 random deviates from a Poisson with lambda=2.4
##  [1] 4 1 3 2 4 2 2 2 5 2 0 3 6 4 1 4 3 1 2 2 1 3 5 2 3 1 2 3 5 0
dpois(0,2.4) # Probability to observe 0 if the underlying distribution is a poisson with param 2.4
## [1] 0.09071795
plot(0:10,dpois(0:10,2.4),type="h")

Distribution of p-values (*)

A particularly interesting distribution is that of p-values. Under the null hypothesis, it can be shown that p-values follow a uniform distribution. With big data now available everywhere, we are often confronted with very large numbers of p-values, and, by chance, some will fall below the standard 5% threshold. How do we decide which p-values are really outliers?

pval<-runif(100000)
hist(pval,breaks=seq(0,1,0.05))

Out of 100000 p-values, 4897 are significant at the 5% level. Is there something to worry about?

n<-length(pval)
par(mfrow=c(1,2))
plot((1:n)/n,sort(pval),xlab="theo. perc.",ylab="obs. perc.");abline(c(0,1))
plot(-log10((n:1)/n),sort(-log10(pval)),xlab="theo perc. (-log10)",ylab="obs. perc.(-log10)");abline(c(0,1))

now imagine that 5 of the p-values are really, really small, less than \(10^{-6}\) say:

pval[1:5]<-runif(5)/1000000
par(mfrow=c(1,2))
plot((1:n)/n,sort(pval),xlab="theo. perc.",ylab="obs. perc");abline(c(0,1))
plot(-log10((n:1)/n),sort(-log10(pval)),xlab="theo perc. (-log10)",ylab="obs. perc. (-log10)");abline(c(0,1))

The outlying values now appear clearly on the -log10 (right) panel, but not on the untransformed panel. These types of plots are commonly used to identify outlying p-values.

The sample function

Another useful function is sample. It allows sampling with or without replacement from a vector (see ?sample). This function is at the heart of randomization and bootstrap tests.

sample(1:10,replace=FALSE) #permutations of the elements
##  [1]  6  7 10  1  5  2  8  4  3  9
table(sample(0:20,size=20,replace=TRUE)) #20 random draws of numbers between 0 & 20 with replacements
## 
##  0  1  2  4  5  6  7 12 13 15 17 20 
##  2  1  2  2  1  3  3  1  1  2  1  1
head(x1<-sample(1:0,size=1000,replace=TRUE,prob=c(0.2,0.8))) 
## [1] 0 1 0 0 0 0
#similar to, but much slower than 
head(x2<-rbinom(1000,size=1,prob=0.2))
## [1] 0 1 0 0 0 0
mean(x1);mean(x2) 
## [1] 0.197
## [1] 0.206

Questions

Stars preceding questions indicate their difficulty: the more stars, the more difficult the question.

Questions
  1. (**) sample without replacement 250 numbers between 0 and 1000. Draw a histogram of these numbers. Think of a graphical way of verifying if these numbers are coming from a uniform distribution (hint: have a look at qqplot). Can you think of a way for testing this? (more on this in the next practical)

  2. (*) Imagine that the size of the male population in Switzerland follows a normal distribution with mean 172cm and sd 30cm.

    1. What is the probability that an individual from this population is 1.60 m or less?
    2. 180 cm or less?
    3. 190 cm or more?
    4. 200 cm or more?
    5. What proportion of the male population should be between 1.5 and 1.80 meters tall?
  3. (*) The greater white-toothed shrew, Crocidura russula, is a small anthropophilic insectivore living around here, and has been studied by research groups in Lausanne. The population size is stable, meaning that the mean number of surviving offspring per female is 2. Suppose that the number of surviving offspring follows a Poisson distribution. What is the probability that a female has no surviving offspring? One surviving offspring? 4 or more? (beware, there is a small trap here).

  4. (*/ **) In several species, the probability p of dying is independent of age. The distribution of mortality ages follows a geometric distribution \(P(X=x)=p(1-p)^x\). We are interested in a population where the probability of dying during a year is 0.15. You can use the geom (e.g pgeom, rgeom, dgeom, qgeom) function to answer the following:

    1. In this population, what is the probability of dying at 20?
    2. What is the probability of dying before 2?
    3. What is the probability of attaining 30?
    4. What is the mean age? (hint: think about expected value)
    5. What is the variance of age?

The following exercise is a good way of demonstrating the central limit theorem

  1. (*) Imagine that we are rolling a die. Using the function sample, generate 1000 die rolls and store them in a vector x1. How are these data distributed?

Now, roll two dice simultaneously 1000 times, and store the sum of the 2 displayed numbers in a vector x2. How are these data distributed? Do the same for 5 dice (in vector x5) and 20 dice (in vector x20). Plot on the same figure the normal quantile-quantile plots (qqnorm) obtained from these 4 experiments. Conclusions?

[optional] You can do the same for other distributions, for instance, the geometric distribution we have seen in the previous paragraph

The following two exercises will show you how to use R to do simulations. To solve them, start by thinking about what type of information you’d like the function to produce, and then think about ways of generating it. You’ll have to create vectors or matrices and use loops.

  1. (***) Write a function that simulates the evolution of the size of a finite hermaphroditic population with non-overlapping generations under demographic stochasticity. We will suppose that fertility follows a Poisson distribution. This function should take as arguments the initial population size, the fertility, the number of generations over which you want to follow population size, and the possibility of plotting the graph of the census size as a function of generation. It should return the population sizes through time.

  2. (***) Write a function drift that simulates the effect of random genetic drift. It should take as parameters the population size (fixed), the number of generations, the initial allele frequency, and a number of replicates. It should produce a graph of the changes through time of allele frequencies in the different replicates.


Practical 5: Statistical tests in R

Introduction

The questions at the end of this section should help you evaluate whether you also need to refresh your statistical knowledge. Answering them involves two steps: first, identifying the statistic and/or statistical test to apply; second, computing the statistic and/or applying the test using R. Each can be answered with a standard statistical test you should have learnt about during your bachelor program. The way to run these tests with R is described below.

R has built-in functions for carrying out the typical statistical tests. We will go through them in this practical.

We’ll use the following data set to illustrate some of these functions:

dat<-read.table("http://www.unil.ch/popgen/teaching/R11/class2005.txt",header=TRUE)

Categorical data

Chi-square tests:

chisq.test performs both chi-squared contingency table tests and goodness-of-fit tests.

Usage:

     chisq.test(x, y = NULL, correct = TRUE,
                p = rep(1/length(x), length(x)), rescale.p = FALSE,
                simulate.p.value = FALSE, B = 2000)

Note: if you have 2 vectors, each containing the state of the observations (for instance, eye and hair colours), you can obtain the contingency table using the table function (?table). Another useful function is xtabs.
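A minimal sketch (the colour vectors below are simulated for illustration, not taken from the class data set):

```r
set.seed(1)
eyes <- sample(c("blue", "brown", "green"), 100, replace = TRUE)
hair <- sample(c("dark", "fair"), 100, replace = TRUE)
tab <- table(eyes, hair)  # 3x2 contingency table of counts
tab
chisq.test(tab)           # test of independence between the two traits
```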

fisher.test performs Fisher’s exact test for count data: it tests the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.

     fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
                 control = list(), or = 1, alternative = "two.sided",
                 conf.int = TRUE, conf.level = 0.95,
                 simulate.p.value = FALSE, B = 2000)
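With small counts, where the chi-square approximation becomes unreliable, the same kind of table can be passed to fisher.test. A toy 2x2 example (made-up counts):

```r
# made-up counts: rows are groups, columns are outcomes
m <- matrix(c(8, 1, 2, 5), nrow = 2,
            dimnames = list(group = c("treated", "control"),
                            outcome = c("better", "worse")))
fisher.test(m)
```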

Comparing distributions

For comparing two empirical distributions, or an empirical distribution to a theoretical one

Kolmogorov-Smirnov tests:

     ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"),
             exact = NULL)
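For instance, comparing a simulated sample to the theoretical uniform distribution, and two samples to each other (a sketch):

```r
set.seed(2)
u <- runif(200)
ks.test(u, "punif")  # one-sample: compare u to the uniform cdf
v <- rnorm(200)
ks.test(u, v)        # two-sample: compare the two empirical cdfs
```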

shapiro.test for testing normality, if you really need to

shapiro.test(x)

But most of the time, a qqnorm is preferable:

a<-rnorm(1000) #from normal dist
qqnorm(a);qqline(a) 

b<-rlnorm(1000) # from log normal dist
qqnorm(b);qqline(b)


Comparing two samples

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
t.test(x,mu=0)
t.test(x,y)
t.test(x,y,paired=T)

examples:

  x<-rnorm(30)
  y<-rnorm(30,mean=0.6)
  t.test(x)
## 
##  One Sample t-test
## 
## data:  x
## t = -0.69441, df = 29, p-value = 0.493
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.5065638  0.2497694
## sample estimates:
##  mean of x 
## -0.1283972
  t.test(y)
## 
##  One Sample t-test
## 
## data:  y
## t = 3.0188, df = 29, p-value = 0.005246
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.193342 1.005638
## sample estimates:
## mean of x 
## 0.5994898
  t.test(x,y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -2.6826, df = 57.707, p-value = 0.009513
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2710861 -0.1846879
## sample estimates:
##  mean of x  mean of y 
## -0.1283972  0.5994898

Wilcoxon tests (the non-parametric equivalent of t-tests):

wilcox.test(x,y)
wilcox.test(x,y,paired=T)
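A quick sketch on two simulated samples:

```r
set.seed(3)
x <- rnorm(30)
y <- rnorm(30, mean = 0.6)
wilcox.test(x, y)                 # unpaired: compares the two distributions
wilcox.test(x, y, paired = TRUE)  # paired: tests the differences x - y
```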

Simple bootstrap percentile confidence intervals:

There are several functions in the boot package, but a bootstrap is also simple to program yourself. For instance, for a confidence interval of the mean:

bootstrap<-function(x,perc=c(0.025,0.975),nboot=1000){
bm<-vector(length=nboot) #vector where bootstrap samples are stored
n<-length(x)
for (ib in 1:nboot){
dum<-sample(n,replace=T)  #sampling with replacement of the observed vector
bm[ib]<-mean(x[dum],na.rm=T)  #bootstrap sample i of the mean
}
return(list(ci=quantile(bm,probs=perc),bm=bm))
}
#example
x<-runif(1000,min=500,max=1000)
bx<-bootstrap(x);
bx$ci;
##     2.5%    97.5% 
## 737.8967 755.4608
par(mfrow=c(1,2))
hist(x)
hist(bx$bm)


One-way ANOVA:

oneway.test(y~A,var.equal=TRUE)
aov(y~A)
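A minimal sketch with a simulated response and a three-level factor; oneway.test and summary(aov(...)) give the same F statistic here:

```r
set.seed(4)
y <- rnorm(90, mean = rep(c(1, 2, 3), each = 30))
A <- factor(rep(1:3, each = 30))
oneway.test(y ~ A, var.equal = TRUE)
summary(aov(y ~ A))  # same F test, presented as an ANOVA table
```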

testing homogeneity of variance:

bartlett.test  #strongly depends on the normality assumption

Levene test: less sensitive to the normality assumption, hence preferred, but has to be programmed. The idea is simple: if the variances are heterogeneous, this should show up in an ANOVA on the absolute values of the residuals:

set.seed(12)    #to make sure that you obtain the same results as me
y1<-rnorm(30,mean=2,sd=2)
y2<-rnorm(30,mean=4,sd=10)
y3<-rnorm(30,mean=6,sd=2)
y<-c(y1,y2,y3)
a<-factor(rep(1:3,each=30))  #factor will be further explained below
par(mfrow=c(2,2))
boxplot(y~a)
res<-unlist(tapply(y,a,function(x) x-mean(x))) #tapply is very useful, as are apply and lapply
#unlist transforms a list into a vector (provided the elements of the list
#are of the same type)
boxplot(res~a)
boxplot(abs(res)~a) #see the difference in mean/median among the groups ?
summary(aov(abs(res)~a))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## a            2  643.1   321.5   32.08 3.66e-11 ***
## Residuals   87  872.0    10.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The car library has a function for this, leveneTest (?leveneTest)


Permutation tests (distribution-free ANOVA)

anova.perm<-function(y,x,nperm=100,var.equal=TRUE)
{
obs.f<-oneway.test(y~x,var.equal=var.equal)$statistic
bf<-numeric(length=nperm)
bf[nperm]<-obs.f
for (i in 1:(nperm-1)){
bf[i]<-oneway.test(y~sample(x),var.equal=var.equal)$statistic
}
return(list(p.val=sum(bf>=bf[nperm])/nperm,bf=bf))
}
#example
y<-rnorm(100)
x<-factor(rep(1:4,each=25))
oneway.test(y~x,var.equal=TRUE)
## 
##  One-way analysis of means
## 
## data:  y and x
## F = 0.65479, num df = 3, denom df = 96, p-value = 0.5819
res1<-anova.perm(y,x)
res1$p.val
## [1] 0.61

The non-parametric equivalent of one-way ANOVA is the Kruskal-Wallis test:

kruskal.test(y~A)
kruskal.test(list)
kruskal.test(y,A)
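For instance, on the same kind of simulated data as in the ANOVA section (a sketch):

```r
set.seed(5)
y <- rnorm(90, mean = rep(c(1, 2, 3), each = 30))
A <- factor(rep(1:3, each = 30))
kruskal.test(y ~ A)  # rank-based test that the group distributions are identical
```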

Correlation

cor.test
     Test for association between paired samples, using one of
     Pearson's product moment correlation coefficient, Kendall's tau or
     Spearman's rho.

Usage:

     cor.test(x, ...)

     ## Default S3 method:
     cor.test(x, y,
              alternative = c("two.sided", "less", "greater"),
              method = c("pearson", "kendall", "spearman"),
              exact = NULL, conf.level = 0.95, ...)

pearson is the classical correlation; spearman is the correlation computed on the ranks of the observations

x<-rnorm(100)
y<-rnorm(100)+x
cor(x,y) #pearson
## [1] 0.7473563
cor.test(x,y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 11.135, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6457273 0.8229675
## sample estimates:
##       cor 
## 0.7473563
cor(x,y,method="spearman") #same as 
## [1] 0.7126553
cor(rank(x),rank(y))
## [1] 0.7126553
cor.test(x,y,method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 47886, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7126553
#kendall is more complicated: it is based on the difference between concordant and discordant pairs of observations

Formulae in R

The operator ~ is used to define a model formula in R.

It means: explain what’s on the left of ~ by what’s on the right of ~

Suppose y, x, x0, x1, x2, … are numeric variables, X is a matrix and A, B, C, … are factors (?factor). The following formulae on the left side below specify statistical models as described on the right.

y ~ x
y ~ 1 + x
    Both imply the same simple linear regression model of y on x. The first has 
    an implicit intercept term, and the second an explicit one.
    
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
    Simple linear regression of y on x through the origin (that is, without an 
    intercept term).
log(y) ~ x1 + x2
    Multiple regression of the transformed variable, log(y), on x1 and x2 
    (with an implicit intercept term).
y ~ 1 + x + I(x^2)
    Polynomial regression of y on x of degree 2. 
y ~ A
    Single classification analysis of variance model of y, with classes 
    determined by A.
y ~ A + x
    Single classification analysis of covariance model of y, with classes 
    determined by A, and with covariate x.
y ~ A*B
y ~ A + B + A:B
y ~ A+ B %in% A
y ~ A/B
    Two factor non-additive model of y on A and B. The first two specify 
    the same crossed classification and the second three specify the same 
    nested classification. 
y ~ (A + B + C)^2
y ~ A*B*C - A:B:C
    Three factor experiment but with a model containing main effects and 
    two factor interactions only. Both formulae specify the same model.
y ~ A * x
y ~ A/x
y ~ A/(1 + x) - 1
    Separate simple linear regression models of y on x within the levels of A, 
    with different codings. The last form produces explicit estimates of as 
    many different intercepts and slopes as there are levels in A.
y ~ A*B + Error(C)
    An experiment with two treatment factors, A and B, and error strata 
    determined by factor C. For example a split plot experiment, with whole 
    plots (and hence also subplots), determined by factor C. 
    
cbind(y1,y2,y3) ~ A  

     formula for MANOVA    
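As a quick check that these formulae really do change the fitted model, compare the coefficients with and without an intercept (simulated data for illustration):

```r
set.seed(6)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
coef(lm(y ~ x))      # intercept and slope
coef(lm(y ~ 0 + x))  # slope only: regression through the origin
```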

Linear models (Regression, multiple regressions, ANOVAs, ANCOVAs)

The basic function for fitting ordinary multiple models is lm(), and a streamlined version of the call is as follows:

  fitted.model <- lm(formula, data = data.frame)

For instance

fm2 <- lm(y ~ x1 + x2, data = dat) 

would fit a multiple regression model of y on x1 and x2 (with implicit intercept term).

Careful: anova() uses sequential sums of squares, so if x1 and x2 are correlated, the ANOVA tables from lm(y~x1+x2) and lm(y~x2+x1) differ.

  x1<-rnorm(100,mean=10,sd=2)
  x2<-rnorm(100,mean=5,sd=3)
  y<-8*x1+rnorm(100,sd=5)
  dat<-data.frame(y=y,x1=x1,x2=x2)
  fm2 <- lm(y ~ x1 + x2, data = dat)
  summary(fm2)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6228  -3.9317   0.8009   3.1696  10.9475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.5440     2.6239  -1.351   0.1799    
## x1            8.1693     0.2457  33.248   <2e-16 ***
## x2            0.3054     0.1719   1.776   0.0788 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.93 on 97 degrees of freedom
## Multiple R-squared:  0.9196, Adjusted R-squared:  0.918 
## F-statistic: 554.8 on 2 and 97 DF,  p-value: < 2.2e-16
  anova(fm2)      
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq   F value Pr(>F)    
## x1         1 26896.2 26896.2 1106.5253 <2e-16 ***
## x2         1    76.7    76.7    3.1556 0.0788 .  
## Residuals 97  2357.8    24.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  fm1<-lm(y~x1,data=dat)
  anova(fm2,fm1)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1     97 2357.8                             
## 2     98 2434.5 -1   -76.703 3.1556 0.0788 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  par(mfrow=c(2,2)) #diagnostic plots, essential to check
  plot(fm1)

  par(mfrow=c(1,1))
#note: order of variables matters for prop of variance explained (anova)!
  fm2b <- lm(y ~ x2 + x1, data = dat)
  anova(fm2)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq   F value Pr(>F)    
## x1         1 26896.2 26896.2 1106.5253 <2e-16 ***
## x2         1    76.7    76.7    3.1556 0.0788 .  
## Residuals 97  2357.8    24.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  anova(fm2b)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq   F value  Pr(>F)    
## x2         1   104.1   104.1    4.2835 0.04114 *  
## x1         1 26868.8 26868.8 1105.3975 < 2e-16 ***
## Residuals 97  2357.8    24.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Generic functions for extracting model information

The value of lm() is a fitted model object; technically a list of results of class “lm”. Information about the fitted model can then be displayed, extracted, plotted and so on by using generic functions that orient themselves to objects of class “lm”. These include

 add1    deviance   formula      predict  step
 alias   drop1      kappa        print    summary
 anova   effects    labels       proj     vcov
 coef    family     plot         residuals

A brief description of the most commonly used ones is given below.

anova(object_1, object_2)
    Compare a submodel with an outer model and produce an analysis of variance table.



summary(object)
    Print a comprehensive summary of the results of the regression analysis.

coef(object)
    Extract the regression coefficient (matrix).

    Long form: coefficients(object).


plot(object)
    Produce four plots, showing residuals, fitted values and some diagnostics.


predict(object, newdata=data.frame)
    The data frame supplied must have variables specified with the same labels as 
    the original. The value is a vector or matrix of predicted values corresponding 
    to the determining variable values in data.frame.


residuals(object)
    Extract the (matrix of) residuals, weighted as appropriate.

    Short form: resid(object).


step(object)
    Select a suitable model by adding or dropping terms and preserving hierarchies. 
    The model with the smallest value of AIC (Akaike's An Information Criterion) 
    discovered in the stepwise search is returned.
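A brief sketch of some of these extractors in action on a simple fitted model:

```r
set.seed(7)
x <- rnorm(40)
y <- 1 + 3 * x + rnorm(40)
fit <- lm(y ~ x)
coef(fit)                                        # intercept and slope estimates
head(residuals(fit))                             # first few residuals
predict(fit, newdata = data.frame(x = c(0, 1)))  # predictions at x = 0 and x = 1
```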

Generalized linear models:

glm(formula, family=gaussian)

 family can be one of:
 binomial(link = "logit")
 gaussian(link = "identity")
 Gamma(link = "inverse")
 inverse.gaussian(link = "1/mu^2")
 poisson(link = "log")
 quasi(link = "identity", variance = "constant")
 quasibinomial(link = "logit")
 quasipoisson(link = "log")
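For instance, a logistic regression of a simulated binary response on a covariate (a sketch; the true intercept and slope are 0.5 and 2):

```r
set.seed(8)
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))    # true success probabilities
y <- rbinom(200, size = 1, prob = p)  # binary response
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)  # estimates should be close to 0.5 and 2
```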

For mixed models (with a mixture of fixed and random effects), use the package lme4:

library(lme4)
?lmer

Ben Bolker maintains an extensive FAQ about mixed models and generalized linear mixed models


Multivariate tests:

manova(cbind(y1,y2,y3)~A)

prcomp() performs PCA, but it is often better to use the package ade4:

library(ade4)
?dudi.pca

Questions


Answer these questions using the file class2005.txt at URL:

http://www2.unil.ch/popgen/teaching/R11/

  1. average size of the class? for males? for females?
  2. are the sizes normally distributed?
  3. confidence interval for the average size?
  4. correlation between weight and size?
  5. can one predict the size of somebody based on their shoe size?
  6. Are eye and hair colours independent?
  7. Are smokers heavier than non-smokers?
  8. does the answer to 7 depend on sex?
  9. Does size differ according to hair colour?
  10. Does the weight of boys and girls differ? does this depend on their respective sizes?
  11. Does smoking status depend on sex?

Packages

Part of the strength of R is its almost infinite number of packages available to enhance it. However, this profusion has grown without much order, and it is often difficult to find what one is looking for. A notable exception is the Bioconductor suite of packages (http://www.bioconductor.org), which

provides tools for the analysis and comprehension of high-throughput genomic data.

Other specific areas have been loosely organized in task views, a list of which can be found at http://cran.r-project.org/web/views/

To install a package, use install.packages(). For instance to install ape, type

install.packages("ape")

You might find it useful to install the packages in a specific place rather than the default, either because you are working on a machine where you don’t have administrator rights, or because you don’t want to have to update everything each time you install a new version of R. One way is to create a .Renviron file in your home directory, where you specify R_LIBS=/your_path_to_Rlibs

To load a package, use library(package), e.g.:

library(ape)
?plot.phylo
example(plot.phylo)