The goal of this document is to make sure you have the relevant R knowledge to follow the courses offered during the BEC master program. What is described below corresponds loosely to what is taught during the bachelor program at UNIL.
The book Getting started with R: an introduction for biologists, whose material overlaps substantially with what follows, can be consulted online once you have a UNIL affiliation.
The document is divided into 5 practicals, with exercises at the end of each. Make sure you try to do these exercises. If you have tried and want to see the solutions, send me an email (jerome.goudet.[at].unil.ch) with your attempt at answering the exercises. Good luck!
What is R: http://stat.ethz.ch/CRAN/doc/FAQ/R-FAQ.html
R editors: why an editor? Syntax highlighting, plus avoiding re-typing commands…
For all systems: the integrated environment RStudio (http://www.rstudio.com/) is highly recommended (i.e., install it if not done yet, and use it for this course). We will use R Markdown to write notes and produce documents. A very brief video about RStudio is also available.
If you’ve never used R and RStudio before, you can watch this video (TO BE UPDATED: create our own video).
In this first part, we review data structures in R, and how to import and export data. You should be able to answer the questions at the end of this section after having read and understood the text below. A piece of advice: type (rather than copy/paste) the examples in R, it helps!
To interact with R, you must enter commands at the prompt. The most basic way to use R is as a super calculator. In the console window, enter the following commands:
2+3
## [1] 5
2*3
## [1] 6
2^3
## [1] 8
2^{-1}
## [1] 0.5
exp(2)
## [1] 7.389056
log(2)
## [1] 0.6931472
log10(2)
## [1] 0.30103
2e-03
## [1] 0.002
2e+03
## [1] 2000
The basic objects in R are vectors, matrices, lists and “data frames”.
R operates on structured data bearing a name. The simplest structure is a vector, which is a simple entity consisting of an ordered collection of numbers. For instance,
x.vector<-c(5.1,6.3,7.8,9.3,10.5)
is a vector made up of five numbers. The command above uses the assignment operator ‘<-’ and the function ‘c’, which combines the five values into a vector. Note that the assignment can also be made in the other direction:
c(5.1,6.3,7.8,9.3,10.5)->x.vector
If an expression is used without being assigned to an object, it is printed to the screen and then lost. For example, the instruction
1/x.vector
## [1] 0.1960784 0.1587302 0.1282051 0.1075269 0.0952381
prints the inverses of the five values, but does not store them.
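To keep such a result, assign it to a name; a minimal sketch (inv.x is a name chosen here for illustration):
inv.x<-1/x.vector #the result is now stored
inv.x #and can be printed or reused later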
Vectors can be used in arithmetic expressions; the operations are then carried out on each element of the vector. For instance, look at the results of the 3 instructions below:
x.vector-1
## [1] 4.1 5.3 6.8 8.3 9.5
2*(x.vector-1)
## [1] 8.2 10.6 13.6 16.6 19.0
x.vector^2
## [1] 26.01 39.69 60.84 86.49 110.25
where ‘^’ stands for power. Let’s make a second vector:
y.vector<-c(1.3,3.4,8.1,3.6,5.7)
with the same length as x.vector. Look at the results of the following instructions:
x.vector+y.vector
## [1] 6.4 9.7 15.9 12.9 16.2
x.vector*y.vector
## [1] 6.63 21.42 63.18 33.48 59.85
Note that vectors of different lengths can be used in an operation. In this case, the shorter vector is recycled (its values are duplicated) until it reaches the size of the longer vector. If the length of the longer vector is not a multiple of that of the shorter one, a warning is emitted:
x2<-c(1,1)
y5<-c(1,2,3,4,5)
x2+y5
## [1] 2 3 4 5 6
Warning message: In x2 + y5 : longer object length is not a multiple of shorter object length
On top of the common operations, you can also use the functions ‘log’, ‘log10’, ‘exp’, ‘sin’, ‘cos’, ‘tan’, ‘sqrt’, etc. ‘range’ gives the minimum and maximum of a vector or matrix:
range(x.vector)
## [1] 5.1 10.5
c(min(x.vector),max(x.vector))
## [1] 5.1 10.5
The number of items in a vector can be obtained using ‘length’:
length(x.vector)
## [1] 5
thus, the valid indices for the elements of this vector are 1:length(x.vector) (‘:’ between 2 numbers is an instruction to generate the sequence of integers between the 2 numbers). Elements from a vector can be extracted by specifying the element number between square brackets ‘[]’. For instance, item 3 of x.vector is:
x.vector[3]
## [1] 7.8
and the instruction
x.vector[1:3]
## [1] 5.1 6.3 7.8
extracts the first 3 elements of the vector. Two useful functions are the sum and product of the elements of a vector:
sum(x.vector)
## [1] 39
prod(x.vector)
## [1] 24472.46
Vectors can be made of logical elements:
c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)
## [1] TRUE FALSE TRUE TRUE FALSE TRUE
c(T,F,T,T,F,T)
## [1] TRUE FALSE TRUE TRUE FALSE TRUE
where T stands for TRUE and F for FALSE. Beware, it is dangerous to use these shortcuts…
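A minimal illustration of the danger: TRUE and FALSE are reserved words, but T and F are ordinary variables that can be overwritten:
T<-0 #perfectly legal: T is just a variable, and now holds 0
c(T,F) #now gives 0 0 instead of TRUE FALSE
rm(T) #removes our variable, restoring the default binding of T to TRUE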
We can also have character strings:
c("This","is","my","first","R","course")
## [1] "This" "is" "my" "first" "R" "course"
For a sequence of integers, use ‘:’:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
If you want it in steps of 2, type:
2*1:10
## [1] 2 4 6 8 10 12 14 16 18 20
This works because ‘:’ has higher precedence than ‘*’. To obtain the reverse sequence:
10:1
## [1] 10 9 8 7 6 5 4 3 2 1
A more general function to obtain a sequence is the function seq (try ?seq):
seq(-2*pi,3*pi,pi/4)
## [1] -6.2831853 -5.4977871 -4.7123890 -3.9269908 -3.1415927 -2.3561945
## [7] -1.5707963 -0.7853982 0.0000000 0.7853982 1.5707963 2.3561945
## [13] 3.1415927 3.9269908 4.7123890 5.4977871 6.2831853 7.0685835
## [19] 7.8539816 8.6393798 9.4247780
And another useful function is rep:
rep(1,10)
## [1] 1 1 1 1 1 1 1 1 1 1
rep(1:5,2)
## [1] 1 2 3 4 5 1 2 3 4 5
Two other useful functions are ‘sort’ and ‘order’:
y<-c(3.1,6.3,2.1,4.4,-1.9)
sort(y)
## [1] -1.9 2.1 3.1 4.4 6.3
order(y)
## [1] 5 3 1 4 2
#compare next expression with sort(y)
y[order(y)]
## [1] -1.9 2.1 3.1 4.4 6.3
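order is especially handy to reorder one vector according to another; a small sketch, where z is a vector of labels invented for the illustration:
z<-c("a","b","c","d","e") #labels paired with the values of y
z[order(y)] #labels sorted by increasing y
## [1] "e" "c" "a" "d" "b"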
Matrices are obtained through the instruction matrix:
m1<-matrix(1:12,nrow=3)
m1
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
By default (see ?matrix), matrices are filled by columns. If you want to fill them by rows, use:
m2<-matrix(1:12,nrow=3,byrow=T)
m2
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
(m2<-matrix(1:12,nrow=3,byrow=T)) # adding () around an expression prints it
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
Matrix dimensions are obtained using dim:
dim(m1)
## [1] 3 4
Note that a vector does not have a dimension (although it has a length):
dim(x.vector)
## NULL
You can coerce a vector into a matrix:
m.x.vector<-as.matrix(x.vector)
dim(m.x.vector)
## [1] 5 1
You can stick matrices together by rows or columns:
rbind(m1,m2)
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
## [4,] 1 2 3 4
## [5,] 5 6 7 8
## [6,] 9 10 11 12
cbind(m1,m2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 4 7 10 1 2 3 4
## [2,] 2 5 8 11 5 6 7 8
## [3,] 3 6 9 12 9 10 11 12
Elements from a matrix are extracted using square brackets:
m1[3,2]
## [1] 6
m1[2,3]
## [1] 8
We can access a full row
m1[2,]
## [1] 2 5 8 11
or column:
m1[,2]
## [1] 4 5 6
We can also remove a row or a column from a matrix:
m1[-2,]
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 3 6 9 12
m1[,-2]
## [,1] [,2] [,3]
## [1,] 1 7 10
## [2,] 2 8 11
## [3,] 3 9 12
Element-wise multiplication of matrices can be carried out:
m1*m2
## [,1] [,2] [,3] [,4]
## [1,] 1 8 21 40
## [2,] 10 30 56 88
## [3,] 27 60 99 144
And the classic matrix product is obtained using %*%:
m1 %*% t(m2)
## [,1] [,2] [,3]
## [1,] 70 158 246
## [2,] 80 184 288
## [3,] 90 210 330
t(m1) %*% m2
## [,1] [,2] [,3] [,4]
## [1,] 38 44 50 56
## [2,] 83 98 113 128
## [3,] 128 152 176 200
## [4,] 173 206 239 272
where t stands for transpose (by the way, why don’t we take the product m1 %*% m2?).
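A hint, as a minimal sketch: %*% requires the number of columns of the left matrix to equal the number of rows of the right one, which is not the case here:
dim(m1) #3 4
dim(m2) #3 4: ncol(m1) != nrow(m2)
try(m1 %*% m2) #would raise a 'non-conformable arguments' error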
A list in R is an object containing an ordered list of objects, its components. Objects in a list do not have to be of the same type. A new list is created using list:
(my.list<-list("first element"=T, "second element"="Coucou","third"=c(4,3),"fourth"=9.99))
## $`first element`
## [1] TRUE
##
## $`second element`
## [1] "Coucou"
##
## $third
## [1] 4 3
##
## $fourth
## [1] 9.99
To get to a specific element of a list, double square brackets [[ ]] are used:
my.list[[2]]
## [1] "Coucou"
Note that to obtain several elements, we use single square brackets:
my.list[1:3]
## $`first element`
## [1] TRUE
##
## $`second element`
## [1] "Coucou"
##
## $third
## [1] 4 3
Last, it is convenient to call an element of a list by its name rather than by its position, using $ followed by the first discriminating characters of the element’s name:
my.list$fir
## [1] TRUE
(here, the first 2 letters would have been sufficient)
my.list$s #second element
## [1] "Coucou"
The names of the elements in a list can be obtained with names:
names(my.list)
## [1] "first element" "second element" "third" "fourth"
and the internal structure of a list can be obtained with str:
str(my.list)
## List of 4
## $ first element : logi TRUE
## $ second element: chr "Coucou"
## $ third : num [1:2] 4 3
## $ fourth : num 9.99
Data frames can be considered as lists of data vectors of the same length, but not necessarily of the same type (contrary to matrices). The main idea behind data frames is to group the data corresponding to an entity of observation in a single structure. It’s similar in concept to your familiar Excel spreadsheet. Elements from a data frame can be accessed in the same way as those of a list or of a matrix:
a<-1:12
b<-12:1
cc<-letters[1:12] #why cc rather than c?
d<-rep(c(TRUE,FALSE),each=6) #careful, T & F are dangerous, use TRUE and FALSE instead
dataf<-data.frame(A=a,B=b,C=cc,D=d)
dataf
## A B C D
## 1 1 12 a TRUE
## 2 2 11 b TRUE
## 3 3 10 c TRUE
## 4 4 9 d TRUE
## 5 5 8 e TRUE
## 6 6 7 f TRUE
## 7 7 6 g FALSE
## 8 8 5 h FALSE
## 9 9 4 i FALSE
## 10 10 3 j FALSE
## 11 11 2 k FALSE
## 12 12 1 l FALSE
dataf[,2]
## [1] 12 11 10 9 8 7 6 5 4 3 2 1
dataf$B
## [1] 12 11 10 9 8 7 6 5 4 3 2 1
names(dataf)
## [1] "A" "B" "C" "D"
str(dataf)
## 'data.frame': 12 obs. of 4 variables:
## $ A: int 1 2 3 4 5 6 7 8 9 10 ...
## $ B: int 12 11 10 9 8 7 6 5 4 3 ...
## $ C: chr "a" "b" "c" "d" ...
## $ D: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Square brackets allow selecting parts of a data frame. The double equal sign == allows selecting elements equal to a criterion. For instance, in data frame dataf, to select the elements of D equal to TRUE:
dataf$D[dataf$D==TRUE]
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
To select the elements of A for which D is equal to TRUE:
dataf$A[dataf$D==TRUE]
## [1] 1 2 3 4 5 6
And to select all the columns of the data frame for which D is TRUE:
dataf[dataf$D==TRUE,] #why the comma?
## A B C D
## 1 1 12 a TRUE
## 2 2 11 b TRUE
## 3 3 10 c TRUE
## 4 4 9 d TRUE
## 5 5 8 e TRUE
## 6 6 7 f TRUE
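The same selection can also be written with the subset function; a brief sketch using the dataf object defined above:
subset(dataf,D) #rows where D is TRUE
subset(dataf,D,select=c(A,B)) #same rows, but keeping only columns A and B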
To save your entire environment (all the objects you see in the environment window), enter:
save.image("mydata.RData")
You can save a subset of your environment using save, with the names of the objects to save as arguments:
save(obj1,obj2,obj3,file="mydata.RData")
To save a single object, use saveRDS:
saveRDS(myobj,file="myobj.RDS")
If you previously saved an R environment with the command save(), you can load it using:
load("mydata.RData")
If you saved an object using saveRDS, you can load it with:
myobj<-readRDS("myobj.RDS")
An extremely useful command is read.table, which loads a text file into the environment and associates it with a data frame.
If, for instance, you want to read the content of the file ‘myfile.txt’ into R, you just need to type the command:
my.data<-read.table("myfile.txt",header=T)
my.data
For the command to work, of course, the file needs to be in the working directory. To set the working directory, use the command setwd("d:/mydir") (note the forward slash /, not a backslash \), and to locate the current working directory, use ‘getwd()’.
If the file is on the internet, you can just enter the URL. For instance, if your file is at URL:
http://www.unil.ch/popgen/teaching/R11
You can load it with read.table using the full URL of the file:
my.data<-read.table("http://www.unil.ch/popgen/teaching/R11/myfile.txt",
header=T)
head(my.data)
Sometimes, for very large data sets, read.table is very slow because it attempts to guess the type of each column. If the data are all of the same type, you might want to use scan (see ?scan) instead of read.table.
Saving a data frame as a text file:
write.table(my.data,"mydata.txt",row.names=FALSE,quote=FALSE,sep="\t")
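As a quick check, the file written above can be read back in; a minimal sketch (assuming my.data exists and the working directory is writable):
my.data2<-read.table("mydata.txt",header=TRUE,sep="\t") #read the file back
all.equal(my.data,my.data2) #should be TRUE for simple data sets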
Stars preceding the questions indicate their difficulty: the more stars, the more difficult the question.
Answer questions 1 to 10 below:
First, clean your R environment by issuing the instruction
rm(list=ls())
(*) Why should I avoid calling objects c, T or F?
(*) For the following numbers 7.3, 6.8, 0.005, 9, 12, 2.4, 18.9, 0.9
(*) use seq and rep to generate the following vectors:
(**) Create the following matrix, as simply as possible:
3 1 1 1
1 3 1 1
1 1 3 1
1 1 1 3
4a. (*) Multiply this matrix element-wise with itself. Get its matrix product by itself.
4b. (**) Solve the following system of equations (see ?solve):
2x + 3y - z = 4
5x - 10y + 2z = 0
x + y - 4z = 5
(*) Create a list mylist with the following elements:
(* / **) What’s the list’s length? Calculate the mean of the first element; of the second; of the first and second.
(**) Paste the third and fourth elements of the list (use paste). Which instruction would give the sentence “My mummy told me: apples, bananas, pears”? (Be careful, punctuation is important!)
(*) Create a data frame with the data contained in file parus1.txt located at http://www2.unil.ch/popgen/teaching/R11
(*) Create a vector data.fly containing the observations from alldat for which treat is equal to “mouche”. Add to the data the element: “puce”, 12.7300
(**) Try loading the file alien.txt (from http://www2.unil.ch/popgen/teaching/R11). What is wrong with this file? Fix it.
(**) Try loading the contents of the file class2005.xls. What needs to be done before you can load it into R?
Here we will discover the different possibilities R has for plotting data, and we will also refresh the notion of variance, covariance, correlation and regression.
R can do scatter plots, histograms, pie charts, bar plots, boxplots and violin plots.
If you are not using RStudio, before anything else you need to open a graphics window: on Windows, use windows(); on Mac, use quartz(); on unix/linux, use X11().
We will start with the versatile plot function.
Plots can also be sent directly to specific files, without being printed on screen. Available devices are pdf(), postscript(), png() and jpeg(), among others. Once you open such a device, you need to close it by calling the function dev.off(). pdf and postscript files can contain multiple pages, but the others cannot.
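A minimal sketch of this workflow (the file name sin.pdf is arbitrary):
pdf("sin.pdf") #open the pdf device: plots now go to the file, not the screen
x<- -10:10
plot(x,sin(x))
dev.off() #close the device; sin.pdf is written to the working directory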
To draw any function, use plot:
x<- -10:10
plot(x,sin(x))
What is the difference between the previous command and plot(sin(x))?
The plot instruction accepts numerous arguments (?plot for more). For instance, one can give labels to the axes, a title, the plotting color and symbols, etc.:
x<-seq(-10,10,0.01) #more on seq soon
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),ylab="f(x)")
x<-seq(0,4,0.1)
plot(x,exp(-(x^3+2*x^2-3*x+1)),ylab="exp(f(x))")
x<-seq(0,4,0.01)
plot(x,exp(-(x^3+2*x^2-3*x+1)),ylab="exp(f(x))",col="red")
plot’s default is to use points. Using the type argument, we can specify whether we want points and lines (type=“b”), a line (type=“l”) or nothing (type=“n”):
x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="b")
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="n")
If you want to see the 2 graphs in one window with two panels, you can specify the number of panels to show using the par function (see ?par):
par(mfrow=c(1,2)) #split graphic window in 1 row and 2 columns
x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12))
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")
par(mfrow=c(1,1))#to go back to one panel
We can also add several functions to the same graph. After a first call to plot, which defines the first thing to be printed, the axis ranges, labels etc., you can use the functions lines, text or points:
x<-seq(0,2,0.05)
ymax<-10
plot(x,x,ylim=c(0,ymax),ylab="f(x)")
lines(x,x^0.5,col="blue")
points(x,x^2,col="red")
text(x,x^3,labels="3",col="green")
segments(0.25,4,1.25,8,col="purple",lwd=2)
Note that several calls to plot will erase each other in turn.
plot is of course useful for showing a scatter plot of 2 variables, but can also be used for a single vector:
a<-rnorm(100);
plot(a)
b<-which(abs(a)>1.96)
#now create a vector specifying a red color for values
#larger than 1.96
coul<-rep("black",100);
coul[b]<-"red"
points(a,col=coul) #print these points in red
lines(a,type="h",col=coul) #type="h" draws a vertical bar from the x axis
Other useful plots are histograms (?hist) and boxplots (?boxplot):
hist(a)
hist(a,breaks=-40:40/10)
boxplot(a)
Boxplots are often used to compare groups of observations:
a1<-rnorm(100)
b1<-rnorm(100,mean=2)
c1<-rnorm(100,sd=3)
boxplot(a1,b1,c1)
For categorical data, we use barplots or piecharts:
categ<-rep(c("white","black","red"),c(20,20,40))
#We first need to make a count of the number of each color, using table:
tcateg<-table(categ)
barplot(tcateg,col=c("black","red","white"))
pie(tcateg,col=c("black","red","white"))
The list of colors predefined in R is given by colors().
For a tour of R’s graphics abilities, type:
demo(graphics)
Variance, covariance and correlations are obtained using var, cov and cor respectively. It is possible to use these functions to obtain the correlation between 2 vectors, or among all the variables (columns) of a data frame:
cor(x,y);
cor(d);
These functions don’t like missing data. If you have such data, you need to specify how to remove them. You can either remove the whole line containing one missing observation (use="complete.obs"), or eliminate missing values only when they turn up in a calculation (use="pairwise.complete.obs"). Compare the results of the following instructions:
x<-c(1,2,3,4,NA); y<-c(1,2,3,4,-100); z<-c(4,3,2,1,10)
d<-data.frame(x,y,z)
cor(d, use="complete.obs")
## x y z
## x 1 1 -1
## y 1 1 -1
## z -1 -1 1
cor(d, use="pairwise.complete.obs")
## x y z
## x 1 1.0000000 -1.0000000
## y 1 1.0000000 -0.9561118
## z -1 -0.9561118 1.0000000
These functions return one element only (a number or a matrix). More complex functions, for instance lsfit (?lsfit) or lm (?lm), return a list of elements. To access each element individually, use $. For instance:
fit<-lsfit(x,y)
fit$coef
lm(y~x)$coef
To get the names of all the elements of an object, use names().
A word on lsfit. It allows estimating, using least squares, the slope and intercept of a regression line \(Y=a+bX\). The intercept \(a\) is the value of \(Y\) for which \(X=0\). The slope \(b\) expresses by how many units \(Y\) varies when \(X\) changes by one unit. We’ll see lm in more detail later.
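A quick worked sketch with made-up data (xx and yy are names invented for the illustration):
xx<-1:10
yy<-2+3*xx+rnorm(10) #true intercept 2, true slope 3, plus noise
fit<-lsfit(xx,yy)
fit$coef #estimates should be close to 2 and 3
lm(yy~xx)$coef #same estimates via lm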
Also note that the package ggplot2:
#if package not yet installed (see practical 7)
install.packages("ggplot2")
library(ggplot2) #?ggplot for help
has many more options and capabilities.
Last, a book specifically about R graphics: Murrell Paul (2006) R Graphics
Stars preceding the questions indicate their difficulty: the more stars, the more difficult the question.
Explore ggplot2 and find the commands necessary to produce the previous plots with it.
Use lsfit or the lm function, and its element coef, to obtain the least squares / linear model best fit.
Use lsfit and abline to produce the best linear fit line.
Store the predicted values in y5.pred.
What are the residuals of the regression of \(y_5\) on \(x\)? Store this in y5.res.
The material described here is slightly more advanced, and may be skipped on a first read.
R has by default several built-in functions (mean, var, cor, plot, lm, …), and several more can be downloaded as packages.
But you might (will) need to write your own functions to carry out repetitive tasks. The basic concepts are very similar to those you’ve learned for any other programming language, such as C++, Java or Fortran. This practical should give you a first brief feel of function writing in R. The general skeleton of a function is:
my.function<-function(x,y){
instruction 1
instruction 2
.
.
.
instruction n
}
Once it has been typed in, it is used as
my.function(a,b)
where a and b are the two arguments to pass to the function. An example will clarify things:
my.mean<-function(a){
#this function will calculate the mean of a
return(sum(a)/length(a))
}
#
my.mean(1:10) # my.mean will take argument a to be the sequence 1:10
## [1] 5.5
#
my.var<-function(a){
n<-length(a)
m<-sum(a)/n
va<-sum((a-m)^2)/n
return(va)
}
#
my.var(1:10)
## [1] 8.25
In this function, return(va) on the last line returns va. You will notice that this last function does not give the same result as the built-in var function. Why? Looking at the help for var gives it away: the built-in var produces an unbiased variance, that is, the divisor is (n-1) rather than n. We could rewrite our function as:
my.var.ub<-function(a){
n<-length(a)
m<-sum(a)/n
va<-sum((a-m)^2)/n
va<-n/(n-1)*va
return(va)
}
#
my.var.ubb<-function(a){
n<-length(a)
va<-my.var(a)*n/(n-1)
return(va)
}
Ideally, our function should let us choose whether we want the variance with or without bias. We can use an if instruction in the body of the function to tell it what to return:
my.var.n<-function(a,bias=TRUE){
n<-length(a)
m<-sum(a)/n
va<-sum((a-m)^2)/n
if (!bias) {
va<-va*n/(n-1)
}
return(va)
}
#
my.var.n(1:10)
## [1] 8.25
my.var.n(1:10,bias=FALSE)
## [1] 9.166667
var(1:10)
## [1] 9.166667
In this function, bias is a boolean. The biased variance is calculated first; the if statement is then evaluated, and the unbiased variance is calculated if bias is FALSE (! stands for negation). If we want the biased variance, we don’t need to tell R anything: the default for bias is TRUE.
The control structures for R are:
if (cond) expr (if cond is TRUE, do expr)
if (cond) expr1 else expr2 (if cond is TRUE, do expr1, else do expr2)
if (cond) {cons.expr1; cons.expr2} else {alt.expr1; alt.expr2} (same as before, but with several expressions for each situation; note the curly brackets around the expressions)
for (i in seq) expr (do expr as many times as there are elements in seq):
x<-1:10
for (i in 1:10) x[i]<-i+1 #2:11
x #would have been easier and quicker to enter x<-x+1
## [1] 2 3 4 5 6 7 8 9 10 11
while (cond) expr (do expr as long as cond is TRUE)
repeat { expr; if (cond) break } (do expr as long as cond is FALSE, therefore as long as we do not enter the if statement; beware, break is a necessity if you don’t want to end up in an infinite loop)
To print the squares of the first 20 integers, we could write:
for (i in 1:20) {
cat(i,i^2,"\n")
}
## 1 1
## 2 4
## 3 9
## 4 16
## 5 25
## 6 36
## 7 49
## 8 64
## 9 81
## 10 100
## 11 121
## 12 144
## 13 169
## 14 196
## 15 225
## 16 256
## 17 289
## 18 324
## 19 361
## 20 400
#for details on cat, ?cat. "\n" is the "return" character.
To print sequentially the sum of the first 10 integers:
i<-0;j<-0;
repeat{
i<-i+1; j<-j+i;
cat(i,j,"\n");
if (i>=10) {
break}
}
## 1 1
## 2 3
## 3 6
## 4 10
## 5 15
## 6 21
## 7 28
## 8 36
## 9 45
## 10 55
#
i<-0;j<-0;
while(i<10){
i<-i+1;j<-j+i;cat(i,j,"\n");
}
## 1 1
## 2 3
## 3 6
## 4 10
## 5 15
## 6 21
## 7 28
## 8 36
## 9 45
## 10 55
Compare the result with cumsum(1:10).
Note: the ‘;’ allows putting several expressions on the same line.
More information on writing functions can be found in several online tutorials:
http://cran.r-project.org/doc/manuals/R-intro.html#Writing-your-own-functions
http://zoonek2.free.fr/UNIX/48_R/02.html
http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Stars preceding the questions indicate their difficulty: the more stars, the more difficult the question.
(*) Obtain the mean, median, variance and mad of the following observations: 50; 60; 10; 20; 10; 40; 30; 800; 30; 40; 10
(*) Redo the calculations without the outlier. Conclusions?
(**) The skewness of a distribution is defined as \(1/n \sum_{i=1}^n [(Y_i-\bar{Y})/s]^3\), where \(Y_i\) are the observations, \(\bar{Y}\) their mean, \(n\) the number of observations and \(s\) their standard deviation. Write a function my.skewness that calculates the skewness of any numeric vector, and estimate the skewness of the data in questions 1 and 2.
(***) Write your own function to calculate the median and the MAD, and apply it to the data from questions 1 and 2. [Reminder: the median is the value that splits the data into 2 equal parts. If the number of data points is odd, the median is the ((n+1)/2)th largest datum; if it is even, the median is the mean of the (n/2)th and (n/2+1)th largest data. Finally, MAD = med(abs(x_i - med(x))).]
(*) Load the file class2005.xls from http://www.unil.ch/popgen/teaching/R11/class2005.xls. How many non-smoking students are there in this data set? How many different hair colors?
(*) Draw a histogram of sizes. Are the different size classes equifrequent?
(**) Plot the empirical cumulative distribution of sizes (use plot and sort). Using this plot, estimate the 30th and 70th percentile of the distribution of sizes.
(*) use histogram and qqnorm (?qqnorm) and qqline to check whether sizes are normally distributed. Which is better? How would you test for normality?
(*) Use quantile to validate your approximation for question 6
(*) What’s the mean weight? Estimate it in kg, then in g (you don’t need to use R to estimate it in g).
(*) What’s the weight variance in kg^2? In g^2? (You don’t need to use R to estimate it in g^2.)
(*) Rescale the variable weight so that it has mean 0 and sd 1. Why is this transformation useful? What are the 2.5th and 97.5th percentiles of the rescaled weight?
Some of the material covered in this practical is slightly more advanced (marked with a star), and can be skipped on a first read.
R has built-in functions to calculate probability densities, cumulative distributions, and inverses of the cumulative distribution (quantiles) for several standard probability distributions. Furthermore, R can generate random values from these distributions. There are 4 functions for each distribution, starting with the following letters:
1. d: for density, for instance 'dnorm' for the probability density of the normal distribution
2. p: for probability, e.g. 'pnorm';
3. q: for quantile, e.g 'qnorm';
4. r: for random, e.g. 'rnorm';
The classical distributions (normal, uniform, binomial, Poisson, …) are the most common ones, but several others are also available:
?Distributions
and even more here:
http://cran.r-project.org/web/views/Distributions.html
Let’s take 1000 random numbers from a normal with mean 0 and sd 1 (the default parameters):
a<-rnorm(1000)
and plot a histogram of this vector:
hist(a)
qqnorm(a);qqline(a) #best way to verify normality!
Do the same, but with mean 4 and sd 5:
b<-rnorm(1000, mean=4, sd=5)
hist(b)
qqnorm(b);qqline(b)
pnorm gives the probability that a normal deviate is less than a given number X, for instance -1.96:
pnorm(-1.96)
## [1] 0.0249979
pnorm(1.96) #why?
## [1] 0.9750021
If a random variable follows a normal distribution, it has a 2.5% chance of being smaller than -1.96 (and since this distribution is symmetrical, it also has a 2.5% chance of being larger than 1.96).
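A quick check of this symmetry (a minimal sketch):
1-pnorm(1.96) #same as pnorm(-1.96), by symmetry
pnorm(1.96)-pnorm(-1.96) #about 0.95: the familiar 95% interval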
With qnorm, you do the reverse: what is the value of the normal deviate below which I have a 2.5% chance of finding a deviate:
qnorm(0.025)
## [1] -1.959964
Now try:
mean(a)
## [1] -0.01518839
sd(a)
## [1] 1.010412
b<-rnorm(1000,20,2)
mean(b)
## [1] 20.0048
sd(b)
## [1] 2.111253
Some examples with other distributions:
rbinom(30,50,0.5) #30 random deviates from a binomial with n=50 and p=0.5
## [1] 27 16 23 25 29 21 25 25 29 26 26 20 25 23 22 30 26 23 24 24 28 24 27 20 22
## [26] 27 18 16 27 26
rpois(30,2.4) # 30 random deviates from a Poisson with lambda=2.4
## [1] 4 1 3 2 4 2 2 2 5 2 0 3 6 4 1 4 3 1 2 2 1 3 5 2 3 1 2 3 5 0
dpois(0,2.4) # Probability to observe 0 if the underlying distribution is a poisson with param 2.4
## [1] 0.09071795
plot(0:10,dpois(0:10,2.4),type="h")
A particularly interesting distribution is that of p-values. Under the null hypothesis, it can be shown that p-values should follow a uniform distribution. Now, with big data available everywhere, we are often confronted with very large numbers of p-values, and, by chance, some will be below the standard 5% threshold. How do we decide which p-values are real outliers?
pval<-runif(100000)
hist(pval,breaks=seq(0,1,0.05))
Out of 100000 p-values, 4897 are significant at the 5% level. Is there something to worry about?
n<-length(pval)
par(mfrow=c(1,2))
plot((1:n)/n,sort(pval),xlab="theo. perc.",ylab="obs. perc.");abline(c(0,1))
plot(-log10((n:1)/n),sort(-log10(pval)),xlab="theo perc. (-log10)",ylab="obs. perc.(-log10)");abline(c(0,1))
Now imagine that 5 of the p-values are really, really small, say less than \(10^{-6}\):
pval[1:5]<-runif(5)/1000000
par(mfrow=c(1,2))
plot((1:n)/n,sort(pval),xlab="theo. perc.",ylab="obs. perc");abline(c(0,1))
plot(-log10((n:1)/n),sort(-log10(pval)),xlab="theo perc. (-log10)",ylab="obs. perc. (-log10)");abline(c(0,1))
The outlying values now appear clearly on the -log10 (right) panel, but not on the untransformed panel. These types of plots are commonly used to identify p-values that are outliers.
Another useful function is sample. It allows sampling with or without replacement from a vector (see ?sample). This function is at the heart of randomization and bootstrap tests:
sample(1:10,replace=FALSE) #permutations of the elements
## [1] 6 7 10 1 5 2 8 4 3 9
table(sample(0:20,size=20,replace=TRUE)) #20 random draws of numbers between 0 & 20 with replacements
##
## 0 1 2 4 5 6 7 12 13 15 17 20
## 2 1 2 2 1 3 3 1 1 2 1 1
head(x1<-sample(1:0,size=1000,replace=TRUE,prob=c(0.2,0.8)))
## [1] 0 1 0 0 0 0
#similar to, but much slower than
head(x2<-rbinom(1000,size=1,p=0.2))
## [1] 0 1 0 0 0 0
mean(x1);mean(x2)
## [1] 0.197
## [1] 0.206
Stars preceding the questions indicate their difficulty: the more stars, the more difficult the question.
(**) sample without replacement 250 numbers between 0 and 1000. Draw a histogram of these numbers. Think of a graphical way of verifying if these numbers are coming from a uniform distribution (hint: have a look at qqplot). Can you think of a way for testing this? (more on this in the next practical)
(*) Imagine that the height of the male population in Switzerland follows a normal distribution with mean 172cm and sd 30cm.
(*) The greater white-toothed shrew, Crocidura russula, is a small anthropophilic insectivore living around here, and has been studied by research groups in Lausanne. The population size is stable, meaning that the mean number of surviving offspring per female is 2. Suppose that the number of surviving offspring follows a Poisson distribution. What is the probability that a female has no surviving offspring? One surviving offspring? 4 or more? (Beware, there is a small trap here.)
(*/ **) In several species, the probability p of dying is independent of age. The distribution of ages at death then follows a geometric distribution, \(P(X=x)=p(1-p)^x\). We are interested in a population where the probability of dying during a year is 0.15. You can use the geom family of functions (e.g. pgeom, rgeom, dgeom, qgeom) to answer the following:
The following exercise is a good way of demonstrating the central limit theorem.
Now, roll two dice simultaneously 1000 times, and store the sum of the 2 displayed numbers in a vector x2. How are these data distributed? Do the same for 5 dice (in vector x5) and 20 dice (in vector x20). Plot on the same figure the normal quantile-quantile plots (qqnorm) obtained from these 4 experiments. Conclusions?
[optional] You can do the same for other distributions, for instance the geometric distribution we saw in the previous paragraph.
The following two exercises will show you how to use R to do simulations. To solve them, start by thinking about what type of information you’d like the function to produce, and then think about ways of generating it. You’ll have to create vectors or matrices and use loops.
(***) Write a function that simulates the evolution of the size of a finite hermaphroditic population with non-overlapping generations under demographic stochasticity. We will suppose that fertility follows a Poisson distribution. This function should take as arguments the initial population size, the fertility, the number of generations over which you want to follow population size, and whether to plot the census size as a function of generation. It should output the population sizes through time at the end.
(***) Write a function drift that simulates the effect of random genetic drift. It should take as parameters the population size (fixed), the number of generations, the initial allele frequency, and a number of replicates. It should produce at the end a graph of the changes through time of allele frequencies in the different replicates.
The questions at the end of this section should help you evaluate whether you also need to refresh your statistical knowledge. They can be divided into, first, identifying the statistic and/or statistical test to apply; and second, evaluating the statistic and/or applying the test using R. Each can be answered using a standard statistical test you should have learnt about during your bachelor program. The way to run these tests with R is described below.
R has built-in functions for carrying out your typical statistical tests. We will go through them in this practical.
We’ll use the following data set to illustrate some of these functions:
dat<-read.table("http://www.unil.ch/popgen/teaching/R11/class2005.txt",header=TRUE)
Chi-squared tests: chisq.test performs chi-squared contingency table tests and goodness-of-fit tests.
Usage:
chisq.test(x, y = NULL, correct = TRUE,
p = rep(1/length(x), length(x)), rescale.p = FALSE,
simulate.p.value = FALSE, B = 2000)
Note: if you have 2 vectors, each containing the state of the observations (for instance, eye and hair colours), you can obtain the contingency table using the table function (?table). Another useful function is xtabs.
fisher.test performs Fisher’s exact test for count data: it tests the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
control = list(), or = 1, alternative = "two.sided",
conf.int = TRUE, conf.level = 0.95,
simulate.p.value = FALSE, B = 2000)
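A small sketch on a made-up 2x2 table (the counts below are invented for illustration):
tab<-matrix(c(12,5,7,16),nrow=2) #e.g. rows: smoker/non-smoker; columns: male/female
chisq.test(tab)
fisher.test(tab) #exact test on the same table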
For comparing two empirical distributions, or an empirical distribution to a theoretical one, use the Kolmogorov-Smirnov test:
ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"),
exact = NULL)
shapiro.test tests for normality, if you really need to:
shapiro.test(x)
But most of the time, a qqnorm is preferable:
a<-rnorm(1000) #from normal dist
qqnorm(a);qqline(a)
b<-rlnorm(1000) # from log normal dist
qqnorm(b);qqline(b)
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
t.test(x,mu=0)
t.test(x,y)
t.test(x,y,paired=T)
Examples:
x<-rnorm(30)
y<-rnorm(30,mean=0.6)
t.test(x)
##
## One Sample t-test
##
## data: x
## t = -0.69441, df = 29, p-value = 0.493
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.5065638 0.2497694
## sample estimates:
## mean of x
## -0.1283972
t.test(y)
##
## One Sample t-test
##
## data: y
## t = 3.0188, df = 29, p-value = 0.005246
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.193342 1.005638
## sample estimates:
## mean of x
## 0.5994898
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -2.6826, df = 57.707, p-value = 0.009513
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2710861 -0.1846879
## sample estimates:
## mean of x mean of y
## -0.1283972 0.5994898
Wilcoxon tests (the non-parametric equivalent of t-tests):
wilcox.test(x,y)
wilcox.test(x,y,paired=T)
Bootstrapping: several functions exist in package boot, but it is simple to program yourself. For instance, for a confidence interval of the mean:
bootstrap<-function(x,perc=c(0.025,0.975),nboot=1000){
bm<-vector(length=nboot) #vector where bootstrap samples are stored
n<-length(x)
for (ib in 1:nboot){
dum<-sample(n,replace=T) #sample the indices of the observations with replacement
bm[ib]<-mean(x[dum],na.rm=T) #bootstrap sample i of the mean
}
return(list(ci=quantile(bm,probs=perc),bm=bm))
}
#example
x<-runif(1000,min=500,max=1000)
bx<-bootstrap(x);
bx$ci;
## 2.5% 97.5%
## 737.8967 755.4608
par(mfrow=c(1,2))
hist(x)
hist(bx$bm)
One-way analysis of variance:
oneway.test(y~A,var.equal=TRUE)
aov(y~A)
Testing homogeneity of variances:
bartlett.test #strongly depends on the normality assumption
Levene’s test is less sensitive to the normality assumption, and hence preferred, but has to be programmed. The idea is simple: if variances are heterogeneous, this should show up when one does an anova on the absolute values of the residuals:
set.seed(12) #to make sure that you obtain the same results as me
y1<-rnorm(30,mean=2,sd=2)
y2<-rnorm(30,mean=4,sd=10)
y3<-rnorm(30,mean=6,sd=2)
y<-c(y1,y2,y3)
a<-factor(rep(1:3,each=30)) #factor will be further explained below
par(mfrow=c(2,2))
boxplot(y~a)
res<-unlist(tapply(y,a,function(x) x-mean(x))) #tapply is very useful, as are apply and lapply
#unlist transforms a list into a vector (provided the elements of the list
#are of the same type)
boxplot(res~a)
boxplot(abs(res)~a) #see the difference in mean/median among the groups ?
summary(aov(abs(res)~a))
## Df Sum Sq Mean Sq F value Pr(>F)
## a 2 643.1 321.5 32.08 3.66e-11 ***
## Residuals 87 872.0 10.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The car library has a function for this: leveneTest (?leveneTest).
A permutation test for a one-way anova can be programmed as follows:
anova.perm<-function(y,x,nperm=100,var.equal=TRUE)
{
obs.f<-oneway.test(y~x,var.equal=var.equal)$statistic
bf<-numeric(length=nperm)
bf[nperm]<-obs.f
for (i in 1:(nperm-1)){
bf[i]<-oneway.test(y~sample(x),var.equal=var.equal)$statistic
}
return(list(p.val=sum(bf>=bf[nperm])/nperm,bf=bf))
}
#example
y<-rnorm(100)
x<-factor(rep(1:4,each=25))
oneway.test(y~x,var.equal=TRUE)
##
## One-way analysis of means
##
## data: y and x
## F = 0.65479, num df = 3, denom df = 96, p-value = 0.5819
res1<-anova.perm(y,x)
res1$p.val
## [1] 0.61
The non-parametric equivalent of a one-way anova is the Kruskal-Wallis test:
kruskal.test(y~A)
kruskal.test(list)
kruskal.test(y,A)
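For instance, with y and the factor x defined in the permutation example above:
kruskal.test(y~x)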
cor.test tests for association between paired samples, using one of Pearson’s product-moment correlation coefficient, Kendall’s tau or Spearman’s rho.
Usage:
cor.test(x, ...)
## Default S3 method:
cor.test(x, y,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact = NULL, conf.level = 0.95, ...)
Pearson is the classical correlation; Spearman is the correlation computed on the ranks of the observations:
x<-rnorm(100)
y<-rnorm(100)+x
cor(x,y) #pearson
## [1] 0.7473563
cor.test(x,y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 11.135, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6457273 0.8229675
## sample estimates:
## cor
## 0.7473563
cor(x,y,method="spearman") #same as
## [1] 0.7126553
cor(rank(x),rank(y))
## [1] 0.7126553
cor.test(x,y,method="spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 47886, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7126553
#kendall is more complicated: it is based on the difference between concordant and discordant pairs of observations
The operator ~ is used to define a model formula in R. It means: explain what’s on the left of ~ by what’s on the right of ~.
Suppose y, x, x0, x1, x2, … are numeric variables, X is a matrix and A, B, C, … are factors (?factor). The formulae on the left side below specify statistical models as described on the right.
y ~ x
y ~ 1 + x
Both imply the same simple linear regression model of y on x. The first has
an implicit intercept term, and the second an explicit one.
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
Simple linear regression of y on x through the origin (that is, without an
intercept term).
log(y) ~ x1 + x2
Multiple regression of the transformed variable, log(y), on x1 and x2
(with an implicit intercept term).
y ~ 1 + x + I(x^2)
Polynomial regression of y on x of degree 2.
y ~ A
Single classification analysis of variance model of y, with classes
determined by A.
y ~ A + x
Single classification analysis of covariance model of y, with classes
determined by A, and with covariate x.
y ~ A*B
y ~ A + B + A:B
y ~ A + B %in% A
y ~ A/B
Two factor non-additive model of y on A and B. The first two specify
the same crossed classification and the second two specify the same
nested classification.
y ~ (A + B + C)^2
y ~ A*B*C - A:B:C
Three factor experiment but with a model containing main effects and
two factor interactions only. Both formulae specify the same model.
y ~ A * x
y ~ A/x
y ~ A/(1 + x) - 1
Separate simple linear regression models of y on x within the levels of A,
with different codings. The last form produces explicit estimates of as
many different intercepts and slopes as there are levels in A.
y ~ A*B + Error(C)
An experiment with two treatment factors, A and B, and error strata
determined by factor C. For example a split plot experiment, with whole
plots (and hence also subplots), determined by factor C.
cbind(y1,y2,y3) ~ A
formula for MANOVA
The basic function for fitting ordinary multiple models is lm(), and a streamlined version of the call is as follows:
fitted.model <- lm(formula, data = data.frame)
For instance
fm2 <- lm(y ~ x1 + x2, data = dat)
would fit a multiple regression model of y on x1 and x2 (with implicit intercept term).
Careful: the anova decomposition is sequential, so if x1 and x2 are correlated, anova(lm(y~x1+x2)) differs from anova(lm(y~x2+x1)) (see below).
x1<-rnorm(100,mean=10,sd=2)
x2<-rnorm(100,mean=5,sd=3)
y<-8*x1+rnorm(100,sd=5)
dat<-data.frame(y=y,x1=x1,x2=x2)
fm2 <- lm(y ~ x1 + x2, data = dat)
summary(fm2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6228 -3.9317 0.8009 3.1696 10.9475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5440 2.6239 -1.351 0.1799
## x1 8.1693 0.2457 33.248 <2e-16 ***
## x2 0.3054 0.1719 1.776 0.0788 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.93 on 97 degrees of freedom
## Multiple R-squared: 0.9196, Adjusted R-squared: 0.918
## F-statistic: 554.8 on 2 and 97 DF, p-value: < 2.2e-16
anova(fm2)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 26896.2 26896.2 1106.5253 <2e-16 ***
## x2 1 76.7 76.7 3.1556 0.0788 .
## Residuals 97 2357.8 24.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fm1<-lm(y~x1,data=dat)
anova(fm2,fm1)
## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 97 2357.8
## 2 98 2434.5 -1 -76.703 3.1556 0.0788 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2)) #diagnostic plots, essential to check
plot(fm1)
par(mfrow=c(1,1))
#note: order of variables matters for prop of variance explained (anova)!
fm2b <- lm(y ~ x2 + x1, data = dat)
anova(fm2)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 26896.2 26896.2 1106.5253 <2e-16 ***
## x2 1 76.7 76.7 3.1556 0.0788 .
## Residuals 97 2357.8 24.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fm2b)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 104.1 104.1 4.2835 0.04114 *
## x1 1 26868.8 26868.8 1105.3975 < 2e-16 ***
## Residuals 97 2357.8 24.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Generic functions for extracting model information
The value of lm() is a fitted model object; technically a list of results of class “lm”. Information about the fitted model can then be displayed, extracted, plotted and so on by using generic functions that orient themselves to objects of class “lm”. These include
add1, alias, anova, coef, deviance, drop1, effects, family, formula, kappa, labels, plot, predict, print, proj, residuals, step, summary, vcov
A brief description of the most commonly used ones is given below.
anova(object_1, object_2)
Compare a submodel with an outer model and produce an analysis of variance table.
summary(object)
Print a comprehensive summary of the results of the regression analysis.
coef(object)
Extract the regression coefficient (matrix).
Long form: coefficients(object).
plot(object)
Produce four plots, showing residuals, fitted values and some diagnostics.
predict(object, newdata=data.frame)
The data frame supplied must have variables specified with the same labels as
the original. The value is a vector or matrix of predicted values corresponding
to the determining variable values in data.frame.
residuals(object)
Extract the (matrix of) residuals, weighted as appropriate.
Short form: resid(object).
step(object)
Select a suitable model by adding or dropping terms and preserving hierarchies.
The model with the smallest value of AIC (Akaike’s An Information Criterion)
discovered in the stepwise search is returned.
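A brief sketch, reusing the dat data frame from the lm example above:
step(lm(y ~ x1 + x2, data = dat)) #x2 contributes little and is likely to be dropped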
glm(formula, family=gaussian)
family can be one of:
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
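A minimal logistic-regression sketch with simulated data (all names below are invented for the illustration):
n<-200
xg<-rnorm(n)
p<-1/(1+exp(-(0.5+2*xg))) #true intercept 0.5 and slope 2 on the logit scale
yb<-rbinom(n,size=1,prob=p) #binary outcome
summary(glm(yb~xg,family=binomial)) #estimates should be close to 0.5 and 2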
For mixed models, use package lme4:
library(lme4)
?lmer
Ben Bolker maintains an extensive FAQ about mixed models and generalized linear mixed models.
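A tiny sketch using the sleepstudy data set shipped with lme4, with a random intercept and slope per subject:
library(lme4)
fit<-lmer(Reaction ~ Days + (Days|Subject), data=sleepstudy)
summary(fit)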
manova(cbind(y1,y2,y3)~A)
prcomp() for PCA, but it is better to use package ade4:
library(ade4)
?dudi.pca
The questions below should help you evaluate whether you also need to refresh your statistical knowledge. They can be divided into, first, identifying the statistic and/or statistical test to apply; and second, evaluating the statistic and/or applying the test using R. Each can be answered using a standard statistical test you should have learnt about during your bachelor program.
Answer these questions using the file class2005.txt at URL:
http://www2.unil.ch/popgen/teaching/R11/
Part of the strength of R is its almost infinite number of packages available to enhance it. However, this profusion has grown without much order, and it is often difficult to find what one is looking for. A notable exception is the Bioconductor suite of packages (http://www.bioconductor.org), which provides tools for the analysis and comprehension of high-throughput genomic data.
Other specific areas have been loosely organized in task views, a list of which can be found at http://cran.r-project.org/web/views/
To install a package, use install.packages(). For instance, to install ape, type:
install.packages("ape")
You might find it useful to install packages in a specific place rather than the default, either because you are working on a machine where you don’t have administrator rights, or because you don’t want to have to update everything each time you install a new version of R. One way is to create a .Renviron file in your home directory, where you specify R_LIBS=/your_path_to_Rlibs
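A sketch of installing to, and loading from, such a custom library path (the path itself is a placeholder):
install.packages("ape", lib="/your_path_to_Rlibs") #install outside the default library
library(ape, lib.loc="/your_path_to_Rlibs") #load it from the same place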
To load a package, use library(package), e.g.:
library(ape)
?plot.phylo
example(plot.phylo)