---
title: "An introduction to R"
author: "Jerome Goudet"
date: '`r Sys.Date()`'
output:
  html_document:
    number_sections: yes
    toc: yes
    toc_depth: 3
  pdf_document:
    toc: yes
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE)
```

\newpage

# Practical 1. A first R tutorial: win the lottery

## Prerequisites

For connection via your own laptop, pick the guest-unil network, connect to the internet and enter the following guest pass: INTROR

Moodle site: http://moodle2.unil.ch/enrol/instances.php?id=3333

Enrolment key: IntroR14 (to be confirmed)

What you must have installed: RStudio @ http://www.rstudio.com/

## Introduction

This course: the goal is *not to learn statistics* (I expect you to know your basic stats), but to learn how to use R to do stats and other things. The focus will be on loading and manipulating data, writing functions, plotting graphs, using packages, and searching for information.

How the course will run: each half day starts with a short introduction (half an hour to one hour at most) followed by a practical. Four assistants will be here to help you during the practicals. Use them!

**Assessment:** active participation and doing the exercises. If you fail to do so, a test Thursday afternoon.

Program

1. Monday am: Introductory tutorial
2. Monday pm: Data manipulation in R: vectors, matrices, lists and data frames (**room 2107**)
3. Tuesday am: Basic graphing functions
4. Tuesday pm: Functions in R
5. Wednesday am: Functions in R
6. Wednesday pm: Stats functions / bootstrap & randomization
7. Thursday am: Stats function bestiary
8. Thursday pm: Some packages (ade4, ...) / test if necessary

Mornings: 9am-12pm. Afternoons: 1pm-5pm.

### Documentation and setting things up

What is R: http://stat.ethz.ch/CRAN/doc/FAQ/R-FAQ.html

R editors. Why an editor? Syntax highlighting, plus you avoid re-typing commands...

1.
For all systems: the integrated environment **RStudio** ( http://www.rstudio.com/ ). **Highly recommended** (i.e., install it if not done yet and use it for this course). We will use R Markdown ( http://rmarkdown.rstudio.com/ ) to write notes and produce documents. [A short video about RStudio](http://player.vimeo.com/video/97166163?api=1&player_id=player_1)
2. For Windows: Tinn-R, Notepad++
3. For Linux: ESS (Emacs Speaks Statistics, distributed as an R package), Vim, ...
4. For Mac: TextMate or the online editor
5. A more extensive list: http://www.sciviews.org/_rgui/projects/Editors.html

### Installing and enhancing R

1. Binaries for Windows, Mac OS X and several brands of Linux, as well as the source code, are available from http://stat.ethz.ch/CRAN
2. Extra packages to extend R: from the same URL. Currently (Thursday May 28th, 2015), the CRAN package repository features 6692 available packages
3. Many packages are also available from GitHub (development versions)
4. The R home page for further information (bookmark this link!): http://www.r-project.org/
5. A search engine specific to R: http://www.rseek.org/
6. Much material on CRAN in the form of tutorials, FAQs, etc.
7. Books: a long list is available from http://www.r-project.org/doc/bib/R-books.html

***

## Questions/Tutorial

One of the best ways of getting acquainted with R is to use it to help you understand a particular set of data. So let us consider data from a lottery, where you might be motivated to perform data analysis. You are invited to work through the following familiarization session and see what happens. First-time users may not yet understand every detail, but **the best plan is to type what you see and observe what happens as a result**.

This chapter is mainly based on Becker, [Chambers and Wilks's book](http://www.amazon.com/New-Language-R-A-Becker/dp/0534091938) (1988, The New S Language, Chapter 1). The specific data we will look at concerns the New Jersey Pick-It Lottery.
Our data covers 254 drawings just after the lottery was started, from May 1975 to March 1976. Pick-It is a parimutuel game, meaning that the winners share a fraction of the money taken in for the particular drawing. Each ticket costs fifty cents, and at the time of purchase the player picks a three-digit number ranging from 000 to 999. The money bet during the day is placed in a prize pool, and anyone who picked the winning number shares equally in the pool.

The data is in the file lottery.txt. It can be loaded into R using the following command:

```{r}
data<-read.table("http://www2.unil.ch/popgen/teaching/R11/lotery.txt",header=TRUE)
```

```{r,eval=FALSE}
data #print the dataset on screen
```

```{r}
#If you want to see only the first few lines:
head(data)
```

For each drawing, the data gives the winning number and the payoff for a winning ticket. The winning numbers and the corresponding payoffs are:

```{r}
data$number # print the winning numbers
data$payoffs # print the payoffs
```

From these instructions we notice that, for the first drawing, the winning number was 810 and it paid \$190.00 to each winning ticket holder. In what follows we will try to examine the data.

Numerical summaries provide a statistical synopsis of the data in a tabular format. One such function is `summary`. The following displays a summary of the lottery payoffs:

```{r}
summary(data$payoffs)
```

We read from this that the mean payoff was \$290.4, that the payoffs ranged from \$83 to \$869.5, and that 50% of all payoffs lay between \$194.2 and \$364. The quantiles of a set of data x can also be computed with quantile(x, probs). For example,

```{r}
quantile(data$payoffs, c(.25, .75))
```

This means that 25% of the values are less than 194, and 75% of the payoffs are less than 365. A better way to understand the data is to look at it graphically.
With **RStudio**, a graphical window is already open, but if you are using classical *R* and want a new one, under Windows just type:

```
windows()
```

A separate window should appear on your screen. If you have a Mac, type:

```
quartz()
```

or, on Unix/Linux:

```
X11()
```

In our data, to detect long-term irregularities we will look at the winning numbers to see if they appear to be chosen at random. To do so we could produce a histogram of the lottery numbers:

```{r}
hist(data$number)
```

The histogram should show on your screen. Since there are 10 bars, each count should be approximately 25. It looks fairly flat, no need to inform a jury.

Of course, most of our attention will probably be directed at the payoffs. Elementary probabilistic reasoning tells us that a single number we pick has a 1 in 1000 chance of winning. If we play many times, we expect about 1 winning number per 1000 plays. Since a ticket costs fifty cents, 1000 plays will cost \$500, so we hope to win at least \$500 each time we win, otherwise we will lose money in the long run. So, let us make a histogram of the payoffs.

```{r}
hist(data$payoffs)
```

The figure shows that payoffs range from less than \$100 to more than \$800. The bulk of the payoffs are between \$100 and \$400, but there were a number of payoffs larger than \$500. Perhaps we have a chance.

The widely varying payoffs are primarily due to the parimutuel betting in the lottery: if you win when few others win, you will get a large payoff. If you are unlucky enough to win along with others, the payoff may be relatively small. Let us see what the largest and the smallest payoffs and corresponding winning numbers were:

```{r}
max(data$payoffs) # the largest payoff
data$number[data$payoffs==max(data$payoffs)]
min(data$payoffs) # the smallest payoff
data$number[data$payoffs==min(data$payoffs)]
```

Winners who bet on '123' must have been disappointed; \$83 is not a very large payoff. On the other hand \$869.50 is very nice.
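We can make the break-even reasoning above concrete with the observed mean payoff (a small addition to the text; it assumes the `data` object loaded earlier):

```{r}
p.win<-1/1000 # probability that a single ticket wins a given drawing
cost<-0.5 # price of a ticket, in dollars
p.win*mean(data$payoffs)-cost # expected net gain per play: about -21 cents
```

A negative expected gain per fifty-cent play confirms that, on average, the lottery loses you money.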
Since the winning numbers and the payoffs come in pairs, a number and a payoff for each drawing, we can produce a scatterplot of the data to see if there is any relationship between the payoff and the winning number. R provides a generic plotting function, plot, which produces different kinds of plots depending on the data passed to it. In its most common use, it produces a scatterplot of two numeric objects:

```{r}
plot(data$number, data$payoffs)
```

What do you see in the graph? Does the payoff seem to depend on the position of the winning number? Perhaps it would help to add a 'middle' line that follows the overall pattern of the data:

```{r}
plot(data$number, data$payoffs)
lines(lowess(data$number, data$payoffs, f=.2))
```

This command superimposes a smooth curve on the winning number and payoff scatterplot. Can you see the interesting characteristics now? There are substantially higher payoffs for numbers with a leading zero, meaning fewer people bet on these numbers. Perhaps that reflects people's reluctance to think of numbers with leading zeros. After all, no one writes \$010 on a ten dollar check! Also note that, except for the numbers with leading zeros, payoffs seem to increase as the winning number increases.

It would be interesting to see exactly what numbers correspond to the large payoffs. Fortunately, with an interactive graphical input device, we can do that by simply pointing at the 'outliers':

```{r}
plot(data$number, data$payoffs)
lines(lowess(data$number, data$payoffs, f=.2))
identify(data$number, data$payoffs, data$number) #might not work on all platforms
```

To identify a payoff with its corresponding winning number, just click on a point using the left mouse button. Once you have pointed out the 'outliers', just press the escape key. Can you see the pattern in the numbers with very high payoffs?
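While you think about that, the leading-zero effect seen on the smoothed curve can also be quantified directly (a small addition, again assuming `data` is loaded):

```{r}
leading.zero<-data$number<100 # winning numbers 000 to 099 have a leading zero
mean(data$payoffs[leading.zero]) # mean payoff for numbers with a leading zero
mean(data$payoffs[!leading.zero]) # mean payoff for the other numbers
```

The first mean should be noticeably larger than the second, matching what the lowess curve shows.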
As a hint: the lottery has a mode of betting, called 'combination bets', where players win if the digits in their number appear in any order (a ticket on 123 would win on 321, 231, etc.). The pattern is that most of the numbers with high payoffs have duplicate digits. This results from the fact that payoffs for numbers with duplicate digits are not shared with combination bettors, and thus are higher.

Another way to look for 'outliers' is to make a boxplot of the data. We noticed before that the payoffs seem to depend on the first digit of the winning number. So it would be interesting to draw boxes for the ten subsets of payoffs in a single plot to study this phenomenon graphically. Rather than extracting each set separately, we use the R function split to create a list, where each element of the list gives all of the payoffs that correspond to a particular first digit of the winning number. The boxplot function will draw a box for each element in the list.

```{r}
digit<-trunc(data$number/100) # the first digits
boxplot(split(data$payoffs, digit)) # boxplot of data split by groups
title(xlab="First Digit of Winning Number", ylab="Payoff")
abline(h=500) # horizontal line at 500
```

The box in a boxplot contains the middle half of the data; the whiskers extending from the box reach to the most extreme non-outlier; outlying points are plotted individually. Notice the high payoffs for the first box. The graphic also shows us that it is rare for a payoff to exceed \$500. So, place your bet if you enjoy gambling. Do not expect to win.

-----

\newpage

# Practical 2: Data manipulation in R: vectors, matrices, lists and data frames

## Introduction

You should be able to answer the questions at the bottom easily after having read and understood the text below. A piece of advice: type the examples in R, it helps!

The basic objects in R are vectors, matrices, lists and "data frames".

### Vectors

R operates on named data structures.
The simplest such structure is the vector, an ordered collection of numbers. For instance,

```{r}
x.vector<-c(5.1,6.3,7.8,9.3,10.5)
```

is a vector made up of five numbers. The command above uses the assignment symbol '<-' and the function 'c', which combines the five values into a vector. Note that the assignment can also be made in the other direction:

```{r}
c(5.1,6.3,7.8,9.3,10.5)->x.vector
```

If an expression is used without being assigned to an object, it is printed to the screen and then lost. For example the instruction

```{r}
1/x.vector
```

prints the inverse of the five values, but does not store them.

Vectors can be used in arithmetic expressions, the operations then being carried out on each element of the vector. For instance, look at the result of the 3 instructions below:

```{r}
x.vector-1
2*(x.vector-1)
x.vector^2
```

where '^' stands for power.

Let's make a second vector:

```{r}
y.vector<-c(1.3,3.4,8.1,3.6,5.7)
```

with the same length as x.vector. Look at the results of the following instructions:

```{r}
x.vector+y.vector
```

```{r}
x.vector*y.vector
```

Note that vectors of different lengths can be used in an operation. In this case, the shorter vector is recycled (duplicated) until it reaches the length of the longer one. If the length of the longer vector is not a multiple of that of the shorter, a warning is emitted:

```{r}
x2<-c(1,1)
y5<-c(1,2,3,4,5)
x2+y5
```

> Warning message:
> In x2 + y5 :
> longer object length is not a multiple of shorter object length

On top of the common operations, you can also use the functions 'log', 'log10', 'exp', 'sin', 'cos', 'tan', 'sqrt' etc...
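When the longer length is an exact multiple of the shorter one, recycling happens silently (a small added example):

```{r}
x2<-c(1,2)
y6<-c(10,20,30,40,50,60)
x2+y6 # x2 is recycled to 1,2,1,2,1,2: result is 11 22 31 42 51 62
```

Silent recycling is convenient (e.g. `x.vector-1` above recycles the single value 1), but it can also hide bugs, so keep it in mind when adding vectors of different lengths.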
'range' gives the minimum and maximum of a vector or matrix:

```{r}
range(x.vector)
c(min(x.vector),max(x.vector))
```

The number of items in a vector can be obtained using 'length':

```{r}
length(x.vector)
```

Thus, the valid indices for the elements of this vector are 1:length(x.vector) (':' between 2 numbers is an instruction to generate the sequence of integers between the 2 numbers). Elements of a vector can be extracted by specifying the element number between square brackets '[]'. For instance, item 3 of x.vector is:

```{r}
x.vector[3]
```

and the instruction

```{r}
x.vector[1:3]
```

extracts the first 3 elements of the vector. Two useful functions are the sum and product of the elements of a vector:

```{r}
sum(x.vector)
prod(x.vector)
```

Vectors can be made of logical elements:

```{r}
c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)
c(T,F,T,T,F,T)
```

where T stands for TRUE and F for FALSE. Beware, it is dangerous to use these shortcuts...

We can also have character strings:

```{r}
c("This","is","my","first","R","course")
```

For a sequence of integers, use ':':

```{r}
1:10
```

If you want it in steps of 2, type:

```{r}
2*1:10
```

This works because ':' has higher precedence than '*'. To obtain the reverse sequence:

```{r}
10:1
```

A more general function to obtain a sequence is seq (try ?seq):

```{r}
seq(-2*pi,3*pi,pi/4)
```

And another useful function is rep:

```{r}
rep(1,10)
rep(1:5,2)
```

Two other useful functions are 'sort' and 'order':

```{r}
y<-c(3.1,6.3,2.1,4.4,-1.9)
sort(y)
order(y)
#compare next expression with sort(y)
y[order(y)]
```

### Matrices

Matrices are obtained through the instruction 'matrix':

```{r}
m1<-matrix(1:12,nrow=3)
m1
```

By default (see ?matrix), matrices are filled by columns.
If you want to fill them by rows, use:

```{r}
m2<-matrix(1:12,nrow=3,byrow=T)
m2
(m2<-matrix(1:12,nrow=3,byrow=T)) # adding () around an expression prints it
```

Matrix dimensions are obtained using 'dim':

```{r}
dim(m1)
```

Note that a vector does not have a dimension (although it has a length):

```{r}
dim(x.vector)
```

You can coerce a vector into a matrix:

```{r}
m.x.vector<-as.matrix(x.vector)
dim(m.x.vector)
```

You can stick matrices together by rows or columns:

```{r}
rbind(m1,m2)
cbind(m1,m2)
```

Elements of a matrix are extracted using square brackets:

```{r}
m1[3,2]
m1[2,3]
```

We can access a full row

```{r}
m1[2,]
```

or column:

```{r}
m1[,2]
```

We can also remove a row or a column from a matrix:

```{r}
m1[-2,]
m1[,-2]
```

Elementwise multiplication of matrices can be carried out:

```{r}
m1*m2
```

And the classic matrix product is obtained using '%*%':

```{r}
m1 %*% t(m2)
t(m1) %*% m2
```

where 't' stands for transpose (by the way, why don't we take the product m1 %*% m2?).

### Lists

A list in R is an object containing an ordered collection of objects, its components. Objects in a list do not have to be of the same type.
A new list is created using 'list':

```{r}
(my.list<-list("first element"=T, "second element"="Coucou","third"=c(4,3),"fourth"=9.99))
```

To get to a specific element of a list, double square brackets '[[ ]]' are used:

```{r}
my.list[[2]]
```

Note that to obtain several elements, we use simple square brackets:

```{r}
my.list[1:3]
```

Last, it is convenient to call an element of a list by its name rather than by its position, using '$' followed by the first distinguishing characters of the element's name:

```{r}
my.list$fir
```

(here, the first 2 letters would have been sufficient)

```{r}
my.list$s #second element
```

The names of the elements of a list can be obtained with 'names':

```{r}
names(my.list)
```

and the internal structure of a list can be obtained with 'str':

```{r}
str(my.list)
```

### Data frames

Data frames can be considered as lists of data vectors of the same length, but not necessarily of the same type (contrary to matrices). The main idea behind data frames is to group the data corresponding to an entity of observation in a single structure. It's similar in concept to your familiar Excel spreadsheet. Elements of a data frame can be accessed in the same way as those of a list or of a matrix:

```{r}
a<-1:12
b<-12:1
cc<-letters[1:12] #why cc rather than c?
d<-rep(c(TRUE,FALSE),each=6) #careful, T & F are dangerous, use TRUE and FALSE instead
dataf<-data.frame(A=a,B=b,C=cc,D=d)
dataf
dataf[,2]
dataf$B
names(dataf)
str(dataf)
```

Square brackets allow selecting part of a data frame. The double equal sign '==' selects elements equal to a criterion. For instance, in data frame dataf, to select the elements of D equal to TRUE:

```{r}
dataf$D[dataf$D==TRUE]
```

To select the elements of A for which D is equal to TRUE:

```{r}
dataf$A[dataf$D==TRUE]
```

And to select all the columns of the data frame for which D is TRUE:

```{r}
dataf[dataf$D==TRUE,] #why the comma?
```

An extremely useful command is read.table, which loads a text file into the environment and associates it with a data frame. If, for instance, you want to read the content of the file 'monfichier.txt' into R, you just need to type the command:

```{r,eval=FALSE}
my.data<-read.table("monfichier.txt",header=T)
my.data
```

For the command to work, of course, the file needs to be in the working directory. To set the working directory use the command 'setwd("d:/mydir")' (note the /, not \), and to locate the current working directory use 'getwd()'. If the file is on the internet, you can just enter the URL. For instance, if your file is at the URL http://www.unil.ch/popgen/teaching/R11, you can load it with read.table using the full URL of the file:

```{r,eval=FALSE}
my.data<-read.table("http://www.unil.ch/popgen/teaching/R11/monfichier.txt", header=T)
my.data
```

Sometimes, for very large data sets, 'read.table' is very slow because it attempts to guess the type of each column. If the data are all of the same type, you might want to use 'scan' (see `?scan`) instead of 'read.table'. Last, to save a data frame as a text file:

```{r,eval=FALSE}
write.table(my.data,"mydata.txt",row.names=FALSE,quote=FALSE,sep="\t")
```

----

## Questions

Answer the questions below:

0. Why should I avoid calling objects c, T or F?
1. Clean your R environment by issuing the instruction rm(list=ls())
2. For the following numbers 7.3, 6.8, 0.005, 9, 12, 2.4, 18.9, 0.9
    1. Find their mean,
    2. subtract it from all the numbers,
    3. get their square root,
    4. print the numbers that are larger than their square root
3. Use seq and rep to generate the following vectors:
    (a) 1; 2; 3; 4; 1; 2; 3; 4; 1; 2; 3; 4; 1; 2; 3; 4;
    (b) 10; 10; 10; 9; 9; 9; 8; 8; 8; 7; 7; 7;
    (c) 1; 2; 2; 3; 3; 3; 4; 4; 4; 4; 5; 5; 5; 5; 5;
    (d) 1; 1; 3; 3; 5; 5; 7; 7; 9; 9;
4. Create the following matrix, as simply as possible:

> 3 1 1 1
> 1 3 1 1
> 1 1 3 1
> 1 1 1 3

4a. Multiply this matrix elementwise with itself.
Get its matrix product with itself.
5. Solve the following system (have a look at solve):

> 2x + 3y - z = 4
> 5x - 10y + 2z = 0
> x + y - 4z = 5

6. Create a list mylist with the following elements:
    + numbers 1 to 5. Call it nb1
    + numbers 1 to 50 by 0.375 steps. Call it nb3
    + the string "My mummy told me". Call it char1
    + a vector made of the strings "apples", "bananas", "pears". Call it char2
    + the boolean vector FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE. Call it bool
7. What's the list's length? Calculate the mean of the first element; of the second; of the first and second
8. Paste the third and fourth elements of the list (use paste). Which instruction would give the sentence "My mummy told me: apples, bananas, pears" (be careful, punctuation is important!)
9. Create a data frame with the data contained in file parus1.txt located at http://www2.unil.ch/popgen/teaching/R11
10. Create a vector data.fly containing the observations from alldat for which treat is mouche. Add to data the element: "puce", 12.7300
11. Try loading the file alien.txt (from http://www2.unil.ch/popgen/teaching/R11). What is wrong with this file? Fix it.
12. Try loading the file class2005.xls. What needs to be done before you can load it into R?

------

\newpage

# Practical 3: Basic graphing functions, scatter plot and regression

## Introduction

### Plots

We have already met some of the graphic functions in R (hist, boxplot...). We will now look at the plot function in more detail. If you are not using **RStudio**, before anything else you need to open a graphics window: on Windows, use `windows()`; on Mac, use `quartz()`; on Unix/Linux, use `X11()`. Plots can also be sent directly to specific files, without being printed on screen. Available devices are `pdf()`, `postscript()`, `png()` and `jpeg()`, among others. Once you open such a device, you need to close it using the function `dev.off()`. pdf & postscript files can have multiple pages, but not the others.
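The device workflow can be sketched as follows (a small added example; the file name `myplots.pdf` is an arbitrary choice):

```{r,eval=FALSE}
pdf("myplots.pdf") # open a pdf device: plots now go to the file, not the screen
hist(rnorm(100)) # first page of the pdf
plot(1:10,(1:10)^2,type="l") # second page of the same pdf
dev.off() # close the device; only now is the file complete
```

Forgetting `dev.off()` is a classic mistake: the file stays open and may be empty or truncated.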
To draw a function, use plot:

```{r}
x<--10:10
plot(x,sin(x))
```

What is the difference between the previous command and `plot(sin(x))`?

The plot instruction has numerous options (`?plot` for more). For instance, one can give labels to the axes, a title, the printing color and symbols, etc.:

```{r}
x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),ylab="f(x)")
x<-seq(0,4,0.1)
plot(x,exp(-(x^3+2*x^2-3*x+1)),ylab="exp(f(x))")
x<-seq(0,4,0.01)
plot(x,exp(-(x^3+2*x^2-3*x+1)),ylab="exp(f(x))",col="red")
```

plot's default is to use points. Using the type option, we can specify whether we want points and lines (type="b"), a line (type="l") or nothing (type="n"):

```{r}
x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="b")
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="n")
```

If you want to see the 2 graphs in one window, you can specify the number of panels to show, using the par function (see ?par):

```{r}
par(mfrow=c(1,2))
x<-seq(-10,10,0.01)
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12))
plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")
par(mfrow=c(1,1))#to go back to one panel
```

We can also add several functions to the same graph. After first using plot, which defines the first things to be printed, the axis ranges, labels etc., you can use the functions lines, text or points:

```{r}
x<-seq(0,2,0.05)
ymax<-10
plot(x,x,ylim=c(0,ymax),ylab="f(x)")
lines(x,x^0.5,col="blue")
points(x,x^2,col="red")
text(x,x^3,labels="3",col="green")
segments(0.25,4,1.25,8,col="purple",lwd=2)
```

Note that several calls to plot will erase each other in turn.
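When several curves share one set of axes, the `legend` function can label them (a small added example, not from the original text):

```{r}
x<-seq(0,2,0.05)
plot(x,x,ylim=c(0,10),type="l",ylab="f(x)") # first curve sets up the axes
lines(x,x^2,col="red")
lines(x,x^3,col="green")
legend("topleft",legend=c("x","x^2","x^3"),col=c("black","red","green"),lty=1)
```

The first argument places the legend ("topleft", "bottomright", etc., or x/y coordinates); `lty=1` draws a solid line sample next to each label.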
plot is of course useful for showing a scatter plot of 2 variables, but it can also be used for a single vector:

```{r}
a<-rnorm(100); plot(a)
b<-which(abs(a)>1.96)
#now create a vector specifying a red color for values
#larger than 1.96
coul<-rep("black",100); coul[b]<-"red"
points(a,col=coul) #print these points in red
lines(a,type="h",col=coul) #type="h" draws a vertical bar from the x axis
```

Other useful plots are histograms, barplots and pie charts (for categorical data):

```{r}
hist(a)
categ<-rep(c("white","black","red"),c(20,20,40))
#We first need to make a count of the number of each color, using table:
tcateg<-table(categ)
barplot(tcateg,col=c("black","red","white"))
pie(tcateg,col=c("black","red","white"))
```

The list of colors predefined in R: `colors()`

For a tour of R's graphics abilities, type:

```{r,eval=FALSE}
demo(graphics)
```

### Variance, covariances, correlations

Variances, covariances and correlations are obtained using `var`, `cov` and `cor` respectively. These functions can give the correlation between 2 vectors, or between all variables (columns) of a data frame:

```{r,eval=FALSE}
cor(x,y); cor(d);
```

These functions don't like missing data. If you have such data, you need to specify how to remove them. You can either remove the whole line containing one missing observation, `(use="complete.obs")`, or eliminate missing values only when they turn up in a calculation `(use="pairwise.complete.obs")`. Compare the results of the following instructions:

```{r}
x<-c(1,2,3,4,NA); y<-c(1,2,3,4,-100); z<-c(4,3,2,1,10)
d<-data.frame(x,y,z)
cor(d, use="complete.obs")
cor(d, use="pairwise.complete.obs")
```

These functions return one element only (a number or a matrix). More complex functions, for instance lsfit (`?lsfit`) or lm (`?lm`), return a list of elements. To access each element individually, use `$`. For instance:

```{r,eval=FALSE}
fit<-lsfit(x,y)
fit$coef
lm(y~x)$coef
```

To get the names of all the elements of an object, use `names()`.
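As a runnable illustration of this (with simulated data, an addition to the text), here is `names` and `$` at work on an `lm` fit:

```{r}
x<-1:10
y<-2+3*x+rnorm(10,sd=0.5) # simulated data: true intercept 2, true slope 3
fit<-lm(y~x)
names(fit) # all the components stored in the fitted object
fit$coefficients # the estimates, close to 2 and 3
```

Because the fit is a list, everything listed by `names(fit)` (residuals, fitted.values, ...) can be extracted the same way with `$`.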
A word on lsfit. It estimates, by least squares, the slope and intercept of a regression line $Y=a+X*b$. The intercept $a$ is the value of $Y$ for which $X=0$. The slope $b$ expresses by how many units $Y$ varies when $X$ changes by one unit. We'll see `lm` in more detail later.

Also note the package [ggplot2](http://ggplot2.org/):

```{r,eval=FALSE}
#if package not yet installed (see practical 7)
install.packages("ggplot2")
library(ggplot2)
#?ggplot for help
```

It has many more options and capabilities; see http://ggplot2.org/ for details.

Last, 2 books specifically about R graphics:

Murrell Paul (2006) R Graphics: http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html

Hadley Wickham (2009) ggplot2: http://ggplot2.org/book/

***

## Questions

1. Load the file http://www.unil.ch/popgen/teaching/R11/data_etudiants_biol_0506.txt
    a. Rename the columns with more meaningful names (e.g. sex, height, weight, shoesize etc...)
    b. Produce a bar chart of the proportion of each sex and of the proportions of the different hair colors; produce a bar chart of the proportions of smoking men, non-smoking men, smoking women and non-smoking women (use barplot; hint: you might need the table function)
    c. Produce a pie chart of the proportions of eye colors
    d. Produce a boxplot of sizes as a function of sex
    e. Plot the scatterplot of weight against height, using the sex of individuals as labels, and label your axes. Use a different color for each sex. Explore the different arguments of the plot function: add a title, play with the size of characters. Try adding the regression lines for males and for females to the graph, using a different color for each sex. From this graph, would you say that the relation between weight and size differs between the two sexes? Which statistical test would be appropriate to test this hypothesis?
    f. Produce a box plot of weight as a function of smoking status and sex
    g.
Produce a two-panel graph (one panel for males, the other for females) of the scatterplot of weight versus height. Differentiate smokers from non-smokers using a color argument.
    h. If you have the time and inclination, explore the package `ggplot2` and find the commands necessary to produce the previous plots with it.
2. What is the correlation between weight and size in men? In women? And the covariance?
3. Load the file http://www.unil.ch/popgen/teaching/R11/scatt.txt. In a four-panel window (2 by 2), draw the scatterplots of $y_1$, $y_2$, $y_3$ and $y_4$ as a function of $x$. What relation between the ys and x do they show? What is the correlation between $x$ and $y_1$, $y_2$, $y_3$ and $y_4$?
4. Use abline to add the line that best predicts the different ys as a function of $x$ (you'll have to redraw the preceding graphs). Use the `lsfit` or the `lm` function, and its element coef, to obtain the least squares/linear model best fit.
5. What are the 10th and 90th percentiles of $y_5$? Plot the scatter diagram of $y_5$ versus $x$. What type of relation exists between the 2? Use `lsfit` and `abline` to produce the best linear fit line.
6. Store the predicted values of $y_5$ in a vector `y5.pred`. What are the residuals of the regression of $y_5$ on $x$? Store them in `y5.res`
7. Plot the residuals as a function of x, and comment on the graph.
8. Do questions 6 & 7 for variable $y_4$. Is showing residuals as a function of x useful? Why?

----

\newpage

# Practical 4: Functions in R

## Introduction

R comes with several built-in functions (`mean, var, cor, plot, lm`, ...), and many more can be downloaded from http://www.r-project.org as packages. But you might (will) need to write your own functions to carry out repetitive tasks. The basic concepts are very similar to those you've learned for other programming languages, such as C++, Java or Fortran. This practical should give you a first brief feel for function writing in R.
### General structure of a function:

```{r,eval=FALSE}
my.function<-function(x,y){
instruction 1
instruction 2
.
.
.
instruction n
}
```

Once it has been typed in, it is used as `my.function(a,b)`, where a and b are the two arguments to pass to the function. An example will clarify things:

```{r}
my.mean<-function(a){ #this function will calculate the mean of a
return(sum(a)/length(a))
}
#
my.mean(1:10)
#
my.var<-function(a){
n<-length(a)
m<-sum(a)/n
va<-sum((a-m)^2)/n
return(va)
}
#
my.var(1:10)
```

In this function, the return(va) on the last line returns va (try writing this function without the return statement). You will notice that this last function does not give the same results as the built-in var function. Why? Looking at the help for var gives it away: the built-in var produces an unbiased variance, that is, the divisor is (n-1) rather than n. We could rewrite our function as:

```{r}
my.var.ub<-function(a){
n<-length(a)
m<-sum(a)/n
va<-sum((a-m)^2)/n
va<-n/(n-1)*va
return(va)
}
#
my.var.ubb<-function(a){
n<-length(a)
va<-my.var(a)*n/(n-1)
return(va)
}
```

Ideally, our function should let us choose whether we want the variance with or without bias. We can use an if instruction in the body of the function to tell it what to return:

```{r}
my.var.n<-function(a,bias=TRUE){
n<-length(a)
m<-sum(a)/n
va<-sum((a-m)^2)/n
if (!bias) {
va<-va*n/(n-1)
}
return(va)
}
#
my.var.n(1:10)
my.var.n(1:10,bias=FALSE)
var(1:10)
```

In this function, bias is a boolean. By default, the biased variance is calculated; then the if statement is evaluated, and the unbiased variance is calculated if bias is FALSE (! stands for negation). If we want the biased variance, we don't need to tell R anything: the default for bias is TRUE.
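As another small exercise with default arguments (an addition, not from the original text), here is a standard-error function that can optionally drop missing values:

```{r}
my.se<-function(a,na.rm=FALSE){
if (na.rm) a<-a[!is.na(a)] # drop missing values if asked to
sqrt(var(a)/length(a)) # standard error of the mean; the last value is returned
}
#
my.se(1:10)
my.se(c(1:10,NA),na.rm=TRUE)
```

Note that no explicit return is needed: the value of the last evaluated expression is returned, as you saw when rewriting my.mean without return.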
### Control structures

The control structures of R are:

~~~
if (cond) expr (if cond is TRUE, do expr)
if (cond) expr1 else expr2 (if cond is TRUE, do expr1, else do expr2)
if (cond) {cons.expr1; cons.expr2} else {alt.expr1; alt.expr2} (same as before but with several expressions for each situation. Note the curly brackets around the expressions)
for(i in seq) expr (do expr as many times as there are elements in seq)
~~~

```{r}
x<-1:10
for (i in 1:10) x[i]<-i+1 #2:11
```

~~~
while(cond) expr (do expr as long as cond is TRUE)
repeat { expr; if (cond) break } (do expr as long as cond is FALSE, i.e. as long as we do not enter the if statement. Beware, break is a necessity if you don't want to end up in an infinite loop)
~~~

To print the squares of the first 20 integers, we could write:

```{r}
for (i in 1:20) {
cat(i,i^2,"\n")
}
#for details on cat, ?cat. "\n" is the "return" character.
```

To print sequentially the sums of the first 10 integers:

```{r}
i<-0;j<-0;
repeat{
i<-i+1;
j<-j+i;
cat(i,j,"\n");
if (i>=10) { break}
}
#
i<-0;j<-0;
while(i<10){
i<-i+1;j<-j+i;cat(i,j,"\n");
}
```

Compare the result with `cumsum(1:10)`. Note: the ';' allows you to put several expressions on the same line.

More information on writing functions can be found in several online tutorials:

http://cran.r-project.org/doc/manuals/R-intro.html#Writing-your-own-functions

http://zoonek2.free.fr/UNIX/48_R/02.html

http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

***

## Questions

1. Obtain the mean, median, variance and mad of the following observations: 50; 60; 10; 20; 10; 40; 30; 800; 30; 40; 10
2. Redo the calculations without the outlier. Conclusions?
3. Redo these calculations without using the R built-in functions. [Reminder: the median is the value that splits the data into 2 equal parts.
If the number of data points is odd, the median is the ((n+1)/2)-th largest datum; if the number of data points is even, it's the mean of the (n/2)-th and (n/2+1)-th largest. And MAD=med(abs(x(i)-med(x)))]
4. Load the file class2005.xls from http://www.unil.ch/popgen/teaching/R11/class2005.xls. How many non-smoking students are there in this data set? How many different hair colors?
5. Draw a histogram of sizes. Are the different size classes equifrequent?
6. Plot the empirical cumulative distribution of sizes (use plot and sort). Using this plot, estimate the 30th and 70th percentiles of the distribution of sizes.
7. Use hist, qqnorm (?qqnorm) and qqline to check whether sizes are normally distributed. Which is better? How would you test for normality?
8. Use quantile to validate your approximation for question 6.
9. What's the mean weight? Estimate it in kg, then in g (you don't need to use R to estimate it in g).
10. What's the weight variance in kg^2? In g^2? (you don't need to use R to estimate it in g^2)
11. Rescale the variable weight so that it has mean 0 and sd 1. Why is this transformation useful? What are the 2.5th and 97.5th percentiles of the rescaled weight?

***

\newpage

# Practical 5: Statistical distributions and functions in R

## Introduction

R has built-in functions to calculate probability densities, cumulative distributions and inverses of cumulative distributions (quantiles) for several standard probability distributions. Furthermore, R can generate random values from these distributions. There are 4 functions for each distribution, starting with the following letters:

~~~
1. d: for density, for instance 'dnorm' for the probability density of the normal distribution;
2. p: for probability, e.g. 'pnorm';
3. q: for quantile, e.g. 'qnorm';
4. r: for random, e.g. 'rnorm';
~~~

The classical distributions are:

- normal (see ?pnorm): central to all of statistics; takes the mean and standard deviation as parameters.
- binomial (?pbinom): takes 2 parameters, n the number of trials and p the probability of a success.
- Poisson (?ppois): determined by a single parameter lambda.
- geometric (?pgeom)
- uniform (?punif)
- chi-square (?pchisq)
- F (?pf)
- t (?pt)
...

These are the most common, but several others are also available: `?Distributions`, and even more here: http://cran.r-project.org/web/views/Distributions.html

Let's take 1000 random numbers from a normal with mean 0 and sd 1 (the default parameters):

```{r}
a<-rnorm(1000)
```

and plot a histogram of this vector:

```{r}
hist(a)
qqnorm(a);qqline(a) #best way to verify normality!
```

Do the same but with mean 4 and sd 5:

```{r}
b<-rnorm(1000, mean=4, sd=5)
hist(b)
qqnorm(b);qqline(b)
```

`pnorm` gives the probability that a normal deviate is less than a given number X, for instance -1.96:

```{r}
pnorm(-1.96)
pnorm(1.96) #why?
```

If a random variable follows a normal, then it has a 2.5% chance of being smaller than -1.96 (and, since this distribution is symmetrical, it also has a 2.5% chance of being larger than 1.96).

With `qnorm`, you do the reverse: what is the value of the normal deviate for which I have a 2.5% chance of finding a lower deviate?

```{r}
qnorm(0.025)
```

Now try:

```{r}
mean(a)
sd(a)
b<-rnorm(1000,20,2)
mean(b)
sd(b)
```

Some examples with other distributions:

```{r}
rbinom(30,50,0.5) #30 random deviates from a binomial with n=50 and p=0.5
rpois(30,2.4) #30 random deviates from a Poisson with lambda=2.4
dpois(0,2.4) #probability of observing 0 if the underlying distribution is a Poisson with parameter 2.4
plot(0:10,dpois(0:10,2.4),type="h")
```

A particularly interesting distribution is that of p-values under the null hypothesis: it can be shown that it should be uniform. Now, with big data available everywhere, we are often confronted with a very large number of p-values, and, by chance, some will fall below the classical 5% threshold. How do we decide which p-values are really outliers?
```{r}
pval<-runif(100000)
hist(pval,breaks=seq(0,1,0.05))
```

Out of `r length(pval)` p-values, `r sum(pval<=0.05)` are significant at the 5% level. Is there something to worry about?

```{r}
n<-length(pval)
par(mfrow=c(1,2))
plot((1:n)/n,sort(pval),xlab="theo. perc.",ylab="obs. perc.");abline(c(0,1))
plot(-log10((n:1)/n),sort(-log10(pval)),xlab="theo perc. (-log10)",ylab="obs. perc. (-log10)");abline(c(0,1))
```

Now imagine that 5 of the p-values are really, really small, say less than $10^{-6}$:

```{r}
pval[1:5]<-runif(5)/1000000
par(mfrow=c(1,2))
plot((1:n)/n,sort(pval),xlab="theo. perc.",ylab="obs. perc.");abline(c(0,1))
plot(-log10((n:1)/n),sort(-log10(pval)),xlab="theo perc. (-log10)",ylab="obs. perc. (-log10)");abline(c(0,1))
```

The outlying values now appear clearly on the -log10 (right) panel, but not on the untransformed panel. This type of plot is commonly used to identify p-values that are outliers.

Another useful function, connected to the binomial distribution, is `sample`. It allows sampling with or without replacement from a vector (see `?sample`). This function is at the heart of randomization and bootstrap tests:

```{r}
sample(1:10,replace=FALSE) #permutation of the elements
sample(0:99,size=200,replace=TRUE) #200 random draws of numbers between 0 and 99
sample(1:0,size=1000,replace=TRUE,prob=c(0.2,0.8)) #same as rbinom(1000,size=1,p=0.2)
```

***

## Questions

1. Generate 254 random numbers drawn from a uniform between 0 and 1000. Draw a histogram of these numbers. Does it look familiar? Go back to the lottery data set, and now think of a graphical way of verifying the uniformity of the winning numbers (hint: have a look at qqplot). Can you think of a way of testing this?
2. Imagine that the size of the male population in Switzerland follows a normal with mean 1.72 m and sd 0.3 m.
    a. What is the probability that an individual from this population is 1.60 m or less?
    b. 1.80 m or less?
    c. 1.90 m or more?
    d. 2.00 m or more?
    e.
What proportion of the male population should be between 1.5 and 1.80 meters tall?
3. The greater white-toothed shrew, Crocidura russula, is a small anthropophilic insectivore living around here, and is studied by research groups in Lausanne. The population size is stable, meaning that the mean number of surviving offspring per female is 2. Suppose that the number of surviving offspring follows a Poisson distribution. What is the probability that a female has no surviving offspring? One surviving offspring? 4 or more? (beware, there is a small trap here)
4. In several species, the probability p of dying is independent of age. The distribution of mortality ages then follows a geometric distribution, $P(X=x)=p(1-p)^x$. We are interested in a population where the probability of dying during a year is 0.15. You can use the geom functions (pgeom, rgeom, dgeom, qgeom) to answer the following:
    a. In this population, what is the probability of dying at 20?
    b. What is the probability of dying before 2?
    c. What is the probability of reaching 30?
    d. What is the mean age? (hint: think about expected value)
    e. What is the variance of age?

*The following exercise is a good way of demonstrating the central limit theorem*

5. Imagine that we are rolling a die. Using the function sample, generate 1000 die rolls and store them in a vector x1. How are these data distributed? Now roll two dice simultaneously 1000 times, and store the sum of the 2 displayed numbers in a vector x2. How are these data distributed? Do the same for 5 dice (in vector x5) and 20 dice (in vector x20). Plot on the same figure the normal quantile-quantile plots (qqnorm) obtained from these 4 experiments. Conclusions? [optional] You can do the same for other distributions, for instance the geometric distribution we've seen in the previous paragraph.

*The following two exercises will show you how to use R to do simulations.
To solve them, start by thinking about what type of information you'd like the function to produce, and then think about ways of generating it. You'll have to create vectors or matrices and use loops.*

6. Write a function that simulates the evolution of the size of a finite hermaphroditic population with non-overlapping generations under demographic stochasticity. We will suppose that fertility follows a Poisson distribution. This function should take as arguments the initial population size, the fertility, the number of generations over which you want to follow population size, and the possibility of plotting the graph of the census size as a function of generation. At the end, it should output the population sizes through time.
7. Write a function drift that simulates the effect of random genetic drift. It should take as parameters the population size (fixed), the number of generations, the initial allele frequency and a number of replicates. At the end, it should produce a graph of the changes through time of the allele frequencies in the different replicates.

***

\newpage

# Practical 6: Tests in R

## Introduction

R has built-in functions for carrying out your typical statistical tests. We will go through them in this practical. We'll use the following data set to illustrate some of these functions:

```{r}
dat<-read.table("http://www.unil.ch/popgen/teaching/R11/class2005.txt",header=TRUE)
```

***

### Categorical data

Chi-square tests: `chisq.test` performs chi-squared contingency table tests AND goodness-of-fit tests. Usage:

```{r,eval=FALSE}
chisq.test(x, y = NULL, correct = TRUE,
           p = rep(1/length(x), length(x)), rescale.p = FALSE,
           simulate.p.value = FALSE, B = 2000)
```

Note: if you have 2 vectors, each containing the state of the observations (for instance, eye and hair colours), you can obtain the contingency table using the `table` function (`?table`). Another useful function is `xtabs`.

`fisher.test` performs Fisher's exact test for count data.
It tests the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.

```{r,eval=FALSE}
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
            control = list(), or = 1, alternative = "two.sided",
            conf.int = TRUE, conf.level = 0.95,
            simulate.p.value = FALSE, B = 2000)
```

***

### Comparing distributions

For comparing two empirical distributions, or an empirical distribution to a theoretical one, use the Kolmogorov-Smirnov test:

```{r,eval=FALSE}
ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"), exact = NULL)
```

`shapiro.test`: for testing normality.

***

### Comparing two samples

```{r,eval=FALSE}
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
t.test(x,mu=0)
t.test(x,y)
t.test(x,y,paired=T)
```

Examples:

```{r}
x<-rnorm(30)
y<-rnorm(30,mean=0.6)
t.test(x)
t.test(y)
t.test(x,y)
```

Wilcoxon tests (the non-parametric equivalent of t-tests):

```{r,eval=FALSE}
wilcox.test(x,y)
wilcox.test(x,y,paired=T)
```

***

### Simple bootstrap

Percentile confidence intervals: several functions are available in package boot, but it is simple to program yourself. For instance, for a confidence interval of the mean:

```{r}
bootstrap<-function(x,perc=c(0.025,0.975),nboot=1000){
bm<-vector(length=nboot) #vector where bootstrap means are stored
n<-length(x)
for (ib in 1:nboot){
dum<-sample(n,replace=T) #sampling with replacement of the indices of the observed vector
bm[ib]<-mean(x[dum],na.rm=T) #bootstrap sample ib of the mean
}
return(list(ci=quantile(bm,probs=perc),bm=bm))
}
#example
x<-runif(1000,min=500,max=1000)
bx<-bootstrap(x)
bx$ci
par(mfrow=c(1,2))
hist(x)
hist(bx$bm)
```

***

### One way ANOVA

```{r,eval=FALSE}
oneway.test(y~A,var.equal=TRUE)
aov(y~A)
```

Testing homogeneity of variances:

```{r,eval=FALSE}
bartlett.test #strongly depends on the normality assumption
```

Levene test: less sensitive to the normality assumption, hence preferred, but it has to be programmed.
The simple idea: if variances are heterogeneous, this should show up when one does an ANOVA on the absolute values of the residuals:

```{r}
set.seed(12) #to make sure that you obtain the same results as me
y1<-rnorm(30,mean=2,sd=2)
y2<-rnorm(30,mean=4,sd=10)
y3<-rnorm(30,mean=6,sd=2)
y<-c(y1,y2,y3)
a<-factor(rep(1:3,each=30)) #factor will be further explained below
par(mfrow=c(2,2))
boxplot(y~a)
res<-unlist(tapply(y,a,function(x) x-mean(x))) #tapply is very useful, as are apply and lapply
#unlist transforms a list into a vector (provided the elements of the list
#are of the same type)
boxplot(res~a)
boxplot(abs(res)~a) #see the difference in mean/median among the groups?
summary(aov(abs(res)~a))
```

The car library has a function for this, `leveneTest` (`?leveneTest`).

***

### Permutation tests (distribution-free ANOVA)

```{r}
anova.perm<-function(y,x,nperm=100,var.equal=TRUE) {
obs.f<-oneway.test(y~x,var.equal=var.equal)$statistic
bf<-numeric(length=nperm)
bf[nperm]<-obs.f
for (i in 1:(nperm-1)){
bf[i]<-oneway.test(y~sample(x),var.equal=var.equal)$statistic
}
return(list(p.val=sum(bf>=bf[nperm])/nperm,bf=bf))
}
#example
y<-rnorm(100)
x<-factor(rep(1:4,each=25))
oneway.test(y~x,var.equal=TRUE)
res1<-anova.perm(y,x)
res1$p.val
```

***

The non-parametric equivalent of the one-way ANOVA is the Kruskal-Wallis test:

```{r,eval=FALSE}
kruskal.test(y~A)
kruskal.test(list)
kruskal.test(y,A)
```

***

### Correlation

~~~
cor.test Test for association between paired samples, using one of Pearson's product-moment correlation coefficient, Kendall's tau or Spearman's rho.

Usage:

cor.test(x, ...)

## Default S3 method:
cor.test(x, y,
         alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95, ...)
~~~

pearson is the classical correlation; spearman is the correlation computed on the ranks of the observations:

```{r}
x<-rnorm(100)
y<-rnorm(100)+x
cor(x,y) #pearson
cor.test(x,y)
cor(x,y,method="spearman") #same as cor(rank(x),rank(y))
cor.test(x,y,method="spearman")
#kendall is more complicated; it is based on the difference between concordant and discordant pairs of observations
```

***

### Formulae in R

The operator ~ is used to define a model formula in R. It means: explain what's on the left of ~ by what's on the right of ~. Suppose y, x, x0, x1, x2, ... are numeric variables, X is a matrix and A, B, C, ... are factors (?factor). The formulae on the left side below specify the statistical models described on the right:

~~~
y ~ x
y ~ 1 + x
Both imply the same simple linear regression model of y on x. The first has an implicit intercept term, the second an explicit one.

y ~ 0 + x
y ~ -1 + x
y ~ x - 1
Simple linear regression of y on x through the origin (that is, without an intercept term).

log(y) ~ x1 + x2
Multiple regression of the transformed variable, log(y), on x1 and x2 (with an implicit intercept term).

y ~ 1 + x + I(x^2)
Polynomial regression of y on x of degree 2.

y ~ A
Single classification analysis of variance model of y, with classes determined by A.

y ~ A + x
Single classification analysis of covariance model of y, with classes determined by A, and with covariate x.

y ~ A*B
y ~ A + B + A:B
y ~ A + B %in% A
y ~ A/B
Two factor non-additive model of y on A and B. The first two specify the same crossed classification and the second two specify the same nested classification.

y ~ (A + B + C)^2
y ~ A*B*C - A:B:C
Three factor experiment but with a model containing main effects and two factor interactions only. Both formulae specify the same model.

y ~ A * x
y ~ A/x
y ~ A/(1 + x) - 1
Separate simple linear regression models of y on x within the levels of A, with different codings. The last form produces explicit estimates of as many different intercepts and slopes as there are levels in A.
y ~ A*B + Error(C)
An experiment with two treatment factors, A and B, and error strata determined by factor C. For example a split plot experiment, with whole plots (and hence also subplots) determined by factor C.

cbind(y1,y2,y3) ~ A
Formula for MANOVA.
~~~

***

### Linear models (regression, multiple regression, ANOVA, ANCOVA)

The basic function for fitting ordinary multiple models is lm(), and a streamlined version of the call is as follows:

fitted.model <- lm(formula, data = data.frame)

For instance

```{r,eval=FALSE}
fm2 <- lm(y ~ x1 + x2, data = dat)
```

would fit a multiple regression model of y on x1 and x2 (with an implicit intercept term). Careful: the anova table is built sequentially, so if x1 and x2 are correlated, anova() of lm(y~x1+x2) differs from anova() of lm(y~x2+x1).

```{r}
x1<-rnorm(100,mean=10,sd=2)
x2<-rnorm(100,mean=5,sd=3)
y<-8*x1+rnorm(100,sd=5)
dat<-data.frame(y=y,x1=x1,x2=x2)
fm2 <- lm(y ~ x1 + x2, data = dat)
summary(fm2)
anova(fm2)
fm1<-lm(y~x1,data=dat)
anova(fm2,fm1)
par(mfrow=c(2,2)) #diagnostic plots, essential to check
plot(fm1)
par(mfrow=c(1,1))
#note: the order of the variables matters for the proportion of variance explained (anova)!
fm2b <- lm(y ~ x2 + x1, data = dat)
anova(fm2)
anova(fm2b)
```

Generic functions for extracting model information: the value of lm() is a fitted model object; technically, a list of results of class "lm". Information about the fitted model can then be displayed, extracted, plotted and so on by using generic functions that orient themselves to objects of class "lm". These include:

add1, alias, anova, coef, deviance, drop1, effects, family, formula, kappa, labels, plot, predict, print, proj, residuals, step, summary, vcov

A brief description of the most commonly used ones is given below.

~~~
anova(object_1, object_2)
Compare a submodel with an outer model and produce an analysis of variance table.

summary(object)
Print a comprehensive summary of the results of the regression analysis.

coef(object)
Extract the regression coefficients (matrix). Long form: coefficients(object).
plot(object)
Produce four plots, showing residuals, fitted values and some diagnostics.

predict(object, newdata=data.frame)
The data frame supplied must have variables specified with the same labels as the original. The value is a vector or matrix of predicted values corresponding to the determining variable values in data.frame.

residuals(object)
Extract the (matrix of) residuals, weighted as appropriate. Short form: resid(object).

step(object)
Select a suitable model by adding or dropping terms and preserving hierarchies. The model with the smallest value of AIC (Akaike's An Information Criterion) discovered in the stepwise search is returned.
~~~

***

### Generalized linear models

~~~
glm(formula, family=gaussian)

family can be one of:
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
~~~

***

### Mixed models

For models with a mixture of fixed and random effects, use package lme4:

```{r,eval=FALSE}
library(lme4)
?lmer
```

***

### Multivariate tests

`manova(cbind(y1,y2,y3)~A)`

`prcomp()` for PCA, but it is better to use package ade4:

```{r,eval=FALSE}
library(ade4)
?dudi.pca
```

***

## Questions

Answer these questions using the file class2005.txt at URL http://www2.unil.ch/popgen/teaching/R11/

1. What is the average size of the class? For males? For females?
2. Are the sizes normally distributed?
3. Give a confidence interval for the average size.
4. What is the correlation between weight and size?
5. Can one predict the size of somebody based on his shoe size?
6. Are eye and hair colors independent?
7. Are smokers heavier than non-smokers?
8. Does the answer to 7 depend on sex?
9. Does size differ according to hair color?
10. Does the weight of boys and girls differ? Does this depend on their respective sizes?
11. Does the smoking status depend on sex?
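As a starting point for the categorical questions (6 and 11), the `table` / `chisq.test` combination described earlier can be sketched as follows. The eye and hair vectors here are simulated stand-ins, to be replaced by the corresponding columns of the class data set:

```{r}
set.seed(1)
#simulated stand-ins for the real columns of the class data set
eye<-sample(c("blue","brown","green"),100,replace=TRUE)
hair<-sample(c("blond","brown","black"),100,replace=TRUE)
tab<-table(eye,hair) #cross-tabulation of the two vectors
tab
chisq.test(tab) #test of independence of rows and columns
#with small expected counts, fisher.test(tab) is the safer choice
```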
***

\newpage

# Practical 7: Packages

Part of the strength of **R** is its almost infinite number of available packages. However, this profusion has grown without much order, and it is often difficult to find what one is looking for. A notable exception is the *bioconductor* suite of packages (http://www.bioconductor.org), which

> provides tools for the analysis and comprehension of high-throughput genomic data.

Other specific areas have been loosely organized into task views, a list of which can be found at http://cran.r-project.org/web/views/

To install a package, use `install.packages()`. For instance, to install `ape`, type:

```{r,eval=FALSE}
install.packages("ape")
```

You might find it useful to install packages in a specific place rather than the default, either because you are working on a machine where you don't have administrator rights, or because you don't want to have to update everything each time you install a new version of **R**. One way is to create a .Renviron file in your home directory, where you specify `R_LIBS=/your_path_to_Rlibs`

To load a package, use `library(package)`, e.g.:

```{r,eval=FALSE}
library(ape)
?plot.phylo
example(plot.phylo)
```
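If you prefer to control the library location from within R rather than through .Renviron, `install.packages()` and `library()` both accept an explicit library path (the path `~/Rlibs` below is just an example, to be adapted to your own setup):

```{r,eval=FALSE}
.libPaths() #where R currently looks for packages
install.packages("ape", lib = "~/Rlibs") #install into a personal library (example path)
library(ape, lib.loc = "~/Rlibs") #load it from that same location
```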