Introduction

Let’s spend about an hour on a quick assessment and a survey-style refresher of basic R programming skills. If we have time, we’ll then move on to some challenges that test your understanding.

Learning Outcomes

Assessment

Everyone will arrive at this lesson with different experience with R. Skill with R doesn’t necessarily exist on a continuum and can instead be thought of as a set of tools. Each participant may start the workshop with different tools.

In order to get a better sense of what topics in R we should focus more heavily on, let’s do a quick assessment. The results will help us shape instruction so that we can ensure we’re meeting everyone’s needs.

Instructions:

Answer the following 5 questions to the best of your knowledge and write down your answers.

Question 1

Which of the following expressions assigns the number 2 to the variable x?

Choose one or more:

  • A. x == 2
  • B. x <- 2
  • C. x - 2
  • D. x = 2

Your answer:

Question 2

What does the following expression return?

paste("apple", "pie")
## [1] "apple pie"

Choose one:

  • A. “applepie”
  • B. “apple, pie”
  • C. “apple pie”
  • D. An error

Your answer:

Question 3

What does the following expression return?

max(abs(c(-5, 1, 5)))
## [1] 5

Choose one:

  • A. -5
  • B. 1
  • C. 5
  • D. An error

Your answer:

Question 4

If x and y are both data.frames defined by:

x <- data.frame(z = 1:2)
y <- data.frame(z = 3)

which of the following expressions would be a correct way to combine them into one data.frame that looks like this:

z
-
1
2
3

(i.e. one column with the numbers 1, 2, and 3 in it)

Choose one or more:

  • A. join(x, y)
  • B. c(x, y)
  • C. rbind(x, y)
  • D. x + y

Your answer:

Question 5

Given the following data.frame,

x <- data.frame(y = 1:10)

Which expression(s) return a data.frame containing only the rows where y is greater than 5 (i.e. 6 through 10)?

Choose one or more:

  • A. x[x$y > 5,]
  • B. x$y > 5
  • C. x[which(x$y > 6),]
  • D. x[y > 5,]
  • E. subset(x, y > 5)

Your answer:

R overview

Based on previous R training: https://github.nceas.ucsb.edu/Training/R-intro. The instructor will go over these topics live with the class.

The assignment operator, <-

One of the things we’ll do all the time is save a value to a variable. Here, we save the word “apple” to a variable called fruit:

fruit <- "apple"
fruit
## [1] "apple"

Notice the last line with just fruit on it. Typing a variable’s name by itself prints its value to the Console.

R has a flexible syntax when it comes to whitespace. The following two lines of code do exactly the same thing as the assignment above.

fruit<-"apple"
fruit    <-     "apple"

R as a calculator: + - * / > >= %% %/% etc

2+2
## [1] 4
2 * 3
## [1] 6
2 ^ 3
## [1] 8
5/2
## [1] 2.5
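
The heading above also lists %% and %/%, which the examples so far don’t use. As a quick sketch, here is what those two operators do (remainder and integer division):

```r
# Remainder after division (modulo)
5 %% 2
## [1] 1
# Integer division: how many whole times 2 goes into 5
5 %/% 2
## [1] 2
# The two results recombine to give the original number
(5 %/% 2) * 2 + (5 %% 2)
## [1] 5
```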

Comparison:

2 == 1
## [1] FALSE
2 == 2
## [1] TRUE
3 > 2
## [1] TRUE
2 < 3 # Same as above
## [1] TRUE
"apple" == "apple"
## [1] TRUE
"apple" == "pear"
## [1] FALSE
"pear" == "apple" # Order doesn't matter for ==
## [1] FALSE

Types of variables

Vectors

When we run a line of code like this:

x <- 2

We’re assigning 2 to a variable x. x is a variable but it is also a “numeric vector” of length 1.

class(x)
## [1] "numeric"
length(x)
## [1] 1

Above, we ran two functions, class and length, on our variable x. Running functions is a very common thing you’ll do in R. Every function call has a name followed by a pair of parentheses, (), usually with one or more arguments inside.

We can make a numeric vector that is longer like so:

x <- c(1, 2, 3) # Use the `c` function to put things together

Notice we can also re-define a variable at a later point just like we did above.

class(x)
## [1] "numeric"
length(x)
## [1] 3

R can store much more than just numbers though. Let’s start with strings of characters, which we’ve already seen:

fruit <- "apple"
class(fruit)
## [1] "character"
length(fruit)
## [1] 1

Depending on your background, you may be surprised that the result of running length(fruit) is 1 because “apple” is five characters long.

It turns out that fruit is a character vector of length one, just like our numeric vector from before. To find out the number of characters in “apple”, we have to use another function:

nchar(fruit)
## [1] 5
nchar("apple")
## [1] 5

Let’s make a character vector of more than length one and take a look at how it works:

fruits <- c("apple", "banana", "strawberry")
length(fruits)
## [1] 3
nchar(fruits)
## [1]  5  6 10
fruits[1]
## [1] "apple"

Smushing character vectors together can be done with paste:

paste("key", "lime", "pie")
## [1] "key lime pie"
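
By default paste puts a space between the pieces. A short sketch of the most common variations follows: the sep and collapse arguments, and the related paste0 function, all of which are part of base R.

```r
# The sep argument changes the separator (the default is a space)
paste("key", "lime", "pie", sep = "-")
## [1] "key-lime-pie"
# paste0() uses no separator at all
paste0("apple", "pie")
## [1] "applepie"
# collapse joins the elements of one vector into a single string
paste(c("apple", "banana"), collapse = ", ")
## [1] "apple, banana"
```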

Lists

Vectors and lists look similar in R sometimes but they have very different uses:

c(1, "apple", 3)
## [1] "1"     "apple" "3"
list(1, "apple", 3)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "apple"
## 
## [[3]]
## [1] 3
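
The output above hints at the key difference: c converted every element to a character string, while list kept each element as it was. A small sketch that checks this directly (the variable names v and l are just for illustration):

```r
v <- c(1, "apple", 3)     # a vector: every element must share one type
l <- list(1, "apple", 3)  # a list: each element keeps its own type
class(v)
## [1] "character"
class(l[[1]])             # [[ ]] pulls one element out of a list
## [1] "numeric"
class(l[[2]])
## [1] "character"
```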

data.frames

Most of the time when doing analysis in R you will be working with data.frames. data.frames are tabular, with column headings and rows of data, just like a CSV file.

We create new data.frames with a relevantly-named function:

mydata <- data.frame(site = c("A", "B", "C"),
                     temp = c(20, 30, 40))
mydata
##   site temp
## 1    A   20
## 2    B   30
## 3    C   40

Or we can read in a CSV from the file system and turn it into a data.frame in order to work with it in R:

mydata <- read.csv("data.csv")
mydata
##        type     name
## 1     fruit    apple
## 2 vegetable eggplant
## 3     fruit   orange
## 4 vegetable     beet
## 5     fruit   cherry
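
The file data.csv isn’t included with this lesson, so here is a sketch that writes an equivalent file to a temporary location (csv_path is a made-up name for illustration) and reads it back. One assumption worth flagging: the output in this lesson shows factor columns, which was the default behavior of read.csv before R 4.0.0. In newer versions the text columns come back as plain character vectors unless we ask for factors.

```r
# Write a small CSV to a temporary file (a stand-in for data.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("type,name",
             "fruit,apple",
             "vegetable,eggplant",
             "fruit,orange",
             "vegetable,beet",
             "fruit,cherry"),
           csv_path)

# stringsAsFactors = TRUE was the default before R 4.0.0; setting it
# explicitly reproduces the factor columns shown in this lesson
mydata <- read.csv(csv_path, stringsAsFactors = TRUE)
class(mydata$type)
## [1] "factor"
levels(mydata$type)
## [1] "fruit"     "vegetable"
```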

We can find out how many rows of data mydata has in it:

nrow(mydata)
## [1] 5

We can return just one of the columns:

mydata$type
## [1] fruit     vegetable fruit     vegetable fruit    
## Levels: fruit vegetable
unique(mydata$type)
## [1] fruit     vegetable
## Levels: fruit vegetable

sort

If we want to sort mydata, we use the order function (in kind of a weird way):

mydata[order(mydata$type),]
##        type     name
## 1     fruit    apple
## 3     fruit   orange
## 5     fruit   cherry
## 2 vegetable eggplant
## 4 vegetable     beet

Let’s break the above command down a bit. We can access the rows, columns, and individual cells of a data.frame with a new syntax element, [ and ]:

mydata[1,] # First row
##    type  name
## 1 fruit apple
mydata[,1] # First column
## [1] fruit     vegetable fruit     vegetable fruit    
## Levels: fruit vegetable
mydata[1,1] # First row, first column
## [1] fruit
## Levels: fruit vegetable
mydata[c(1,5),] # First and fifth rows
##    type   name
## 1 fruit  apple
## 5 fruit cherry
mydata$type # Column named 'type'
## [1] fruit     vegetable fruit     vegetable fruit    
## Levels: fruit vegetable

So what does that order function do?

?order # How to get help in R!
order(c(1, 2, 3))
## [1] 1 2 3
order(c(3, 2, 1))
## [1] 3 2 1
order(mydata$type)
## [1] 1 3 5 2 4

So order(mydata$type) returns the row numbers of mydata in the order that sorts the type column. Passing those row numbers inside [ ] is what rearranges the rows.
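
The examples with 1, 2, 3 can be a little misleading because the values and their positions happen to be the same numbers. Here is a small sketch with a made-up vector, temps, where they differ:

```r
temps <- c(30, 10, 20)
# order() returns positions: the smallest value is in position 2,
# the next smallest in position 3, and the largest in position 1
order(temps)
## [1] 2 3 1
# Indexing with those positions gives the sorted values
temps[order(temps)]
## [1] 10 20 30
# decreasing = TRUE reverses the direction
temps[order(temps, decreasing = TRUE)]
## [1] 30 20 10
```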

We can also return just certain rows, based upon criteria:

mydata[mydata$type == "fruit",]
##    type   name
## 1 fruit  apple
## 3 fruit orange
## 5 fruit cherry
mydata$type == "fruit"
## [1]  TRUE FALSE  TRUE FALSE  TRUE

In this case, instead of indexing the rows by number, we’re using TRUEs and FALSEs.
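
The same idea works on a plain vector, which can make it easier to see what is going on. A minimal sketch, using made-up names temps and is_warm:

```r
temps <- c(20, 30, 40)
is_warm <- temps > 25   # one TRUE or FALSE per element
is_warm
## [1] FALSE  TRUE  TRUE
temps[is_warm]          # keeps only the positions that are TRUE
## [1] 30 40
sum(is_warm)            # TRUE counts as 1, so this counts the matches
## [1] 2
```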

Exercise: Subset mydata to the vegetables instead of the fruit

# Your code here

Another handy way to subset a data.frame is with the subset function:

subset(mydata, type == "fruit") # Equivalent to mydata[mydata$type == "fruit",]
##    type   name
## 1 fruit  apple
## 3 fruit orange
## 5 fruit cherry

There are a lot of useful functions to help us work with data.frames:

str(mydata)
## 'data.frame':    5 obs. of  2 variables:
##  $ type: Factor w/ 2 levels "fruit","vegetable": 1 2 1 2 1
##  $ name: Factor w/ 5 levels "apple","beet",..: 1 4 5 2 3
summary(mydata)
##         type         name  
##  fruit    :3   apple   :1  
##  vegetable:2   beet    :1  
##                cherry  :1  
##                eggplant:1  
##                orange  :1

Our data.frames won’t always be as small as this example. Let’s look at a larger one:

library(ggplot2)
data("diamonds")
diamonds
## # A tibble: 53,940 x 10
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows

Exercise: How many rows does diamonds have in it? How many columns?

We can look at the first few rows with head, just like on the command line:

head(diamonds)
## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

or the last few:

tail(diamonds)
## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.72   Premium     D     SI1  62.7    59  2757  5.69  5.73  3.58
## 2  0.72     Ideal     D     SI1  60.8    57  2757  5.75  5.76  3.50
## 3  0.72      Good     D     SI1  63.1    55  2757  5.69  5.75  3.61
## 4  0.70 Very Good     D     SI1  62.8    60  2757  5.66  5.68  3.56
## 5  0.86   Premium     H     SI2  61.0    58  2757  6.15  6.12  3.74
## 6  0.75     Ideal     D     SI2  62.2    55  2757  5.83  5.87  3.64
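
Both head and tail show six rows by default, and both take an n argument if we want more or fewer. This sketch uses mtcars, a dataset that ships with R, instead of diamonds so that it runs without ggplot2 installed:

```r
# mtcars is built into R, so no extra packages are needed
head(mtcars, n = 3)        # first three rows
tail(mtcars, n = 2)        # last two rows
nrow(head(mtcars, n = 3))  # confirm how many rows came back
## [1] 3
```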

So far, this has probably been a bit boring. Let’s do something interesting and also something that R is very good at: Plotting and modeling!

Let’s plot the relationship between diamond price and carat:

plot(price ~ carat, data = diamonds)

The syntax above, price ~ carat, is a formula of the form response ~ predictor, or y ~ x.

We can also fit a linear model to the same relationship:

mod <- lm(price ~ carat, data = diamonds)

Above, we saved our linear model to a variable named mod so we can use it later. We can look at the result of model fitting with summary.

summary(mod)
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
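
The printed summary is meant for reading, but we can also pull individual numbers out of a model. The following is a sketch using the built-in mtcars dataset (so it runs without ggplot2); fit is a made-up variable name, and the same functions should work on mod.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # same y ~ x form as price ~ carat
coef(fit)                 # named vector with "(Intercept)" and "wt"
summary(fit)$r.squared    # Multiple R-squared as a single number
# Predicted mpg for a hypothetical car with wt = 3
predict(fit, newdata = data.frame(wt = 3))
```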

And we can also plot the line of best fit on the scatterplot:

plot(price ~ carat, data = diamonds)
abline(mod$coefficients[[1]], mod$coefficients[[2]], col = "red", lwd = 5)
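
As a possible shortcut, abline also accepts a fitted lm object directly and reads the intercept and slope from it. A sketch with the built-in mtcars dataset (fit is a made-up name):

```r
fit <- lm(mpg ~ wt, data = mtcars)
plot(mpg ~ wt, data = mtcars)
# Passing the model itself replaces the two coefficient arguments
abline(fit, col = "red", lwd = 2)
```

With the diamonds model from above, the equivalent call would be abline(mod, col = "red", lwd = 5).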