Let’s take about an hour to do a quick assessment and touch up on basic R programming skills in a survey style. If we have time, we’ll then move on to some challenges that should test your understanding.
Everyone will arrive at this lesson with different experiences with R. Skill with R doesn’t necessarily exist on a continuum and can instead be thought of as a set of tools. Each participant may start the workshop with different tools.
In order to get a better sense of what topics in R we should focus more heavily on, let’s do a quick assessment. The results will help us shape instruction so that we can ensure we’re meeting everyone’s needs.
Instructions:
Answer the following 5 questions to the best of your knowledge and write down your answers.
Which of the following expressions assigns the number 2 to the variable x?
Choose one or more:
x == 2
x <- 2
x - 2
x = 2
Your answer:
What does the following expression return?
paste("apple", "pie")
## [1] "apple pie"
Choose one:
Your answer:
What does the following expression return?
max(abs(c(-5, 1, 5)))
## [1] 5
Choose one:
Your answer:
If x and y are both data.frames defined by:
x <- data.frame(z = 1:2)
y <- data.frame(z = 3)
which of the following expressions would be a correct way to combine them into one data.frame that looks like this:
z
-
1
2
3
(i.e. one column with the numbers 1, 2, and 3 in it)
Choose one or more:
join(x, y)
c(x, y)
rbind(x, y)
x + y
Your answer:
Given the following data.frame,
x <- data.frame(y = 1:10)
Which expression(s) return a data.frame with rows where y is greater than 5 (i.e. 6 - 10)?
Choose one or more:
x[x$y > 5,]
x$y > 5
x[which(x$y > 6),]
x[y > 5,]
subset(x, y > 5)
Based on previous R training: https://github.nceas.ucsb.edu/Training/R-intro. Instructor will go over these live with the classroom, running
<-
One of the things we’ll do all the time is save some value to a variable. Here, we save the word “apple” to a variable called fruit
fruit <- "apple"
fruit
## [1] "apple"
Notice the last line with just fruit on it. Typing just the variable name prints its value to the Console.
R has a flexible syntax. The following two lines of code are equivalent to the one above.
fruit<-"apple"
fruit <- "apple"
+ - * / > >= %% %/% etc
2+2
## [1] 4
2 * 3
## [1] 6
2 ^ 3
## [1] 8
5/2
## [1] 2.5
Comparison:
2 == 1
## [1] FALSE
2 == 2
## [1] TRUE
3 > 2
## [1] TRUE
2 < 3 # Same as above
## [1] TRUE
"apple" == "apple"
## [1] TRUE
"apple" == "pair"
## [1] FALSE
"pair" == "apple" # Order doesn't matter for ==
## [1] FALSE
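The operator list above also mentions %% and %/%, which the examples do not exercise. Here is a minimal sketch of what they do; the variable names are made up for illustration and are not part of the lesson.

```r
# %% gives the remainder after division (modulo)
remainder <- 5 %% 2
remainder
## [1] 1

# %/% gives integer division: how many whole times 2 fits into 5
quotient <- 5 %/% 2
quotient
## [1] 2

# >= means "greater than or equal to"
at_least <- 3 >= 3
at_least
## [1] TRUE
```

Together the two results rebuild the original number: quotient * 2 + remainder is 5.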
When we run a line of code like this:
x <- 2
We’re assigning 2 to a variable x. x is a variable but it is also a “numeric vector” of length 1.
class(x)
## [1] "numeric"
length(x)
## [1] 1
Above, we ran two functions, class and length, on our variable x. Running functions is a very common thing you’ll do in R. Every function has a name, followed by a pair of parentheses (), usually with one or more arguments inside.
We can make a numeric vector that is longer like so:
x <- c(1, 2, 3) # Use the `c` function to put things together
Notice we can also re-define a variable at a later point just like we did above.
class(x)
## [1] "numeric"
length(x)
## [1] 3
R can store much more than just numbers though. Let’s start with strings of characters, which we’ve already seen:
fruit <- "apple"
class(fruit)
## [1] "character"
length(fruit)
## [1] 1
Depending on your background, you may be surprised that the result of running length(fruit) is 1 because “apple” is five characters long.
It turns out that fruit is a character vector of length one, just like our numeric vector from before. To find out the number of characters in “apple”, we have to use another function:
nchar(fruit)
## [1] 5
nchar("apple")
## [1] 5
Let’s make a character vector of more than length one and take a look at how it works:
fruits <- c("apple", "banana", "strawberry")
length(fruits)
## [1] 3
nchar(fruits)
## [1] 5 6 10
fruits[1]
## [1] "apple"
Smushing character vectors together can be done with paste:
paste("key", "lime", "pie")
## [1] "key lime pie"
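paste puts a space between the pieces by default. As a small sketch of its other common uses, the sep and collapse arguments and the paste0 shortcut are shown below; the variable names are only for illustration.

```r
# sep controls what goes between the pieces
dashed <- paste("key", "lime", "pie", sep = "-")
dashed
## [1] "key-lime-pie"

# paste0 is shorthand for paste with no separator
glued <- paste0("key", "lime")
glued
## [1] "keylime"

# collapse squashes a longer vector down into a single string
fruits <- c("apple", "banana", "strawberry")
one_string <- paste(fruits, collapse = ", ")
one_string
## [1] "apple, banana, strawberry"
```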
Vectors and lists look similar in R sometimes but they have very different uses:
c(1, "apple", 3)
## [1] "1" "apple" "3"
list(1, "apple", 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] "apple"
##
## [[3]]
## [1] 3
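To make the difference above concrete: c converts every element to one common type, while list keeps each element as it was. A short sketch, with v and l as throwaway names:

```r
v <- c(1, "apple", 3)     # everything is coerced to character
l <- list(1, "apple", 3)  # each element keeps its own type

class(v)
## [1] "character"
class(l)
## [1] "list"

# Single brackets on a vector return a shorter vector
v[1]
## [1] "1"

# Double brackets on a list pull out the element itself
class(l[[1]])
## [1] "numeric"
l[[1]] + 1
## [1] 2
```

Trying v[1] + 1 would fail, because "1" is text in the vector, not a number.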
Most of the time when doing analysis in R you will be working with data.frames. data.frames are tabular, with column headings and rows of data, just like a CSV file.
We create new data.frames with an aptly named function:
mydata <- data.frame(site = c("A", "B", "C"),
temp = c(20, 30, 40))
mydata
## site temp
## 1 A 20
## 2 B 30
## 3 C 40
Or we can read in a CSV from the file system and turn it into a data.frame in order to work with it in R:
mydata <- read.csv("data.csv")
mydata
## type name
## 1 fruit apple
## 2 vegetable eggplant
## 3 fruit orange
## 4 vegetable beet
## 5 fruit cherry
We can find out how many rows of data mydata has in it:
nrow(mydata)
## [1] 5
We can return just one of the columns:
mydata$type
## [1] fruit vegetable fruit vegetable fruit
## Levels: fruit vegetable
unique(mydata$type)
## [1] fruit vegetable
## Levels: fruit vegetable
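The Levels: lines in the output above mean the type column was read in as a factor. That was the default before R 4.0.0. From R 4.0.0 on, data.frame() and read.csv() default to stringsAsFactors = FALSE and give character columns instead. The sketch below builds the same table in memory, since data.csv is not available here, and sets the argument explicitly. The names mydata_chr and mydata_fct are made up for illustration.

```r
# Character columns (the default in R >= 4.0.0)
mydata_chr <- data.frame(
  type = c("fruit", "vegetable", "fruit", "vegetable", "fruit"),
  name = c("apple", "eggplant", "orange", "beet", "cherry"),
  stringsAsFactors = FALSE
)
class(mydata_chr$type)
## [1] "character"

# Factor columns (the default before R 4.0.0)
mydata_fct <- data.frame(
  type = c("fruit", "vegetable", "fruit", "vegetable", "fruit"),
  name = c("apple", "eggplant", "orange", "beet", "cherry"),
  stringsAsFactors = TRUE
)
class(mydata_fct$type)
## [1] "factor"
levels(mydata_fct$type)
## [1] "fruit"     "vegetable"

# A character column can also be converted afterwards
type_factor <- factor(mydata_chr$type)
```

read.csv accepts the same stringsAsFactors argument, so passing stringsAsFactors = TRUE there should reproduce the factor output shown in this lesson on a recent R.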
sort
If we want to sort mydata, we use the order function (in kind of a weird way):
mydata[order(mydata$type),]
## type name
## 1 fruit apple
## 3 fruit orange
## 5 fruit cherry
## 2 vegetable eggplant
## 4 vegetable beet
Let’s break the above command down a bit. We can access the individual cells of a data.frame with a new syntax element: [ and ]:
mydata[1,] # First row
## type name
## 1 fruit apple
mydata[,1] # First column
## [1] fruit vegetable fruit vegetable fruit
## Levels: fruit vegetable
mydata[1,1] # First row, first column
## [1] fruit
## Levels: fruit vegetable
mydata[c(1,5),] # First and fifth row
## type name
## 1 fruit apple
## 5 fruit cherry
mydata$type # Column named 'type'
## [1] fruit vegetable fruit vegetable fruit
## Levels: fruit vegetable
So what does that order function do?
?order # How to get help in R!
order(c(1, 2, 3))
## [1] 1 2 3
order(c(3, 2, 1))
## [1] 3 2 1
order(mydata$type)
## [1] 1 3 5 2 4
So order(mydata$type) is returning the rows of mydata, by row number, in sorted order.
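The examples above use inputs that are already sorted or exactly reversed, which can make it hard to see what order returns. Here is a small sketch with a shuffled vector; scores is a made-up name, not part of the lesson's data.

```r
scores <- c(30, 10, 20)

# order() returns positions: the smallest value is at position 2,
# the next smallest at position 3, and the largest at position 1
ascending <- order(scores)
ascending
## [1] 2 3 1

# Indexing with those positions gives the sorted values
scores[ascending]
## [1] 10 20 30

# For a plain vector, sort() does both steps at once
sort(scores)
## [1] 10 20 30

# decreasing = TRUE reverses the direction
descending <- order(scores, decreasing = TRUE)
descending
## [1] 1 3 2
```

sort only works on a single vector, which is why sorting a whole data.frame goes through order and row indexing instead.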
We can also return just certain rows, based upon criteria:
mydata[mydata$type == "fruit",]
## type name
## 1 fruit apple
## 3 fruit orange
## 5 fruit cherry
mydata$type == "fruit"
## [1] TRUE FALSE TRUE FALSE TRUE
In this case, instead of indexing the rows by number, we’re using TRUEs and FALSEs.
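A vector of TRUEs and FALSEs can be used in a few other handy ways. The sketch below uses a standalone character vector, types, as a stand-in for mydata$type; the names are assumptions made for illustration.

```r
types <- c("fruit", "vegetable", "fruit", "vegetable", "fruit")
is_fruit <- types == "fruit"

# which() converts TRUEs into their positions
fruit_positions <- which(is_fruit)
fruit_positions
## [1] 1 3 5

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_fruit <- sum(is_fruit)
n_fruit
## [1] 3

# Indexing with the logical vector keeps only the TRUE positions
types[is_fruit]
## [1] "fruit" "fruit" "fruit"
```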
Exercise: Subset mydata to the vegetables instead of the fruit
# Your code here
Another handy way to subset a data.frame is with the subset function:
subset(mydata, type == "fruit") # Equivalent to mydata[mydata$type == "fruit",]
## type name
## 1 fruit apple
## 3 fruit orange
## 5 fruit cherry
There are a lot of useful functions to help us work with data.frames:
str(mydata)
## 'data.frame': 5 obs. of 2 variables:
## $ type: Factor w/ 2 levels "fruit","vegetable": 1 2 1 2 1
## $ name: Factor w/ 5 levels "apple","beet",..: 1 4 5 2 3
summary(mydata)
## type name
## fruit :3 apple :1
## vegetable:2 beet :1
## cherry :1
## eggplant:1
## orange :1
Our data.frames won’t always be as small as this example. Let’s look at a larger one:
library(ggplot2)
data("diamonds")
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows
Exercise: How many rows does diamonds have in it? How many columns?
We can look at the first few rows with head, just like on the command line:
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
or the last few:
tail(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.72 Premium D SI1 62.7 59 2757 5.69 5.73 3.58
## 2 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 3 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 4 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 5 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 6 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
So far, this has probably been a bit boring. Let’s do something interesting and also something that R is very good at: Plotting and modeling!
Let’s plot the relationship between diamond price and carat:
plot(price ~ carat, data = diamonds)
The above syntax, price ~ carat, uses a response ~ predictor form, or y ~ x.
We can also fit a linear model to the same relationship:
mod <- lm(price ~ carat, data = diamonds)
Above, we saved our linear model to a variable named mod so we can use it later. We can look at the result of model fitting with summary.
summary(mod)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
And we can also plot the line of best fit on the scatterplot:
plot(price ~ carat, data = diamonds)
abline(mod$coefficients[[1]], mod$coefficients[[2]], col = "red", lwd = 5)
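Once a model is saved to a variable, it can be reused for more than plotting. The sketch below swaps in R's built-in cars dataset (stopping distance versus speed) so that it runs without ggplot2. The names mod_cars, new_speeds and predicted are made up for illustration, and the same calls should apply to mod from above.

```r
# Fit stopping distance as a function of speed
mod_cars <- lm(dist ~ speed, data = cars)

# coef() is the accessor for the fitted intercept and slope
fitted_coefs <- coef(mod_cars)
names(fitted_coefs)
## [1] "(Intercept)" "speed"

# predict() applies the fitted line to new values of the predictor.
# The column name in newdata must match the predictor name in the formula.
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(mod_cars, newdata = new_speeds)
predicted

# abline() also accepts the fitted model directly, e.g.
#   plot(dist ~ speed, data = cars)
#   abline(reg = mod_cars, col = "red", lwd = 2)
```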
sample / runif / rnorm
table
for loops
while loops
sapply
lapply
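The topics listed above are not worked through in this text, so here is one short, hedged sketch of each. All variable names are made up, and set.seed is only there to make the random draws repeatable.

```r
set.seed(42)  # make the random numbers below reproducible

# sample / runif / rnorm: three ways to generate random values
rolls    <- sample(1:6, size = 10, replace = TRUE)  # ten dice rolls
uniforms <- runif(5)                                # five values between 0 and 1
normals  <- rnorm(5, mean = 0, sd = 1)              # five standard normal draws

# table: count how often each value occurs
counts <- table(c("fruit", "vegetable", "fruit"))
counts[["fruit"]]
## [1] 2

# for loop: repeat once per element of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}
total
## [1] 15

# while loop: repeat for as long as a condition holds
value <- 100
n_halvings <- 0
while (value > 1) {
  value <- value / 2
  n_halvings <- n_halvings + 1
}
n_halvings
## [1] 7

# lapply applies a function to each element and returns a list;
# sapply does the same but simplifies the result to a vector when it can
squares_list   <- lapply(1:3, function(i) i^2)
squares_vector <- sapply(1:3, function(i) i^2)
squares_vector
## [1] 1 4 9
```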
An excellent way to really learn a programming language is to call out what the result of running some expression will be before you run it. Afterwards, you can compare your expectation with what actually happened.
Here are code chunks with a series of expressions. Try to predict what the final expression does before running the entire chunk, and add a note if you got one wrong.
x <- 2
x ^ 2
## [1] 4
x <- 1; y <- 2; x + y;
## [1] 3
x <- "hello"
y <- "world"
paste(x, y)
## [1] "hello world"
x <- list(1, 2, 3)
y <- list(4, 5, 6)
z <- c(x, y)
length(z)
## [1] 6
x <- data.frame(x = 1:6)
y <- data.frame(x = 1:7)
z <- rbind(x, y)
nrow(z)
## [1] 13
x <- NA
if (is.na(x)) {
print("foo")
} else {
print("bar")
}
## [1] "foo"
numbers <- seq(1, 10)
for (number in numbers) {
if (number %% 2){
print(number)
}
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
x <- 10
while (x >= 0) {
print(x)
if (x == 5) {
break
}
x <- x - 1
}
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
x <- c(1, "2", 3)
class(x)
## [1] "character"
x <- list(1, 2, 3)
lapply(x, cumsum)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
x <- data.frame(letter = LETTERS)
class(x[1,])
## [1] "factor"
x <- data.frame(letter = LETTERS)
class(x[1, 1, drop = FALSE])
## [1] "data.frame"
x <- data.frame(x = 1)
y <- data.frame(x = 2)
z <- rbind(x[1,1], y[1,1])
class(z)
## [1] "matrix"
rep(TRUE, 5) & rep(FALSE, 5)
## [1] FALSE FALSE FALSE FALSE FALSE
rep(TRUE, 5) && rep(FALSE, 5)
## [1] FALSE
x <- c(1, 2, 3)
y <- c("A", "B", "C")
z <- c(x, y)
class(z)
## [1] "character"
x <- c(1, 2, NA, 4, NA)
length(is.na(x))
## [1] 5
x <- c(1, NA, 3)
y <- c(NA, 2, NA)
all(is.na(x + y))
## [1] TRUE
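A few of the outputs above depend on the R version. From R 4.0.0, data.frame() no longer converts strings to factors by default and matrices report the class "matrix" "array". From R 4.3.0, && raises an error when either side has a length greater than one. The sketch below shows ways to write those chunks so they should behave the same on old and new versions; the names a and b are made up for illustration.

```r
# Ask for factors explicitly instead of relying on the default
x <- data.frame(letter = LETTERS, stringsAsFactors = TRUE)
class(x[1, ])
## [1] "factor"

# Test for a matrix with is.matrix() rather than comparing class() to "matrix"
z <- rbind(1, 2)
is.matrix(z)
## [1] TRUE

# & compares element by element; && is meant for single TRUE/FALSE values
a <- rep(TRUE, 5)
b <- rep(FALSE, 5)
elementwise <- a & b
elementwise
## [1] FALSE FALSE FALSE FALSE FALSE

# Pick one element on each side before using &&
first_only <- a[1] && b[1]
first_only
## [1] FALSE

# Or reduce the element-wise result with any() or all()
any(a & b)
## [1] FALSE
```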
By the end of this lesson, you should feel touched up on your general R skills, and you should also have seen some of the trickier parts of R. Hopefully having seen them will help later on down the road.
Other good resources: