## Introduction

Let’s spend about an hour on a quick assessment and a survey-style refresher of basic R programming skills. If we have time, we’ll then move on to some challenges that should test your understanding.

## Learning Outcomes

• Assess where everyone is with R so we can shape the curriculum
• Refresh general R programming skills
• Test R skills against some example problems

## Assessment

Everyone will arrive at this lesson with different experiences with R. Skill with R doesn’t necessarily exist on a continuum and can instead be thought of as a set of tools. Each participant may start the workshop with different tools.

In order to get a better sense of what topics in R we should focus more heavily on, let’s do a quick assessment. The results will help us shape instruction so that we can ensure we’re meeting everyone’s needs.

Instructions:

Answer the following 5 questions to the best of your knowledge and write down your answers.

### Question 1

Which of the following expressions assigns the number 2 to the variable x?

Choose one or more:

• A. `x == 2`
• B. `x <- 2`
• C. `x - 2`
• D. `x = 2`

Your answer:

### Question 2

What does the following expression return?

``paste("apple", "pie")``
``## [1] "apple pie"``

Choose one:

• A. “applepie”
• B. “apple, pie”
• C. “apple pie”
• D. An error

Your answer:

### Question 3

What does the following expression return?

``max(abs(c(-5, 1, 5)))``
``## [1] 5``

Choose one:

• A. -5
• B. 1
• C. 5
• D. An error

Your answer:

### Question 4

If x and y are both data.frames defined by:

``````x <- data.frame(z = 1:2)
y <- data.frame(z = 3)``````

which of the following expressions would be a correct way to combine them into one data.frame that looks like this:

``````z
-
1
2
3``````

(i.e. one column with the numbers 1, 2, and 3 in it)

Choose one or more:

• A. `join(x, y)`
• B. `c(x, y)`
• C. `rbind(x, y)`
• D. `x + y`

Your answer:

### Question 5

Given the following data.frame,

``x <- data.frame(y = 1:10)``

which expression(s) return a `data.frame` containing only the rows where y is greater than 5 (i.e. 6 through 10)?

Choose one or more:

• A. `x[x$y > 5,]`
• B. `x$y > 5`
• C. `x[which(x$y > 6),]`
• D. `x[y > 5,]`
• E. `subset(x, y > 5)`

## R overview

Based on previous R training: https://github.nceas.ucsb.edu/Training/R-intro. Instructor will go over these live with the classroom, running

• Basic syntax
• Variables & assignment

### The assignment operator, `<-`

One of the things we’ll do all the time is save some value to a variable. Here, we save the word “apple” to a variable called `fruit`:

``````fruit <- "apple"
fruit``````
``## [1] "apple"``

Notice the last line with just `fruit` on it. Typing just a variable name prints its value to the Console.

R has a flexible syntax. The following two lines of code are equivalent to the assignment above.

``````fruit<-"apple"
fruit    <-     "apple"``````

### R as a calculator: `+ - * / > >= %% %/%` etc

``2+2``
``## [1] 4``
``2 * 3``
``## [1] 6``
``2 ^ 3``
``## [1] 8``
``5/2``
``## [1] 2.5``

Comparison:

``2 == 1``
``## [1] FALSE``
``2 == 2``
``## [1] TRUE``
``3 > 2``
``## [1] TRUE``
``2 < 3 # Same as above``
``## [1] TRUE``
``"apple" == "apple"``
``## [1] TRUE``
``"apple" == "pear"``
``## [1] FALSE``
``"pear" == "apple" # Order doesn't matter for ==``
``## [1] FALSE``
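
The heading above also lists `%%` and `%/%`, which the examples so far don’t exercise. A minimal sketch of how they behave (the variable names are chosen here purely for illustration):

```r
# %% gives the remainder after division (modulo)
remainder <- 7 %% 3    # 1

# %/% gives integer division (the quotient with the remainder dropped)
quotient <- 7 %/% 3    # 2

# >= and != are comparisons, just like == and >
at_least <- 3 >= 3     # TRUE
different <- 3 != 2    # TRUE
```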

### Types of variables

#### Vectors

When we run a line of code like this:

``x <- 2``

We’re assigning 2 to a variable `x`. `x` is a variable but it is also a “numeric vector” of length 1.

``class(x)``
``## [1] "numeric"``
``length(x)``
``## [1] 1``

Above, we ran two functions, `class` and `length`, on our variable `x`. Running functions is a very common thing you’ll do in R. Every function has a name, followed by a pair of `()` with something inside.
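
As a small illustration of that name-plus-parentheses pattern, here is a sketch using `round`, a base R function this lesson doesn’t otherwise cover. Arguments can be passed by position or by name, and one function call can sit inside another:

```r
# Call `round` with a value (by position) and a named argument
rounded <- round(3.14159, digits = 2)
rounded

# Calls can be nested: the inner call runs first, then `class` sees its result
result_class <- class(round(3.14159, digits = 2))
result_class
```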

We can make a numeric vector that is longer like so:

``x <- c(1, 2, 3) # Use the `c` function to put things together``

Notice we can also re-define a variable at a later point just like we did above.

``class(x)``
``## [1] "numeric"``
``length(x)``
``## [1] 3``

R can store much more than just numbers though. Let’s start with strings of characters, which we’ve already seen:

``````fruit <- "apple"
class(fruit)``````
``## [1] "character"``
``length(fruit)``
``## [1] 1``

Depending on your background, you may be surprised that the result of running `length(fruit)` is 1 because “apple” is five characters long.

It turns out that `fruit` is a character vector of length one, just like our numeric vector from before. To find out the number of characters in “apple”, we have to use another function:

``nchar(fruit)``
``## [1] 5``
``nchar("apple")``
``## [1] 5``

Let’s make a character vector of more than length one and take a look at how it works:

``````fruits <- c("apple", "banana", "strawberry")
length(fruits)``````
``## [1] 3``
``nchar(fruits)``
``## [1]  5  6 10``
``fruits[1]``
``## [1] "apple"``

Smushing character vectors together can be done with `paste`:

``paste("key", "lime", "pie")``
``## [1] "key lime pie"``
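
By default `paste` puts a single space between its arguments. The sketch below shows two standard arguments, `sep` and `collapse`, plus the `paste0` shortcut; the variable names are just for illustration:

```r
# Change the separator placed between the pieces
dashed <- paste("key", "lime", "pie", sep = "-")   # "key-lime-pie"

# paste0 is paste with no separator at all
glued <- paste0("key", "lime")                     # "keylime"

# collapse joins the elements of one vector into a single string
fruits <- c("apple", "banana", "strawberry")
listed <- paste(fruits, collapse = ", ")           # "apple, banana, strawberry"
```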

#### Lists

Vectors and lists sometimes look similar in R, but they behave differently: a vector converts all of its elements to a single type, while a list keeps each element’s own type:

``c(1, "apple", 3)``
``## [1] "1"     "apple" "3"``
``list(1, "apple", 3)``
``````## [[1]]
## [1] 1
##
## [[2]]
## [1] "apple"
##
## [[3]]
## [1] 3``````
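
To make the difference concrete, here is a short sketch that checks the type of what each structure holds. The names `coerced` and `mixed` are illustrative only:

```r
# A vector holds a single type, so the numbers are converted to character
coerced <- c(1, "apple", 3)
class(coerced)        # "character"

# A list keeps each element's own type
mixed <- list(1, "apple", 3)
class(mixed[[1]])     # "numeric"
class(mixed[[2]])     # "character"

# Double brackets pull one element out of the list
second <- mixed[[2]]
second
```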

#### data.frames

Most of the time when doing analysis in R you will be working with `data.frames`. `data.frames` are tabular, with column headings and rows of data, just like a CSV file.

We create new `data.frames` with the appropriately named `data.frame` function:

``````mydata <- data.frame(site = c("A", "B", "C"),
                     temp = c(20, 30, 40))
mydata``````
``````##   site temp
## 1    A   20
## 2    B   30
## 3    C   40``````

Or we can read in a CSV from the file system and turn it into a `data.frame` in order to work with it in R:

``````mydata <- read.csv("data.csv")
mydata``````
``````##        type     name
## 1     fruit    apple
## 2 vegetable eggplant
## 3     fruit   orange
## 4 vegetable     beet
## 5     fruit   cherry``````
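
The `data.csv` file itself isn’t included here, so the sketch below recreates it in a temporary file from the rows printed above (the file contents are reconstructed from that output, and `csv_path` is a name chosen for illustration). One caveat: since R 4.0.0, `read.csv` leaves text columns as character vectors by default, so passing `stringsAsFactors = TRUE` should be what reproduces the factor-based output shown in this lesson.

```r
# Write a small CSV to a temporary file (contents reconstructed from the output above)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("type,name",
             "fruit,apple",
             "vegetable,eggplant",
             "fruit,orange",
             "vegetable,beet",
             "fruit,cherry"),
           csv_path)

# Read it back in; stringsAsFactors = TRUE turns text columns into factors
mydata <- read.csv(csv_path, stringsAsFactors = TRUE)
mydata
```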

We can find out how many rows of data `mydata` has in it:

``nrow(mydata)``
``## [1] 5``

We can return just one of the columns:

``mydata$type``
``````## [1] fruit     vegetable fruit     vegetable fruit
## Levels: fruit vegetable``````
``unique(mydata$type)``
``````## [1] fruit     vegetable
## Levels: fruit vegetable``````


If we want to sort `mydata`, we use the `order` function (in kind of a weird way):

``mydata[order(mydata$type),]``
``````##        type     name
## 1     fruit    apple
## 3     fruit   orange
## 5     fruit   cherry
## 2 vegetable eggplant
## 4 vegetable     beet``````

Let’s break the above command down a bit. We can access the individual cells of a `data.frame` with a new syntax element: `[` and `]`:

``mydata[1,] # First row``
``````##    type  name
## 1 fruit apple``````
``mydata[,1] # First column``
``````## [1] fruit     vegetable fruit     vegetable fruit
## Levels: fruit vegetable``````
``mydata[1,1] # First row, first column``
``````## [1] fruit
## Levels: fruit vegetable``````
``mydata[c(1,5),] # First and fifth rows``
``````##    type   name
## 1 fruit  apple
## 5 fruit cherry``````
``mydata$type # Column named 'type'``
``````## [1] fruit     vegetable fruit     vegetable fruit
## Levels: fruit vegetable``````
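
Square brackets also accept column names and negative numbers. The following sketch uses a small made-up `data.frame` called `sites` (not one of the lesson’s datasets) so it can be run on its own:

```r
# A small, made-up data.frame for illustration
sites <- data.frame(site = c("A", "B", "C"),
                    temp = c(20, 30, 40))

temp_col <- sites[, "temp"]     # Column selected by name: 20 30 40
one_cell <- sites[2, "temp"]    # Second row of the temp column: 30
without_first <- sites[-1, ]    # A negative index drops that row

without_first
```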

So what does that `order` function do?

``````?order # How to get help in R!
order(c(1, 2, 3))``````
``## [1] 1 2 3``
``order(c(3, 2, 1))``
``## [1] 3 2 1``
``order(mydata$type)``
``## [1] 1 3 5 2 4``

So `order(mydata$type)` returns the row numbers of `mydata` in the order that sorts them by `type`.
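
A sketch with a plain numeric vector may make this clearer. The vector `temps` is made up for illustration; `decreasing` is a standard argument of `order`:

```r
temps <- c(30, 10, 20)

# order() returns positions, not values: the smallest value sits at position 2
positions <- order(temps)                 # 2 3 1

# Indexing with those positions puts the values in sorted order
sorted <- temps[positions]                # 10 20 30

# decreasing = TRUE reverses the direction
positions_desc <- order(temps, decreasing = TRUE)   # 1 3 2
```

Indexing a vector by `order()` gives the same result as calling `sort()` on it. The advantage of `order()` is that the same positions can be used to rearrange every column of a `data.frame` at once.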

We can also return just certain rows, based upon criteria:

``mydata[mydata$type == "fruit",]``
``````##    type   name
## 1 fruit  apple
## 3 fruit orange
## 5 fruit cherry``````
``mydata$type == "fruit"``
``## [1]  TRUE FALSE  TRUE FALSE  TRUE``

In this case, instead of indexing the rows by number, we’re using TRUEs and FALSEs.
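
The TRUE/FALSE vector can be built in a separate step, and several conditions can be combined with `&` (and) or `|` (or). A minimal sketch on a made-up `data.frame` called `sites`:

```r
# A small, made-up data.frame for illustration
sites <- data.frame(site = c("A", "B", "C"),
                    temp = c(20, 30, 40))

# Step 1: build the logical vector, one TRUE/FALSE per row
is_warm <- sites$temp > 25      # FALSE TRUE TRUE

# Step 2: use it to keep only the rows marked TRUE
warm_sites <- sites[is_warm, ]

# Conditions can be combined: warm, but not site C
warm_not_c <- sites[sites$temp > 25 & sites$site != "C", ]
```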

Exercise: Subset `mydata` to the vegetables instead of the fruit

``# Your code here``

Another handy way to subset a `data.frame` is with the `subset` function:

``subset(mydata, type == "fruit") # Equivalent to mydata[mydata$type == "fruit",]``
``````##    type   name
## 1 fruit  apple
## 3 fruit orange
## 5 fruit cherry``````
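
`subset` can also pick columns through its `select` argument. A short sketch, again on a made-up `data.frame` named `sites`:

```r
# A small, made-up data.frame for illustration
sites <- data.frame(site = c("A", "B", "C"),
                    temp = c(20, 30, 40))

# Keep rows where temp is above 25, and only the site column
warm_names <- subset(sites, temp > 25, select = site)
warm_names
```

The result is still a `data.frame`, even though it has a single column.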

There are a lot of useful functions to help us work with `data.frame`s:

``str(mydata)``
``````## 'data.frame':    5 obs. of  2 variables:
##  $ type: Factor w/ 2 levels "fruit","vegetable": 1 2 1 2 1
##  $ name: Factor w/ 5 levels "apple","beet",..: 1 4 5 2 3``````
``summary(mydata)``
``````##         type         name
##  fruit    :3   apple   :1
##  vegetable:2   beet    :1
##                cherry  :1
##                eggplant:1
##                orange  :1``````

Our `data.frame`s won’t always be as small as this example. Let’s look at a larger one:

``````library(ggplot2)
data("diamonds")
diamonds``````
``````## # A tibble: 53,940 x 10
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows``````

Exercise: How many rows does diamonds have in it? How many columns?

We can look at the first few rows with `head`, just like on the command line:

``head(diamonds)``
``````## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48``````

or the last few:

``tail(diamonds)``
``````## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.72   Premium     D     SI1  62.7    59  2757  5.69  5.73  3.58
## 2  0.72     Ideal     D     SI1  60.8    57  2757  5.75  5.76  3.50
## 3  0.72      Good     D     SI1  63.1    55  2757  5.69  5.75  3.61
## 4  0.70 Very Good     D     SI1  62.8    60  2757  5.66  5.68  3.56
## 5  0.86   Premium     H     SI2  61.0    58  2757  6.15  6.12  3.74
## 6  0.75     Ideal     D     SI2  62.2    55  2757  5.83  5.87  3.64``````

So far, this has probably been a bit boring. Let’s do something interesting and also something that R is very good at: Plotting and modeling!

Let’s plot the relationship between diamond price and carat:

``plot(price ~ carat, data = diamonds)``

The above syntax, `price ~ carat`, uses a `response ~ predictor` form, or `y ~ x`.

We can also fit a linear model to the same relationship:

``mod <- lm(price ~ carat, data = diamonds)``

Above, we saved our linear model to a variable named `mod` so we can use it later. We can look at the result of model fitting with `summary`.

``summary(mod)``
``````##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18585.3   -804.8    -18.9    537.4  12731.7
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16``````
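
Besides `summary`, a fitted model can be queried directly. The sketch below uses R’s built-in `cars` dataset rather than `diamonds`, so it runs without loading ggplot2; `fit`, `coefs`, and `predicted` are names chosen for illustration:

```r
# Fit stopping distance as a function of speed using the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# coef() returns the intercept and slope as a named vector
coefs <- coef(fit)
coefs

# predict() applies the fitted line to new predictor values
predicted <- predict(fit, newdata = data.frame(speed = 10))
predicted
```

The prediction is simply the intercept plus the slope times the new `speed` value, which is the same line `summary` reports in its coefficient table.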

And we can also plot the line of best fit on the scatterplot:

``````plot(price ~ carat, data = diamonds)
abline(mod$coefficients[1], mod$coefficients[2], col = "red", lwd = 5) # Intercept, then slope``````