Introduction

Leaning outcomes

Students should

Lesson

Let’s look at an example dataset (one I’ve made) and go over some ways we might QA/QC this dataset.

It is an arbitrary temperature study with six sites. Within each site, there are 6 plots and we took 10 samples of temperature (degrees C) per plot.

Step one: Import and prepare

fundata <- read.csv("fundata.csv", stringsAsFactors = FALSE)
head(fundata) # Look at the data
##   siteplot       date temp_c
## 1      A.1 03-21-2017  25.31
## 2      B.2 04-21-2017  33.63
## 3      C.3 05-21-2017  27.09
## 4      D.4 06-21-2017  33.20
## 5      E.5 07-21-2017  27.55
## 6      F.6 08-21-2017  23.64
str(fundata)
## 'data.frame':    355 obs. of  3 variables:
##  $ siteplot: chr  "A.1" "B.2" "C.3" "D.4" ...
##  $ date    : chr  "03-21-2017" "04-21-2017" "05-21-2017" "06-21-2017" ...
##  $ temp_c  : num  25.3 33.6 27.1 33.2 27.6 ...
summary(fundata)
##    siteplot             date               temp_c        
##  Length:355         Length:355         Min.   :-9999.00  
##  Class :character   Class :character   1st Qu.:   26.41  
##  Mode  :character   Mode  :character   Median :   30.27  
##                                        Mean   : -110.80  
##                                        3rd Qu.:   33.41  
##                                        Max.   :  180.00

What’s our assessment of this dataset from the above commands?

  • The site and plot codes are smushed together in one column
  • The dates look a bit funky. What format is that?
  • The range on the temperatures surely can’t be right

Split siteplot into two columns

library(tidyr)

lettersdf <- data.frame(letters = paste(LETTERS, rev(LETTERS), sep = "."))
lettersdf
##    letters
## 1      A.Z
## 2      B.Y
## 3      C.X
## 4      D.W
## 5      E.V
## 6      F.U
## 7      G.T
## 8      H.S
## 9      I.R
## 10     J.Q
## 11     K.P
## 12     L.O
## 13     M.N
## 14     N.M
## 15     O.L
## 16     P.K
## 17     Q.J
## 18     R.I
## 19     S.H
## 20     T.G
## 21     U.F
## 22     V.E
## 23     W.D
## 24     X.C
## 25     Y.B
## 26     Z.A
lettersdf %>%
  separate(letters, c("letter_one", "letter_two"), sep = "\\.")
##    letter_one letter_two
## 1           A          Z
## 2           B          Y
## 3           C          X
## 4           D          W
## 5           E          V
## 6           F          U
## 7           G          T
## 8           H          S
## 9           I          R
## 10          J          Q
## 11          K          P
## 12          L          O
## 13          M          N
## 14          N          M
## 15          O          L
## 16          P          K
## 17          Q          J
## 18          R          I
## 19          S          H
## 20          T          G
## 21          U          F
## 22          V          E
## 23          W          D
## 24          X          C
## 25          Y          B
## 26          Z          A

Exercise: Split the siteplot column into two columns called site and plot:

# Your code here

Convert the dates to real R dates

When you have dates in R, it’s usually best to convert them to a Date object. This is mainly so plotting functions work correctly but it helps elsewhere too.

The function we’ll use is as.Date:

datestrings <- c("2000-08-03", "2017-02-20", "1980-04-27")
mydate <- as.Date(datestrings, format = "%Y-%m-%d")
mydate
## [1] "2000-08-03" "2017-02-20" "1980-04-27"
class(mydate)
## [1] "Date"

** Exercise: Convert the date column in fundata to from a character vector to a Date vector with as.Date():

# Your code here
# e.g fundata$date <- 

Step two: Checking assumptions in the data

site & plot

Let’s start by looking at the site column for potential issues.

The table function is a great way tally the occurrence of each of the unique values in a vector:

table(c(5, 5, 2, 2, 3, 3))
## 
## 2 3 5 
## 2 2 2
table(c("A", "A", "B", "C")) # Also good for character vectors
## 
## A B C 
## 2 1 1
fish <- c("gag grouper", "striped bass", "red drum", "gag grouper")
table(fish)
## fish
##  gag grouper     red drum striped bass 
##            2            1            1
table(fundata$site)
## 
## A.1 B.2 C.3 D.4 E.5 F.6 
##  58  59  60  59  59  60

From the summary of the dataset, we would expect 60s across the board there. But we don’t see that!

We can see we’re missing some observations from A, C, and E. Depending on our needs, we may need to go back to our field notes to find out what happened.

temp_c

Look with summary:

summary(fundata)
##    siteplot             date               temp_c        
##  Length:355         Length:355         Min.   :-9999.00  
##  Class :character   Class :character   1st Qu.:   26.41  
##  Mode  :character   Mode  :character   Median :   30.27  
##                                        Mean   : -110.80  
##                                        3rd Qu.:   33.41  
##                                        Max.   :  180.00

Look directly with range:

range(fundata$temp_c)
## [1] -9999   180

Exercise: Can we use an exploratory plot to check this assumption too?

Plotting your data is always a good idea and we can often find new insights or issues this way. For univariate data, a box plot is a great way to look at the distribution of values:

boxplot(rnorm(100, 50, 5))

Use boxplot to make a boxplot of the temperature values:

# Your code here

Before we move on, let’s fix the -9999 and 180 observations by removing them:

Exercise: Remove the rows with temperatures of -9999 and 180:

# Your code here
any(is.na(fundata$temp_c))
## [1] FALSE

Exercise: Remove the row with the NA as a value for temp_c

“NA” is its own type in R.

is.na(2)
## [1] FALSE
is.na(NA)
## [1] TRUE
is.na(c(1, NA, 3))
## [1] FALSE  TRUE FALSE
fish <- c("gag grouper", NA, "red drum", NA)
fish
## [1] "gag grouper" NA            "red drum"    NA
# Filter NAs in a character vector
fish[!is.na(fish)]
## [1] "gag grouper" "red drum"
# Remember we can subset the rows in a data.frame like:
fundata[c(1, 2, 3),] # First, second, third rows
##   siteplot       date temp_c
## 1      A.1 03-21-2017  25.31
## 2      B.2 04-21-2017  33.63
## 3      C.3 05-21-2017  27.09
#or
fundata[which(fundata$site == "A"),] # Just the rows with site == "A"
## [1] siteplot date     temp_c  
## <0 rows> (or 0-length row.names)
# Write an expression using `is.na` to subset fundata and save the result
# Your code here:
# e.g. fundata$site <- 

Check for duplicate values

Since have a hierarchical design, we might want to check that all the data we expect are present. If we know we have 6 sites with 6 plots at each site and ten samples, we can check this assumption a few ways:

nrow(fundata) == 6 * 6 * 10
## [1] FALSE

Exercise: Use dplyr with group_by and summarize to find which site or plot has the wrong number of observations

# Put your solution here

The functions unique and duplicated great ways to find duplicates in a vector of values. For example, we can create a vector with a duplicate value in it:

some_letters <- c("A", "F", "X", "S", "X")
some_letters
## [1] "A" "F" "X" "S" "X"
length(unique(some_letters)) == length(some_letters)
## [1] FALSE
duplicated(some_letters)
## [1] FALSE FALSE FALSE FALSE  TRUE
some_letters[!duplicated(some_letters)]
## [1] "A" "F" "X" "S"

See how nrow and unique can tell us if there are duplicates and duplicated can tell us which values are duplicates?

Exercise: Remove the duplicate value with duplicated from the following data.frame:

r mydf <- data.frame(fish = c("Redfish", "Gag Grouper", "Striped Bass", "Redfish"), size = runif(4, 50, 100), stringsAsFactors = FALSE) # Your code here

Complete analysis script:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Import
fundata <- read.csv("fundata.csv", stringsAsFactors = FALSE)

# Munge
fundata <- fundata %>% 
  separate(siteplot, c("site", "plot"), sep = "\\.")
fundata$date <- as.Date(fundata$date, "%m-%d-%Y")

# Check
any(is.na(fundata$site))
## [1] FALSE
any(fundata$temp_c < 0)
## [1] TRUE
any(fundata$temp_c > 100)
## [1] TRUE
# Fix temp_c column:
fundata <- fundata[which(!is.na(fundata$temp_c)),] # Remove the NA
fundata <- fundata[fundata$temp_c != -9999,]
fundata <- fundata[fundata$temp_c != 180,]

# Analyze
fundata %>% 
  group_by(site) %>% 
  summarise(meantemp = mean(temp_c))
## # A tibble: 6 x 2
##    site meantemp
##   <chr>    <dbl>
## 1     A 29.75293
## 2     B 31.06679
## 3     C 30.00283
## 4     D 29.67271
## 5     E 28.72138
## 6     F 31.04517

more tidyr package

We already saw how useful separate can be. tidyr provides two other functions I use all the time: gather and spread.

gather

gather takes our data that is wide (multiple observations on each row) and puts it into a tall form where each row is an observation.

# Make some 'wide' data first
fishdf <- data.frame(fish = c("Redfish", "Gag Grouper", "Striped Bass"),
                     weight = runif(3, 2, 10),
                     age = runif(3, 5, 80))
fishdf
##           fish   weight      age
## 1      Redfish 7.236070  6.36241
## 2  Gag Grouper 2.857147 30.35654
## 3 Striped Bass 8.342752 58.12205
# Gather it
fishdf_g <- gather(fishdf, variable, value, -fish)
fishdf_g
##           fish variable     value
## 1      Redfish   weight  7.236070
## 2  Gag Grouper   weight  2.857147
## 3 Striped Bass   weight  8.342752
## 4      Redfish      age  6.362410
## 5  Gag Grouper      age 30.356543
## 6 Striped Bass      age 58.122046

Why gather? Analysis often needs us to reshape the data in this way:

library(dplyr)

fishdf_g %>% 
  group_by(variable) %>% 
  summarize(meanvalue = mean(value))
## # A tibble: 2 x 2
##   variable meanvalue
##      <chr>     <dbl>
## 1      age 31.613666
## 2   weight  6.145323

Also, ggplot2 needs this form:

library(ggplot2)
ggplot(fishdf_g, aes(variable, value)) + 
  geom_boxplot()

spread

spread does the opposite of gather:

fishdf_g %>% spread(variable, value)
##           fish      age   weight
## 1  Gag Grouper 30.35654 2.857147
## 2      Redfish  6.36241 7.236070
## 3 Striped Bass 58.12205 8.342752

Why spread? Usually modeling requires this “wide” format:

lm(weight ~ age, data = fishdf)
## 
## Call:
## lm(formula = weight ~ age, data = fishdf)
## 
## Coefficients:
## (Intercept)          age  
##     5.32450      0.02596

So with the combination of gather and spread, we can easily switch between analysis and modeling.

The assertr package approach

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

The basic idea is that we should check qualities of our dataset prior to analysis and that we can actually make the analysis not run if certain assertions are not met.

library(assertr)

mtcars %>% verify(TRUE) %>% nrow(.)
mtcars %>% verify(FALSE) %>% nrow(.)

# Only plot mpg ~ wt if the columns exist and all mpg are > 0
mtcars %>% 
  verify(has_all_names("mpg", "wt")) %>% 
  verify(mpg > 0) %>% 
  ggplot(aes(wt, mpg)) + geom_point()

Let’s walk through the introduction taken from the README:

Let’s work with the mtcars dataset. We don’t know who created it and, in order to do an analysis with it, we might want to check a few assumptions:

  • that it has the columns “mpg”, “vs”, and “am”
  • that the dataset contains more than 10 observations
  • that the column for ‘miles per gallon’ (mpg) is a positive number
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard - deviations from its mean, and
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • each row contains at most 2 NAs
  • each row is unique jointly between the “mpg”, “am”, and “wt” columns
  • each row’s mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

This could be written (in order) using assertr like this:

library(dplyr)
library(assertr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))
## # A tibble: 3 x 2
##     cyl  avg.mpg
##   <dbl>    <dbl>
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000

If you look closely, the last two lines in the above chunk are our analysis, but everything before it are assertions.

Exercise: Let’s do an assertr analysis pipeline with the starwars dataset

library(dplyr)
head(starwars)
## # A tibble: 6 x 13
##             name height  mass  hair_color  skin_color eye_color birth_year
##            <chr>  <int> <dbl>       <chr>       <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77       blond        fair      blue       19.0
## 2          C-3PO    167    75        <NA>        gold    yellow      112.0
## 3          R2-D2     96    32        <NA> white, blue       red       33.0
## 4    Darth Vader    202   136        none       white    yellow       41.9
## 5    Leia Organa    150    49       brown       light     brown       19.0
## 6      Owen Lars    178   120 brown, grey       light      blue       52.0
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

What are some things we might want to analyze?

  • Average height/mass by species/gender/homeworld?

What are some things we might like to assert before we do that analysis?

  • birth_year/height/mass is a positive real number
  • no Droids should have Gender (maybe)?

Summary

Extra things to do:

starwars dataset

Exercise: Check if any droids have hair or gender

library(dplyr)
head(starwars)
## # A tibble: 6 x 13
##             name height  mass  hair_color  skin_color eye_color birth_year
##            <chr>  <int> <dbl>       <chr>       <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77       blond        fair      blue       19.0
## 2          C-3PO    167    75        <NA>        gold    yellow      112.0
## 3          R2-D2     96    32        <NA> white, blue       red       33.0
## 4    Darth Vader    202   136        none       white    yellow       41.9
## 5    Leia Organa    150    49       brown       light     brown       19.0
## 6      Owen Lars    178   120 brown, grey       light      blue       52.0
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
# Your code here

Exercise: Make sure every character has been in at least one film

# Your code here

Misc

This chunk generates the fundata.csv file we used in this lesson

n <- 6
reps <- 10
fundata <- data.frame(site = rep(rep(c("A", "B", "C", "D", "E", "F"), n), reps),
plot = rep(rep(c(1:6), n), reps),
date = rep(c("03-21-2017", "04-21-2017", "05-21-2017", "06-21-2017", "07-21-2017", "08-21-2017"), reps),
temp_c = round(rnorm(n * n * reps, 30, 5), 2),
stringsAsFactors = FALSE)

# omit 5 rows at random
fundata <- fundata[-sample(1:nrow(fundata), 5, replace = FALSE),]

# replace 5 values with -999
fundata[sample(1:nrow(fundata), 5, replace = FALSE), "temp_c"] <- -9999

# replace 1 value with a way-too-high value
fundata[sample(1:nrow(fundata), 1), "temp_c"] <- 180

# mush site and plot together into one column
fundata$siteplot <- paste(fundata$site, fundata$plot, sep = ".")

# clean it up and save it
fundata <- fundata[,c("siteplot", "date", "temp_c")]
write.csv(fundata, row.names = FALSE, file = "fundata.csv")