Students should
assertr
package for analysis QALet’s look at an example dataset (one I’ve made) and go over some ways we might QA/QC this dataset.
It is an arbitrary temperature study with six sites. Within each site, there are 6 plots and we took 10 samples of temperature (degrees C) per plot.
fundata <- read.csv("fundata.csv", stringsAsFactors = FALSE)
head(fundata) # Look at the data
## siteplot date temp_c
## 1 A.1 03-21-2017 25.31
## 2 B.2 04-21-2017 33.63
## 3 C.3 05-21-2017 27.09
## 4 D.4 06-21-2017 33.20
## 5 E.5 07-21-2017 27.55
## 6 F.6 08-21-2017 23.64
str(fundata)
## 'data.frame': 355 obs. of 3 variables:
## $ siteplot: chr "A.1" "B.2" "C.3" "D.4" ...
## $ date : chr "03-21-2017" "04-21-2017" "05-21-2017" "06-21-2017" ...
## $ temp_c : num 25.3 33.6 27.1 33.2 27.6 ...
summary(fundata)
## siteplot date temp_c
## Length:355 Length:355 Min. :-9999.00
## Class :character Class :character 1st Qu.: 26.41
## Mode :character Mode :character Median : 30.27
## Mean : -110.80
## 3rd Qu.: 33.41
## Max. : 180.00
What’s our assessment of this dataset from the above commands?
library(tidyr)
lettersdf <- data.frame(letters = paste(LETTERS, rev(LETTERS), sep = "."))
lettersdf
## letters
## 1 A.Z
## 2 B.Y
## 3 C.X
## 4 D.W
## 5 E.V
## 6 F.U
## 7 G.T
## 8 H.S
## 9 I.R
## 10 J.Q
## 11 K.P
## 12 L.O
## 13 M.N
## 14 N.M
## 15 O.L
## 16 P.K
## 17 Q.J
## 18 R.I
## 19 S.H
## 20 T.G
## 21 U.F
## 22 V.E
## 23 W.D
## 24 X.C
## 25 Y.B
## 26 Z.A
lettersdf %>%
separate(letters, c("letter_one", "letter_two"), sep = "\\.")
## letter_one letter_two
## 1 A Z
## 2 B Y
## 3 C X
## 4 D W
## 5 E V
## 6 F U
## 7 G T
## 8 H S
## 9 I R
## 10 J Q
## 11 K P
## 12 L O
## 13 M N
## 14 N M
## 15 O L
## 16 P K
## 17 Q J
## 18 R I
## 19 S H
## 20 T G
## 21 U F
## 22 V E
## 23 W D
## 24 X C
## 25 Y B
## 26 Z A
Exercise: Split the siteplot
column into two columns called site
and plot
:
# Your code here
When you have dates in R, it’s usually best to convert them to a Date
object. This is mainly so plotting functions work correctly but it helps elsewhere too.
The function we’ll use is as.Date
:
datestrings <- c("2000-08-03", "2017-02-20", "1980-04-27")
mydate <- as.Date(datestrings, format = "%Y-%m-%d")
mydate
## [1] "2000-08-03" "2017-02-20" "1980-04-27"
class(mydate)
## [1] "Date"
** Exercise: Convert the date
column in fundata
to from a character vector to a Date vector with as.Date()
:
# Your code here
# e.g fundata$date <-
Let’s start by looking at the site
column for potential issues.
The table
function is a great way tally the occurrence of each of the unique values in a vector:
table(c(5, 5, 2, 2, 3, 3))
##
## 2 3 5
## 2 2 2
table(c("A", "A", "B", "C")) # Also good for character vectors
##
## A B C
## 2 1 1
fish <- c("gag grouper", "striped bass", "red drum", "gag grouper")
table(fish)
## fish
## gag grouper red drum striped bass
## 2 1 1
table(fundata$site)
##
## A.1 B.2 C.3 D.4 E.5 F.6
## 58 59 60 59 59 60
From the summary of the dataset, we would expect 60s across the board there. But we don’t see that!
We can see we’re missing some observations from A, C, and E. Depending on our needs, we may need to go back to our field notes to find out what happened.
Look with summary
:
summary(fundata)
## siteplot date temp_c
## Length:355 Length:355 Min. :-9999.00
## Class :character Class :character 1st Qu.: 26.41
## Mode :character Mode :character Median : 30.27
## Mean : -110.80
## 3rd Qu.: 33.41
## Max. : 180.00
Look directly with range
:
range(fundata$temp_c)
## [1] -9999 180
Exercise: Can we use an exploratory plot to check this assumption too?
Plotting your data is always a good idea and we can often find new insights or issues this way. For univariate data, a box plot is a great way to look at the distribution of values:
boxplot(rnorm(100, 50, 5))
Use boxplot
to make a boxplot of the temperature values:
# Your code here
Before we move on, let’s fix the -9999 and 180 observations by removing them:
Exercise: Remove the rows with temperatures of -9999 and 180:
# Your code here
any(is.na(fundata$temp_c))
## [1] FALSE
Exercise: Remove the row with the NA
as a value for temp_c
“NA” is its own type in R.
is.na(2)
## [1] FALSE
is.na(NA)
## [1] TRUE
is.na(c(1, NA, 3))
## [1] FALSE TRUE FALSE
fish <- c("gag grouper", NA, "red drum", NA)
fish
## [1] "gag grouper" NA "red drum" NA
# Filter NAs in a character vector
fish[!is.na(fish)]
## [1] "gag grouper" "red drum"
# Remember we can subset the rows in a data.frame like:
fundata[c(1, 2, 3),] # First, second, third rows
## siteplot date temp_c
## 1 A.1 03-21-2017 25.31
## 2 B.2 04-21-2017 33.63
## 3 C.3 05-21-2017 27.09
#or
fundata[which(fundata$site == "A"),] # Just the rows with site == "A"
## [1] siteplot date temp_c
## <0 rows> (or 0-length row.names)
# Write an expression using `is.na` to subset fundata and save the result
# Your code here:
# e.g. fundata$site <-
Since have a hierarchical design, we might want to check that all the data we expect are present. If we know we have 6 sites with 6 plots at each site and ten samples, we can check this assumption a few ways:
nrow(fundata) == 6 * 6 * 10
## [1] FALSE
Exercise: Use dplyr
with group_by
and summarize
to find which site
or plot
has the wrong number of observations
# Put your solution here
The functions unique
and duplicated
great ways to find duplicates in a vector of values. For example, we can create a vector with a duplicate value in it:
some_letters <- c("A", "F", "X", "S", "X")
some_letters
## [1] "A" "F" "X" "S" "X"
length(unique(some_letters)) == length(some_letters)
## [1] FALSE
duplicated(some_letters)
## [1] FALSE FALSE FALSE FALSE TRUE
some_letters[!duplicated(some_letters)]
## [1] "A" "F" "X" "S"
See how nrow
and unique
can tell us if there are duplicates and duplicated
can tell us which values are duplicates?
Exercise: Remove the duplicate value with duplicated from the following data.frame:
r mydf <- data.frame(fish = c("Redfish", "Gag Grouper", "Striped Bass", "Redfish"), size = runif(4, 50, 100), stringsAsFactors = FALSE) # Your code here
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Import
fundata <- read.csv("fundata.csv", stringsAsFactors = FALSE)
# Munge
fundata <- fundata %>%
separate(siteplot, c("site", "plot"), sep = "\\.")
fundata$date <- as.Date(fundata$date, "%m-%d-%Y")
# Check
any(is.na(fundata$site))
## [1] FALSE
any(fundata$temp_c < 0)
## [1] TRUE
any(fundata$temp_c > 100)
## [1] TRUE
# Fix temp_c column:
fundata <- fundata[which(!is.na(fundata$temp_c)),] # Remove the NA
fundata <- fundata[fundata$temp_c != -9999,]
fundata <- fundata[fundata$temp_c != 180,]
# Analyze
fundata %>%
group_by(site) %>%
summarise(meantemp = mean(temp_c))
## # A tibble: 6 x 2
## site meantemp
## <chr> <dbl>
## 1 A 29.75293
## 2 B 31.06679
## 3 C 30.00283
## 4 D 29.67271
## 5 E 28.72138
## 6 F 31.04517
tidyr
packageWe already saw how useful separate
can be. tidyr
provides two other functions I use all the time: gather
and spread
.
gather
gather
takes our data that is wide (multiple observations on each row) and puts it into a tall form where each row is an observation.
# Make some 'wide' data first
fishdf <- data.frame(fish = c("Redfish", "Gag Grouper", "Striped Bass"),
weight = runif(3, 2, 10),
age = runif(3, 5, 80))
fishdf
## fish weight age
## 1 Redfish 7.236070 6.36241
## 2 Gag Grouper 2.857147 30.35654
## 3 Striped Bass 8.342752 58.12205
# Gather it
fishdf_g <- gather(fishdf, variable, value, -fish)
fishdf_g
## fish variable value
## 1 Redfish weight 7.236070
## 2 Gag Grouper weight 2.857147
## 3 Striped Bass weight 8.342752
## 4 Redfish age 6.362410
## 5 Gag Grouper age 30.356543
## 6 Striped Bass age 58.122046
Why gather? Analysis often needs us to reshape the data in this way:
library(dplyr)
fishdf_g %>%
group_by(variable) %>%
summarize(meanvalue = mean(value))
## # A tibble: 2 x 2
## variable meanvalue
## <chr> <dbl>
## 1 age 31.613666
## 2 weight 6.145323
Also, ggplot2 needs this form:
library(ggplot2)
ggplot(fishdf_g, aes(variable, value)) +
geom_boxplot()
spread
spread
does the opposite of gather
:
fishdf_g %>% spread(variable, value)
## fish age weight
## 1 Gag Grouper 30.35654 2.857147
## 2 Redfish 6.36241 7.236070
## 3 Striped Bass 58.12205 8.342752
Why spread? Usually modeling requires this “wide” format:
lm(weight ~ age, data = fishdf)
##
## Call:
## lm(formula = weight ~ age, data = fishdf)
##
## Coefficients:
## (Intercept) age
## 5.32450 0.02596
So with the combination of gather
and spread
, we can easily switch between analysis and modeling.
assertr
package approachThe assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.
The basic idea is that we should check qualities of our dataset prior to analysis and that we can actually make the analysis not run if certain assertions are not met.
library(assertr)
mtcars %>% verify(TRUE) %>% nrow(.)
mtcars %>% verify(FALSE) %>% nrow(.)
# Only plot mpg ~ wt if the columns exist and all mpg are > 0
mtcars %>%
verify(has_all_names("mpg", "wt")) %>%
verify(mpg > 0) %>%
ggplot(aes(wt, mpg)) + geom_point()
Let’s walk through the introduction taken from the README:
Let’s work with the mtcars
dataset. We don’t know who created it and, in order to do an analysis with it, we might want to check a few assumptions:
This could be written (in order) using assertr like this:
library(dplyr)
library(assertr)
mtcars %>%
verify(has_all_names("mpg", "vs", "am", "wt")) %>%
verify(nrow(.) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
insist_rows(maha_dist, within_n_mads(10), everything()) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
## # A tibble: 3 x 2
## cyl avg.mpg
## <dbl> <dbl>
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
If you look closely, the last two lines in the above chunk are our analysis, but everything before it are assertions.
Exercise: Let’s do an assertr analysis pipeline with the starwars dataset
library(dplyr)
head(starwars)
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
## 1 Luke Skywalker 172 77 blond fair blue 19.0
## 2 C-3PO 167 75 <NA> gold yellow 112.0
## 3 R2-D2 96 32 <NA> white, blue red 33.0
## 4 Darth Vader 202 136 none white yellow 41.9
## 5 Leia Organa 150 49 brown light brown 19.0
## 6 Owen Lars 178 120 brown, grey light blue 52.0
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
What are some things we might want to analyze?
What are some things we might like to assert before we do that analysis?
tidyr
package provides a few very useful functions that do things not easily done in base Rassertr
package shows a new methodology where the assertions are built into the analysis pipelinestarwars
datasetExercise: Check if any droids have hair or gender
library(dplyr)
head(starwars)
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
## 1 Luke Skywalker 172 77 blond fair blue 19.0
## 2 C-3PO 167 75 <NA> gold yellow 112.0
## 3 R2-D2 96 32 <NA> white, blue red 33.0
## 4 Darth Vader 202 136 none white yellow 41.9
## 5 Leia Organa 150 49 brown light brown 19.0
## 6 Owen Lars 178 120 brown, grey light blue 52.0
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
# Your code here
Exercise: Make sure every character has been in at least one film
# Your code here
This chunk generates the fundata.csv file we used in this lesson
n <- 6
reps <- 10
fundata <- data.frame(site = rep(rep(c("A", "B", "C", "D", "E", "F"), n), reps),
plot = rep(rep(c(1:6), n), reps),
date = rep(c("03-21-2017", "04-21-2017", "05-21-2017", "06-21-2017", "07-21-2017", "08-21-2017"), reps),
temp_c = round(rnorm(n * n * reps, 30, 5), 2),
stringsAsFactors = FALSE)
# omit 5 rows at random
fundata <- fundata[-sample(1:nrow(fundata), 5, replace = FALSE),]
# replace 5 values with -999
fundata[sample(1:nrow(fundata), 5, replace = FALSE), "temp_c"] <- -9999
# replace 1 value with a way-too-high value
fundata[sample(1:nrow(fundata), 1), "temp_c"] <- 180
# mush site and plot together into one column
fundata$siteplot <- paste(fundata$site, fundata$plot, sep = ".")
# clean it up and save it
fundata <- fundata[,c("siteplot", "date", "temp_c")]
write.csv(fundata, row.names = FALSE, file = "fundata.csv")