8 Hands On: Clean and Integrate Datasets
8.1 Learning Objectives
In this lesson, you will:
- Clean and integrate two datasets using dplyr and tidyr
- Make use of previously-learned knowledge of dplyr and tidyr
8.2 Outline
In this one hour block, you will load data from the following two datasets into R,
and then clean, and integrate them together to answer a research question:
Are Sockeye salmon escapement goals being met in recent years in Bristol Bay?
Depending on your familiarity with dplyr and tidyr, you will probably want to look up how to do things. I suggest two strategies:
- Look back on the Data Cleaning and Manipulation lesson
- Use the official dplyr documentation
- Once you know what function to use, use R’s built-in help by prepending a
?
to the function name and running that (e.g., run?select
to get help on theselect
function)
8.3 High-level steps
The goal here is for you to have to come up with the functions to do the analysis with minimal guidance. This is supposed to be hard. Below is a set of high-level steps you can follow to answer our research question. After the list is a schematic of the steps in table form which I expect will be useful in guiding your code.
Note: This need not be the exaxct order your code is written in.
- Load our two datasets
- Load the escapement goals CSV into R as a
data.frame
- Visit https://knb.ecoinformatics.org/#data and search for “escapement goals” and choose the 2007-2015 dataset
- Click the following dataset:
Andrew Munro and Eric Volk. 2017. Summary of Pacific Salmon Escapement Goals in Alaska with a Review of Escapements from 2007 to 2015. Knowledge Network for Biocomplexity.
- Right-click and copy address for the file
MandV2016
- Load the escapement counts CSV into R as a
data.frame
- Visit https://knb.ecoinformatics.org/#data and search for ‘oceanak’
- Click the following dataset:
Alaska Department of Fish and Game. 2017. Daily salmon escapement counts from the OceanAK database, Alaska, 1921-2017. Knowledge Network for Biocomplexity.
- Right-click and copy address for the file
ADFG_firstAttempt_reformatted.csv
- Load the escapement goals CSV into R as a
- Clean
- Clean the escapement goals dataset
- Filter to just the Bristol Bay region and the Sockeye salmon species
- Check whether the column types are wrong and fix any issues (Hint: One column has the wrong type)
- Clean the escapement counts dataset
- Filter to just the Bristol Bay region and the Sockeye salmon species
- Filter to just stocks we have escapement goals for
- Create new columns for the year, month, and day so we can calculate total escapements by year and stock
- Calculate annual total escapements for each stock
- Clean the escapement goals dataset
- Integrate
- Join the escapement goal lower and upper bounds onto the annual total escapement counts (Hint: We don’t need all the columns)
- Analyze
- Make a table listing annual total escapements and whether they were in the escapement goal range or not
- Calculate the proportion of years, for each stock, total escapement was within the escapement goal range
8.3.1 Visual schematic of steps
Make this:
System Lower Upper
Kvichak River 2000000 10000000
Naknek River 800000 2000000
Egegik River 800000 2000000
Ugashik River 500000 1400000
Wood River 700000 1800000
Igushik River 150000 400000
Nushagak River 260000 760000
Nushagak River 370000 900000
Togiak River 120000 270000
and then make this:
Location Year Escapement
Egegik River 2012 1233900
Egegik River 2013 1113630
Egegik River 2014 1382466
Egegik River 2015 2160792
Egegik River 2016 1837260
Igushik River 2012 193326
Igushik River 2013 387744
Igushik River 2014 340590
Igushik River 2015 651172
Igushik River 2016 469230
and join them together to make this:
Location Year Escapement Lower Upper is_in_range
Egegik River 2012 1233900 800000 2000000 TRUE
Egegik River 2013 1113630 800000 2000000 TRUE
Egegik River 2014 1382466 800000 2000000 TRUE
Egegik River 2015 2160792 800000 2000000 FALSE
Egegik River 2016 1837260 800000 2000000 TRUE
Igushik River 2012 193326 150000 400000 TRUE
Igushik River 2013 387744 150000 400000 TRUE
Igushik River 2014 340590 150000 400000 TRUE
Igushik River 2015 651172 150000 400000 FALSE
Igushik River 2016 469230 150000 400000 FALSE
8.4 Full solution
Warning: Spoilers!
First we’ll load our packages:
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(DT) # Just for display purposes
})
Then download our two data files and save them as data.frame
s:
# https://knb.ecoinformatics.org/#view/urn:uuid:8809a404-f6e1-46a2-91c8-f094c3814b47
# Search "OceanAK"
esc <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/knb.92020.1", method = "libcurl"),
stringsAsFactors = FALSE)
# https://knb.ecoinformatics.org/#view/urn:uuid:2c13bb02-6148-4ab6-8ad5-3a7c03f8642e
# Search "escapement goals", choose 2007-2015 dataset
goals <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/knb.92014.1", method = "libcurl"),
stringsAsFactors = FALSE)
First, we’ll clean up the escapement goals data.frame
to have just the rows and columns we need and display it:
bb_sockeye_goals <- goals %>%
filter(Region == "Bristol Bay", Species == "Sockeye") %>%
mutate(Lower = as.integer(Lower)) %>%
select(System, Lower, Upper) %>%
drop_na()
## Warning in evalq(as.integer(Lower), <environment>): NAs introduced by
## coercion
datatable(bb_sockeye_goals)
Then we’ll clean up and summarize the escapement counts data.frame
, join the escapement goals data.frame
onto it, and calculate whether goals have been met:
bb_sockeye_escapements <- esc %>%
filter(SASAP.Region == "Bristol Bay",
Species == "Sockeye",
Location %in% bb_sockeye_goals$System) %>%
separate(sampleDate, c("Year", "Month", "Day"), sep = "-") %>%
group_by(Location, Year) %>%
summarize(Escapement = sum(DailyCount)) %>%
left_join(bb_sockeye_goals %>% select(System, Lower, Upper), by = c("Location" = "System")) %>%
mutate(is_in_range = Escapement >= Lower & Escapement <= Upper)
datatable(bb_sockeye_escapements)