8 Hands On: Clean and Integrate Datasets

8.1 Learning Objectives

In this lesson, you will:

  • Clean and integrate two datasets using dplyr and tidyr
  • Make use of previously-learned knowledge of dplyr and tidyr

8.2 Outline

In this one hour block, you will load data from the following two datasets into R,

and then clean, and integrate them together to answer a research question:

Are Sockeye salmon escapement goals being met in recent years in Bristol Bay?

Depending on your familiarity with dplyr and tidyr, you will probably want to look up how to do things. I suggest two strategies:

  1. Look back on the Data Cleaning and Manipulation lesson
  2. Use the official dplyr documentation
  3. Once you know what function to use, use R’s built-in help by prepending a ? to the function name and running that (e.g., run ?select to get help on the select function)

8.3 High-level steps

The goal here is for you to have to come up with the functions to do the analysis with minimal guidance. This is supposed to be hard. Below is a set of high-level steps you can follow to answer our research question. After the list is a schematic of the steps in table form which I expect will be useful in guiding your code.

Note: This need not be the exaxct order your code is written in.

  1. Load our two datasets
    • Load the escapement goals CSV into R as a data.frame

      Andrew Munro and Eric Volk. 2017. Summary of Pacific Salmon Escapement Goals in Alaska with a Review of Escapements from 2007 to 2015. Knowledge Network for Biocomplexity.

      • Right-click and copy address for the file MandV2016
    • Load the escapement counts CSV into R as a data.frame

      Alaska Department of Fish and Game. 2017. Daily salmon escapement counts from the OceanAK database, Alaska, 1921-2017. Knowledge Network for Biocomplexity.

      • Right-click and copy address for the file ADFG_firstAttempt_reformatted.csv
  2. Clean
    1. Clean the escapement goals dataset
      1. Filter to just the Bristol Bay region and the Sockeye salmon species
      2. Check whether the column types are wrong and fix any issues (Hint: One column has the wrong type)
    2. Clean the escapement counts dataset
      1. Filter to just the Bristol Bay region and the Sockeye salmon species
      2. Filter to just stocks we have escapement goals for
      3. Create new columns for the year, month, and day so we can calculate total escapements by year and stock
    3. Calculate annual total escapements for each stock
  3. Integrate
    • Join the escapement goal lower and upper bounds onto the annual total escapement counts (Hint: We don’t need all the columns)
  4. Analyze
    • Make a table listing annual total escapements and whether they were in the escapement goal range or not
    • Calculate the proportion of years, for each stock, total escapement was within the escapement goal range

8.3.1 Visual schematic of steps

Make this:

        System   Lower    Upper
 Kvichak River 2000000 10000000
  Naknek River  800000  2000000
  Egegik River  800000  2000000
 Ugashik River  500000  1400000
    Wood River  700000  1800000
 Igushik River  150000   400000
Nushagak River  260000   760000
Nushagak River  370000   900000
  Togiak River  120000   270000

and then make this:

     Location  Year Escapement
 Egegik River  2012    1233900
 Egegik River  2013    1113630
 Egegik River  2014    1382466
 Egegik River  2015    2160792
 Egegik River  2016    1837260
Igushik River  2012     193326
Igushik River  2013     387744
Igushik River  2014     340590
Igushik River  2015     651172
Igushik River  2016     469230

and join them together to make this:

     Location  Year Escapement  Lower   Upper is_in_range
 Egegik River  2012    1233900 800000 2000000        TRUE
 Egegik River  2013    1113630 800000 2000000        TRUE
 Egegik River  2014    1382466 800000 2000000        TRUE
 Egegik River  2015    2160792 800000 2000000       FALSE
 Egegik River  2016    1837260 800000 2000000        TRUE
Igushik River  2012     193326 150000  400000        TRUE
Igushik River  2013     387744 150000  400000        TRUE
Igushik River  2014     340590 150000  400000        TRUE
Igushik River  2015     651172 150000  400000       FALSE
Igushik River  2016     469230 150000  400000       FALSE

8.4 Full solution

Warning: Spoilers!

First we’ll load our packages:

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(DT) # Just for display purposes
})

Then download our two data files and save them as data.frames:

# https://knb.ecoinformatics.org/#view/urn:uuid:8809a404-f6e1-46a2-91c8-f094c3814b47
# Search "OceanAK"
esc <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/knb.92020.1", method = "libcurl"), 
                    stringsAsFactors = FALSE)

# https://knb.ecoinformatics.org/#view/urn:uuid:2c13bb02-6148-4ab6-8ad5-3a7c03f8642e
# Search "escapement goals", choose 2007-2015 dataset
goals <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/knb.92014.1", method = "libcurl"),
                  stringsAsFactors = FALSE)

First, we’ll clean up the escapement goals data.frame to have just the rows and columns we need and display it:

bb_sockeye_goals <- goals %>% 
  filter(Region == "Bristol Bay", Species == "Sockeye") %>% 
  mutate(Lower = as.integer(Lower)) %>% 
  select(System, Lower, Upper) %>% 
  drop_na()
## Warning in evalq(as.integer(Lower), <environment>): NAs introduced by
## coercion
datatable(bb_sockeye_goals)

Then we’ll clean up and summarize the escapement counts data.frame, join the escapement goals data.frame onto it, and calculate whether goals have been met:

bb_sockeye_escapements <- esc %>% 
  filter(SASAP.Region == "Bristol Bay", 
         Species == "Sockeye",
         Location %in% bb_sockeye_goals$System) %>%
  separate(sampleDate, c("Year", "Month", "Day"), sep = "-") %>% 
  group_by(Location, Year) %>% 
  summarize(Escapement = sum(DailyCount)) %>% 
  left_join(bb_sockeye_goals %>% select(System, Lower, Upper), by = c("Location" = "System")) %>% 
  mutate(is_in_range = Escapement >= Lower & Escapement <= Upper)

datatable(bb_sockeye_escapements)