Data wrangling slide deck

Introduction

Philosophy of the tidyverse

Tidy data make your life easier. Data structures should match intuition and common sense: rows are observations, columns are variables. Tidy data also increase the likelihood that others can use your data, do better science, reuse your work, and help you and your ideas survive and thrive. A workflow should also include the wrangling you did to get your data ready. If data are already very clean in a spreadsheet, they can easily become a literate, logical dataframe. Nonetheless, you should still annotate your introductory code to explain the metadata of your dataset and what you did pre-R to get it tidy. The philosophy here is very similar to the forthcoming data viz lesson, with two dominant paradigms: base R functions, and pipes (%>%) with the logic embodied in the libraries of the tidyverse. Generally, I prefer the tidyverse because it is more logical and much easier to remember. It also offers specific functions that address common problems in modern data structures with very little code.

Why to love the tidyverse

Learning outcomes

  1. To appreciate the differences between base R and tidyverse packages in terms of function and grammar.
  2. To be able to use dplyr functions to wrangle data and solve common challenges in datasets.
  3. To be able to check and address missing values.
  4. To be able to grab part of a dataset.
  5. To use pipes to move/wrangle chunks of your dataset.

Key concepts listed

Data wrangling

Base R key concepts: aggregate, tapply, sapply, lapply, subsetting, as.factor, is.numeric, is.na

tidyverse key concepts: pipes are your best friend! %>%

dplyr key concepts: filter for rows, select for columns, mutate for new variables, summarise for collapsing many values into a summary
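A minimal sketch of those four verbs chained with pipes, using the built-in mtcars data so no files are needed (the derived column name is made up for illustration):

```r
library(dplyr)

# filter rows, select columns, mutate a new variable, summarise by group
mtcars %>%
  filter(cyl > 4) %>%                  # rows: cars with more than 4 cylinders
  select(cyl, mpg, wt) %>%             # columns: keep three variables
  mutate(mpg_per_wt = mpg / wt) %>%    # new variable from existing ones
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))      # collapse each group to one row
```

Each verb takes a dataframe in and returns a dataframe out, which is exactly why they chain so cleanly.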

Additional resources

Excellent list of wrangling tools

Cheatsheet

tidyr

Great wrangling webinar

Here are some exercises to address basic challenges you will certainly encounter.

Exercise 1. Missing data

Missing data are typically coded as NA. Missing data are not necessarily the same as true zeros. See Mixed Effects Models and Extensions in Ecology with R for an excellent discussion of this. Zeros arising from experimental design, observer error, or a failure to detect a process are ‘false zeros’ and should be treated as missing data. True negatives (measured and detected absences) should be coded as zero.

The read_csv function from the tidyverse (readr) handles data import well: it does not coerce strings to factors and takes a parsimonious approach to guessing column types. The base R analog is read.csv, which works but comes with more limitations.
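To see the difference without hunting for a file, here is a small sketch that writes a two-column csv to a temporary file and imports it with readr (the column and file names are invented for the demo):

```r
library(readr)

tmp <- tempfile(fileext = ".csv")                # throwaway file for the demo
writeLines(c("site,count", "A,10", "B,NA"), tmp)

base_df <- read.csv(tmp)   # base R returns a data.frame (and, in older R
                           # versions, silently converted strings to factors)
tidy_df <- read_csv(tmp)   # readr returns a tibble; "NA" is parsed as missing
class(tidy_df$site)        # character, never factor
```

read_csv also reports the column specification it guessed, as you will see in the output below.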

#Missing data. Some functions throw an error when cells/observations are missing; others silently ignore them, which can itself introduce error.
#setwd('data-wrangling')
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
ttc <- read_csv("data/ttc.csv") #The Toronto Transit Commission data
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   `FARE MEDIA` = col_character()
## )
## See spec(...) for full column specifications.
ttc #read_csv produces a tibble rather than a data.frame; tibbles print each column's type at the top of the output
## # A tibble: 31 x 32
##            `FARE MEDIA` `2015` `2014` `2013` `2012` `2011` `2010` `2009`
##                   <chr>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
##  1                ADULT     NA     NA     NA     NA     NA     NA     NA
##  2               TOKENS 110945 111157 112360 117962 124748 120366 114686
##  3              TICKETS     NA     NA     NA     NA     NA   1298   8807
##  4             TWO-FARE     NA     NA     NA     NA     NA     NA     NA
##  5               PRESTO  13323   9862   8194   4399   1139      0      0
##  6 REGULAR MONTHLY PASS 204509 214932 213982 205086 194928 203101 208172
##  7  POST-SECONDARY PASS  48396  42855  38426  35019  32091   9200     NA
##  8         TWIN-GO PASS     NA     NA     NA     NA     NA     NA     NA
##  9          WEEKLY PASS   8843   9361   9557  10185   9893   9237   8738
## 10                 CASH  48873  49120  48623  46467  43795  43149  41445
## # ... with 21 more rows, and 24 more variables: `2008` <int>,
## #   `2007` <int>, `2006` <int>, `2005` <int>, `2004` <int>, `2003` <int>,
## #   `2002` <int>, `2001` <int>, `2000` <int>, `1999` <int>, `1998` <int>,
## #   `1997` <int>, `1996` <int>, `1995` <int>, `1994` <int>, `1993` <int>,
## #   `1992` <int>, `1991` <int>, `1990` <int>, `1989` <int>, `1988` <int>,
## #   `1987` <int>, `1986` <int>, `1985` <int>
#check for missing values
is.na(ttc) #returns a logical vector, true is missing (i.e. NA), false is present
##       FARE MEDIA  2015  2014  2013  2012  2011  2010  2009  2008  2007
##  [1,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [2,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [4,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [5,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [6,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
##  [8,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [9,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [17,]      FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [18,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [21,]      FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [22,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,]      FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [24,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [30,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##        2006  2005  2004  2003  2002  2001  2000  1999  1998  1997  1996
##  [1,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [5,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [8,]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [17,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [21,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##        1995  1994  1993  1992  1991  1990  1989  1988  1987  1986  1985
##  [1,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [5,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [9,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [17,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [21,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
summary(ttc, na.rm=TRUE) #summary stats exclude NAs and report their count per column
##   FARE MEDIA             2015             2014             2013       
##  Length:31          Min.   :    10   Min.   :    12   Min.   :   401  
##  Class :character   1st Qu.:  1076   1st Qu.:  6087   1st Qu.:  7118  
##  Mode  :character   Median : 12170   Median : 10802   Median : 10850  
##                     Mean   : 68974   Mean   : 75560   Mean   : 77843  
##                     3rd Qu.: 48634   3rd Qu.: 49120   3rd Qu.: 52732  
##                     Max.   :534005   Max.   :534815   Max.   :525194  
##                     NA's   :8        NA's   :10       NA's   :11      
##       2012             2011             2010             2009       
##  Min.   :   372   Min.   :   344   Min.   :     0   Min.   :     0  
##  1st Qu.:  5141   1st Qu.:  4840   1st Qu.:  2539   1st Qu.:  4747  
##  Median : 11224   Median : 10690   Median :  9237   Median :  9844  
##  Mean   : 76162   Mean   : 74148   Mean   : 67353   Mean   : 69782  
##  3rd Qu.: 51249   3rd Qu.: 49146   3rd Qu.: 43149   3rd Qu.: 46170  
##  Max.   :514007   Max.   :500219   Max.   :477357   Max.   :471233  
##  NA's   :11       NA's   :11       NA's   :10       NA's   :11      
##       2008             2007             2006             2005       
##  Min.   :   310   Min.   :   295   Min.   :    58   Min.   :    93  
##  1st Qu.:  5334   1st Qu.:  4752   1st Qu.:  3978   1st Qu.:  3261  
##  Median : 11035   Median : 10892   Median : 10120   Median :  9606  
##  Mean   : 72806   Mean   : 71736   Mean   : 65906   Mean   : 63984  
##  3rd Qu.: 49701   3rd Qu.: 62491   3rd Qu.: 61156   3rd Qu.: 63630  
##  Max.   :466700   Max.   :459769   Max.   :444544   Max.   :431220  
##  NA's   :12       NA's   :12       NA's   :11       NA's   :11      
##       2004             2003             2002             2001       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  4184   1st Qu.:  3805   1st Qu.:  3264   1st Qu.:  2800  
##  Median :  9940   Median : 10586   Median : 10404   Median : 10765  
##  Mean   : 65379   Mean   : 63412   Mean   : 61538   Mean   : 62471  
##  3rd Qu.: 66028   3rd Qu.: 65337   3rd Qu.: 64830   3rd Qu.: 65670  
##  Max.   :418099   Max.   :405412   Max.   :415539   Max.   :419993  
##  NA's   :12       NA's   :12       NA's   :11       NA's   :11      
##       2000             1999             1998             1997       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  2558   1st Qu.:  2370   1st Qu.:  2283   1st Qu.:  1834  
##  Median : 10593   Median :  9655   Median :  9530   Median :  9157  
##  Mean   : 61079   Mean   : 58365   Mean   : 57806   Mean   : 56543  
##  3rd Qu.: 64276   3rd Qu.: 62428   3rd Qu.: 64004   3rd Qu.: 65008  
##  Max.   :410558   Max.   :392593   Max.   :388689   Max.   :379883  
##  NA's   :11       NA's   :11       NA's   :11       NA's   :11      
##       1996             1995             1994             1993       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  1652   1st Qu.:  1976   1st Qu.:  2111   1st Qu.:  2507  
##  Median :  9098   Median : 11338   Median : 12192   Median : 12544  
##  Mean   : 52806   Mean   : 55031   Mean   : 55096   Mean   : 58668  
##  3rd Qu.: 67164   3rd Qu.: 70369   3rd Qu.: 62700   3rd Qu.: 61242  
##  Max.   :372430   Max.   :388152   Max.   :388252   Max.   :393485  
##  NA's   :10       NA's   :10       NA's   :10       NA's   :11      
##       1992             1991             1990             1989       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  1858   1st Qu.:  2904   1st Qu.:  2914   1st Qu.:  2388  
##  Median : 11385   Median : 14268   Median : 15488   Median : 15177  
##  Mean   : 60409   Mean   : 66712   Mean   : 72297   Mean   : 70940  
##  3rd Qu.: 60610   3rd Qu.: 64236   3rd Qu.: 70048   3rd Qu.: 69216  
##  Max.   :404251   Max.   :424167   Max.   :459234   Max.   :450726  
##  NA's   :11       NA's   :12       NA's   :12       NA's   :12      
##       1988             1987             1986             1985       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  2525   1st Qu.:  3608   1st Qu.:  3427   1st Qu.:  2519  
##  Median : 16369   Median : 14344   Median : 13996   Median : 15652  
##  Mean   : 72982   Mean   : 75921   Mean   : 73280   Mean   : 76044  
##  3rd Qu.: 71772   3rd Qu.: 76750   3rd Qu.: 74446   3rd Qu.: 76829  
##  Max.   :463475   Max.   :456884   Max.   :441012   Max.   :432160  
##  NA's   :12       NA's   :13       NA's   :13       NA's   :14
new.ttc <- na.omit(ttc) #drops every row that contains any missing value
is.na(new.ttc) #check to see if it worked
##       FARE MEDIA  2015  2014  2013  2012  2011  2010  2009  2008  2007
##  [1,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [6,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [8,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##        2006  2005  2004  2003  2002  2001  2000  1999  1998  1997  1996
##  [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##        1995  1994  1993  1992  1991  1990  1989  1988  1987  1986  1985
##  [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
setdiff(ttc, new.ttc) #which rows did na.omit() drop?
## # A tibble: 16 x 32
##           `FARE MEDIA` `2015` `2014` `2013` `2012` `2011` `2010` `2009`
##                  <chr>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
##  1               ADULT     NA     NA     NA     NA     NA     NA     NA
##  2             TICKETS     NA     NA     NA     NA     NA   1298   8807
##  3            TWO-FARE     NA     NA     NA     NA     NA     NA     NA
##  4              PRESTO  13323   9862   8194   4399   1139      0      0
##  5 POST-SECONDARY PASS  48396  42855  38426  35019  32091   9200     NA
##  6        TWIN-GO PASS     NA     NA     NA     NA     NA     NA     NA
##  7         WEEKLY PASS   8843   9361   9557  10185   9893   9237   8738
##  8      SENIOR/STUDENT     NA     NA     NA     NA     NA     NA     NA
##  9            TWO-FARE     NA     NA     NA     NA     NA     NA     NA
## 10              PRESTO    438     12     NA     NA     NA     NA     NA
## 11            CHILDREN     NA     NA     NA     NA     NA     NA     NA
## 12          FREE RIDES  10939     NA     NA     NA     NA     NA     NA
## 13              PRESTO     10     NA     NA     NA     NA     NA     NA
## 14     PREMIUM EXPRESS    490    451    401    372    344    322    313
## 15     POSTAL CARRIERS     NA     NA     NA     NA     NA     NA     NA
## 16            GTA PASS   5471   6087   5784   5388   5642   5667   5800
## # ... with 24 more variables: `2008` <int>, `2007` <int>, `2006` <int>,
## #   `2005` <int>, `2004` <int>, `2003` <int>, `2002` <int>, `2001` <int>,
## #   `2000` <int>, `1999` <int>, `1998` <int>, `1997` <int>, `1996` <int>,
## #   `1995` <int>, `1994` <int>, `1993` <int>, `1992` <int>, `1991` <int>,
## #   `1990` <int>, `1989` <int>, `1988` <int>, `1987` <int>, `1986` <int>,
## #   `1985` <int>
#many other solutions exist, but I use these two frequently
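For reference, a few of those other solutions, sketched on a toy tibble (the names and values here are made up, standing in for ttc):

```r
library(tidyverse)

fares <- tibble(media = c("TOKENS", "PRESTO", "TICKETS"),
                y2015 = c(110945, 13323, NA))

fares %>% drop_na()                             # tidyr analogue of na.omit()
fares %>% filter(!is.na(y2015))                 # drop rows missing one column only
fares %>% mutate(y2015 = replace_na(y2015, 0))  # only if NA truly means zero!
colSums(is.na(fares))                           # count NAs per column
```

The replace_na() line is exactly where the false-zero warning above bites: only recode NA to 0 when you measured an absence.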

Exercise 2. Selecting part of a dataset

dplyr provides select (columns/vectors), filter (rows/observations), and now pull (returns a vector rather than a data frame or tibble). These are fantastic functions (and beasts) for wrangling out specifics or simplifying your dataframe.

survey<-read_csv("data/5081.survey.1.csv")
## Parsed with column specification:
## cols(
##   r.experience = col_integer(),
##   discipline = col_character(),
##   research.data = col_character(),
##   r.studio = col_character()
## )
survey
## # A tibble: 18 x 4
##    r.experience            discipline research.data r.studio
##           <int>                 <chr>         <chr>    <chr>
##  1            2            physiology  quantitative       No
##  2            1              genetics  quantitative      Yes
##  3            2            physiology  quantitative      Yes
##  4            3            physiology  quantitative      Yes
##  5            1               ecology  quantitative       No
##  6            1            physiology  quantitative       No
##  7            3               ecology  quantitative      Yes
##  8            1               ecology   qualitative       No
##  9            1               ecology  quantitative      Yes
## 10            3              genetics  quantitative       No
## 11            3               ecology  quantitative      Yes
## 12            1               ecology  quantitative       No
## 13            1            physiology  quantitative       No
## 14            2            physiology  quantitative      Yes
## 15            1 environmental science  quantitative      Yes
## 16            1            physiology  quantitative       No
## 17            1            physiology  quantitative       No
## 18            2            physiology  quantitative      Yes
#I want just a simple tibble with experience by discipline for students new to rstats in a graduate-level course.

experience <- survey %>% select(discipline, r.experience)
experience
## # A tibble: 18 x 2
##               discipline r.experience
##                    <chr>        <int>
##  1            physiology            2
##  2              genetics            1
##  3            physiology            2
##  4            physiology            3
##  5               ecology            1
##  6            physiology            1
##  7               ecology            3
##  8               ecology            1
##  9               ecology            1
## 10              genetics            3
## 11               ecology            3
## 12               ecology            1
## 13            physiology            1
## 14            physiology            2
## 15 environmental science            1
## 16            physiology            1
## 17            physiology            1
## 18            physiology            2
#Now I just want to select the physiology folks
physiologists <- experience %>% filter(discipline == "physiology")
physiologists
## # A tibble: 9 x 2
##   discipline r.experience
##        <chr>        <int>
## 1 physiology            2
## 2 physiology            2
## 3 physiology            3
## 4 physiology            1
## 5 physiology            1
## 6 physiology            2
## 7 physiology            1
## 8 physiology            1
## 9 physiology            2
#Selections also often include a summary by levels, or a new column with some calculations. Think about what you have likely done in excel.

#used pipes to make a nice summary table
experience <- survey %>% group_by(discipline) %>% summarise(
  count = n(),
  exp = mean(r.experience)
)

Exercise 3. Updated dplyr package

Explore the Star Wars dataset bundled with the updated dplyr 0.7.0 package and make a new column. Full details are listed on the dplyr blog.


library(tidyverse)
data <- starwars #setup as tibble
data #quick look
## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
#simplify species
data <- data %>% mutate(taxa = ifelse(species == "Droid", "machine", "biological"))
data <- data[!is.na(data$taxa), ] #remove NAs from one vector only

#note: ifelse() calls are messy and easily become nested and long-winded. The dplyr 0.7.0 release notes suggest case_when() instead.

#count up by gender
counts <- data %>% group_by(gender) %>% count()
counts <- counts[!is.na(counts$gender), ] #remove NAs from one vector only
counts$gender <- factor(counts$gender, levels = counts$gender[order(counts$n)])

#I love the count and tally functions! There is also a new set of similar functions, add_count() and add_tally(), for adding an n column within groups
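As noted above, nested ifelse() calls get unwieldy; here is a sketch of the same taxa column rebuilt with case_when() (starwars ships with dplyr, so this runs as-is):

```r
library(dplyr)

sw <- starwars %>%
  mutate(taxa = case_when(
    species == "Droid" ~ "machine",
    !is.na(species)    ~ "biological"
  ))            # rows matching no condition become NA automatically

count(sw, taxa) # count() collapses the group_by + tally dance into one call
```

Conditions are checked in order and the first match wins, so the logic reads top to bottom instead of inside out.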


Data viz

#library(plotly)
p <- ggplot(data, aes(height, mass, color = taxa)) + geom_point(size = 3, alpha = 0.5) +
  scale_color_brewer(palette = "Dark2")
#ggplotly(p)
p

p <- ggplot(counts, aes(gender, n)) + geom_bar(stat = "identity", fill = "forestgreen", width = 0.25) + coord_flip()
p


Exercise 4. Two-table wrangle challenge

Data are often distributed amongst different tables and files. The updated dplyr package (and its tidyverse friends) provides an excellent set of verbs to address this challenge.

Take a moment and review the brief blog post here describing the two categories of tools. Now, on your own, examine two datasets using the three families of tools (mutating joins, filtering joins, and set operations). Select ONE of the following challenges.
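Before diving into the real files, a toy sketch of the three families on two invented tables (table and column names here are hypothetical):

```r
library(dplyr)

trips   <- tibble(cruise_id = c(1, 2, 3), vessel = c("A", "B", "A"))
catches <- tibble(cruise_id = c(1, 1, 3, 4), n_fish = c(10, 5, 7, 2))

left_join(trips, catches, by = "cruise_id")   # mutating join: adds n_fish column
semi_join(trips, catches, by = "cruise_id")   # filtering join: trips with a catch
anti_join(trips, catches, by = "cruise_id")   # filtering join: trips without one
intersect(trips$cruise_id, catches$cruise_id) # set operation on the keys
```

Notice that left_join can return more rows than the left table when a key matches several rows on the right; anti_join is the quickest way to audit which records fail to match.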

SEAMAP Challenge

seamap <- read_csv("data/BGSREC.csv") #all meta-data listed in the PDF in directory
seamap
## # A tibble: 756,831 x 20
##    BGSID CRUISEID STATIONID VESSEL CRUISE_NO P_STA_NO CATEGORY GENUS_BGS
##    <int>    <int>     <int>  <int>     <int>    <chr>    <int>     <chr>
##  1     1      581         4      4       256    00004        3   RHIZOPR
##  2     2      581         4      4       256    00004        3   SPHYRNA
##  3     3      581         4      4       256    00004        3   HARENGU
##  4     4      581         4      4       256    00004        3   OPISTHO
##  5     5      581         4      4       256    00004        3   SARDINE
##  6     6      581         4      4       256    00004        3    ANCHOA
##  7     7      581         4      4       256    00004        3   SYNODUS
##  8     8      581         4      4       256    00004        3   SPHYRAE
##  9     9      581         4      4       256    00004        3   ALECTIS
## 10    10      581         4      4       256    00004        3    CARANX
## # ... with 756,821 more rows, and 12 more variables: SPEC_BGS <chr>,
## #   BGSCODE <chr>, CNT <int>, CNTEXP <int>, SAMPLE_BGS <dbl>,
## #   SELECT_BGS <dbl>, BIO_BGS <int>, NODC_BGS <int>, IS_SAMPLE <chr>,
## #   TAXONID <chr>, INVRECID <chr>, X20 <chr>
cruises <- read_csv("data/CRUISES.csv")
cruises
## # A tibble: 941 x 12
##    CRUISEID    YR SOURCE VESSEL CRUISE_NO   STARTCRU     ENDCRU
##       <int> <int>  <chr>  <chr>     <chr>     <date>     <date>
##  1        1  1982     US     04       125 1982-02-24 1982-03-31
##  2        2  1982     US     04       126 1982-04-15 1982-05-25
##  3        3  1982     FL     09       821 1982-05-16 1982-05-16
##  4        4  1982     LA     25       821 1982-06-01 1982-07-13
##  5        5  1982     US     04       127 1982-06-01 1982-07-14
##  6        6  1982     MS     17       821 1982-06-06 1982-06-10
##  7        7  1982     AL     23       821 1982-06-08 1982-06-03
##  8        8  1982     FL     09       822 1982-06-08 1982-06-13
##  9        9  1982     US     20       823 1982-06-16 1982-07-06
## 10       10  1982     FL     09       823 1982-06-20 1982-06-22
## # ... with 931 more rows, and 5 more variables: TITLE <chr>, NOTE <int>,
## #   INGEST_SOURCE <chr>, INGEST_PROGRAM_VER <chr>, X12 <chr>

Whole-brain Challenge

brains <- read_csv("data/brains.csv")
brains
## # A tibble: 14 x 8
##       ID analytical practical relational experimental   hair handedness
##    <int>      <int>     <int>      <int>        <int>  <chr>      <chr>
##  1     1          5         4          4            7  brown      right
##  2     2          6         5          5            4  brown      right
##  3     3          2         4          9            5  other      right
##  4     4          6         5          6            3 blonde      right
##  5     5          6         2          5            7 blonde      right
##  6     6          6         3          3            5 blonde      right
##  7     7          4         3          4            6 blonde      right
##  8     8          8         6          4            2  other      right
##  9     9          6         7          3            4  brown      right
## 10    10          4         6          4            6    red      right
## 11    11          5         4          5            6  brown      right
## 12    12          5         5          5            5  brown      right
## 13    13          7         3          5            5  other      right
## 14    14          3         5          5            7  brown       left
## # ... with 1 more variables: languages <int>
ninja.scores <- read_csv("data/ninja.scores.csv")
ninja.scores
## # A tibble: 25 x 2
##       ID mean.ninja
##    <int>      <dbl>
##  1     1   3.142857
##  2     2   2.428571
##  3     3   3.285714
##  4     4   3.428571
##  5     5   2.142857
##  6     6   2.857143
##  7     7   3.000000
##  8     8   3.142857
##  9     9   2.142857
## 10    10   3.571429
## # ... with 15 more rows


Interpretation of the tidyverse

  1. Even a simple dataframe can provide adequate complexity for padawan and jedi alike.
  2. Literate coding is embodied with the tidyverse structure.
  3. The code is not verbose but gets the job done.
  4. Just a few keystrokes and a nice code snippet can allow you to surf an ocean of data. The code is not limited by the depth or width of the data. The term wrangling often implies rounding up individuals into groups for management, but really the power of dplyr and the tidyverse is in providing coherent, clear code grammar that works with almost all data (distributed, deep, and diverse) provided the structure can be tidied.