Philosophy of tidyverse
Tidy data make your life easier. Data structures should match intuition and common sense: rows are observations, columns are variables. Tidy data also increase the likelihood that others can use your data, do better science, reuse your work, and help you and your ideas survive and thrive. A workflow should also include the wrangling you did to get your data ready. If data are already very clean in a spreadsheet, they can easily become a literate, logical dataframe. Nonetheless, you should still annotate the introductory code to explain the meta-data of your dataset and what you did pre-R to get it tidy. The philosophy here is very similar to the forthcoming data viz lesson, with two dominant paradigms: base R functions, and pipes %>% with the logic embodied in the libraries of the tidyverse. Generally, I prefer the tidyverse because it is more logical and much easier to remember. It also has specific functions that address common errors in modern data structures with very little code.
Data wrangling
Base R key concepts: aggregate, tapply, sapply, lapply, subsetting, as.factor, is.numeric, NA handling
tidyverse key concepts: pipes are your best friend! %>%
dplyr key verbs: filter for rows, select for columns, mutate for new variables, summarise for collapsing many values to one
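As a minimal sketch of how these verbs chain together with pipes (using a made-up tibble, not the course data):

```r
library(dplyr)

plots <- tibble(site = c("a", "a", "b", "b"), count = c(3, 5, 2, 8))

plots %>%
  filter(count > 2) %>%                 # rows: keep observations with count > 2
  select(site, count) %>%               # columns: keep only these variables
  mutate(log.count = log(count)) %>%    # new variable from an existing one
  group_by(site) %>%
  summarise(mean.count = mean(count))   # collapse many values to one per group
```

Each verb takes a dataframe/tibble in and returns one out, which is exactly why pipes read so naturally.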
Excellent list of wrangling tools
Here are some exercises to address basic challenges you will certainly encounter.
Missing data are typically coded as NA. Missing data are not necessarily the same as true zeros. See Mixed Effects Models and Extensions in Ecology with R for an excellent discussion of this. Missing values due to experimental design, observer error, or failures to detect a process are 'false zeros' and should be treated as missing data. True negatives (measured, detected absences) should be coded as zero.
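The distinction matters in practice: a true zero contributes to a mean, while an NA must be explicitly excluded. A base-R illustration with made-up plot counts:

```r
# Five plots: three counts, one true absence (0), one plot never surveyed (NA)
counts <- c(12, 7, 3, 0, NA)

mean(counts)               # NA - the missing plot propagates
mean(counts, na.rm = TRUE) # 5.5 - the true zero still pulls the mean down
```

Coding the unsurveyed plot as 0 instead would silently bias the mean downward, which is exactly the false-zero trap.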
The read_csv function from the tidyverse handles data import well, does not assume factors, and typically takes a parsimonious approach to vector interpretation. The base R analog, read.csv, comes with many limitations.
#Missing data. Missing cells/observations can make some functions return an error; others silently ignore them, which can itself introduce error into an analysis.
#setwd('data-wrangling')
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
ttc <-read_csv("data/ttc.csv") #The Toronto Transit Commission Data.
## Parsed with column specification:
## cols(
## .default = col_integer(),
## `FARE MEDIA` = col_character()
## )
## See spec(...) for full column specifications.
ttc #read_csv produces a tibble rather than a dataframe. Tibbles print a str()-like list of vector types at the top of the output.
## # A tibble: 31 x 32
## `FARE MEDIA` `2015` `2014` `2013` `2012` `2011` `2010` `2009`
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 ADULT NA NA NA NA NA NA NA
## 2 TOKENS 110945 111157 112360 117962 124748 120366 114686
## 3 TICKETS NA NA NA NA NA 1298 8807
## 4 TWO-FARE NA NA NA NA NA NA NA
## 5 PRESTO 13323 9862 8194 4399 1139 0 0
## 6 REGULAR MONTHLY PASS 204509 214932 213982 205086 194928 203101 208172
## 7 POST-SECONDARY PASS 48396 42855 38426 35019 32091 9200 NA
## 8 TWIN-GO PASS NA NA NA NA NA NA NA
## 9 WEEKLY PASS 8843 9361 9557 10185 9893 9237 8738
## 10 CASH 48873 49120 48623 46467 43795 43149 41445
## # ... with 21 more rows, and 24 more variables: `2008` <int>,
## # `2007` <int>, `2006` <int>, `2005` <int>, `2004` <int>, `2003` <int>,
## # `2002` <int>, `2001` <int>, `2000` <int>, `1999` <int>, `1998` <int>,
## # `1997` <int>, `1996` <int>, `1995` <int>, `1994` <int>, `1993` <int>,
## # `1992` <int>, `1991` <int>, `1990` <int>, `1989` <int>, `1988` <int>,
## # `1987` <int>, `1986` <int>, `1985` <int>
#check for missing values
is.na(ttc) #returns a logical matrix: TRUE is missing (i.e. NA), FALSE is present
## FARE MEDIA 2015 2014 2013 2012 2011 2010 2009 2008 2007
## [1,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [4,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [8,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [17,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [21,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [17,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [21,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1995 1994 1993 1992 1991 1990 1989 1988 1987 1986 1985
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [17,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [21,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
summary(ttc, na.rm=TRUE) #excludes NAs from the statistics and reports their count per column
## FARE MEDIA 2015 2014 2013
## Length:31 Min. : 10 Min. : 12 Min. : 401
## Class :character 1st Qu.: 1076 1st Qu.: 6087 1st Qu.: 7118
## Mode :character Median : 12170 Median : 10802 Median : 10850
## Mean : 68974 Mean : 75560 Mean : 77843
## 3rd Qu.: 48634 3rd Qu.: 49120 3rd Qu.: 52732
## Max. :534005 Max. :534815 Max. :525194
## NA's :8 NA's :10 NA's :11
## 2012 2011 2010 2009
## Min. : 372 Min. : 344 Min. : 0 Min. : 0
## 1st Qu.: 5141 1st Qu.: 4840 1st Qu.: 2539 1st Qu.: 4747
## Median : 11224 Median : 10690 Median : 9237 Median : 9844
## Mean : 76162 Mean : 74148 Mean : 67353 Mean : 69782
## 3rd Qu.: 51249 3rd Qu.: 49146 3rd Qu.: 43149 3rd Qu.: 46170
## Max. :514007 Max. :500219 Max. :477357 Max. :471233
## NA's :11 NA's :11 NA's :10 NA's :11
## 2008 2007 2006 2005
## Min. : 310 Min. : 295 Min. : 58 Min. : 93
## 1st Qu.: 5334 1st Qu.: 4752 1st Qu.: 3978 1st Qu.: 3261
## Median : 11035 Median : 10892 Median : 10120 Median : 9606
## Mean : 72806 Mean : 71736 Mean : 65906 Mean : 63984
## 3rd Qu.: 49701 3rd Qu.: 62491 3rd Qu.: 61156 3rd Qu.: 63630
## Max. :466700 Max. :459769 Max. :444544 Max. :431220
## NA's :12 NA's :12 NA's :11 NA's :11
## 2004 2003 2002 2001
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 4184 1st Qu.: 3805 1st Qu.: 3264 1st Qu.: 2800
## Median : 9940 Median : 10586 Median : 10404 Median : 10765
## Mean : 65379 Mean : 63412 Mean : 61538 Mean : 62471
## 3rd Qu.: 66028 3rd Qu.: 65337 3rd Qu.: 64830 3rd Qu.: 65670
## Max. :418099 Max. :405412 Max. :415539 Max. :419993
## NA's :12 NA's :12 NA's :11 NA's :11
## 2000 1999 1998 1997
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2558 1st Qu.: 2370 1st Qu.: 2283 1st Qu.: 1834
## Median : 10593 Median : 9655 Median : 9530 Median : 9157
## Mean : 61079 Mean : 58365 Mean : 57806 Mean : 56543
## 3rd Qu.: 64276 3rd Qu.: 62428 3rd Qu.: 64004 3rd Qu.: 65008
## Max. :410558 Max. :392593 Max. :388689 Max. :379883
## NA's :11 NA's :11 NA's :11 NA's :11
## 1996 1995 1994 1993
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1652 1st Qu.: 1976 1st Qu.: 2111 1st Qu.: 2507
## Median : 9098 Median : 11338 Median : 12192 Median : 12544
## Mean : 52806 Mean : 55031 Mean : 55096 Mean : 58668
## 3rd Qu.: 67164 3rd Qu.: 70369 3rd Qu.: 62700 3rd Qu.: 61242
## Max. :372430 Max. :388152 Max. :388252 Max. :393485
## NA's :10 NA's :10 NA's :10 NA's :11
## 1992 1991 1990 1989
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1858 1st Qu.: 2904 1st Qu.: 2914 1st Qu.: 2388
## Median : 11385 Median : 14268 Median : 15488 Median : 15177
## Mean : 60409 Mean : 66712 Mean : 72297 Mean : 70940
## 3rd Qu.: 60610 3rd Qu.: 64236 3rd Qu.: 70048 3rd Qu.: 69216
## Max. :404251 Max. :424167 Max. :459234 Max. :450726
## NA's :11 NA's :12 NA's :12 NA's :12
## 1988 1987 1986 1985
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2525 1st Qu.: 3608 1st Qu.: 3427 1st Qu.: 2519
## Median : 16369 Median : 14344 Median : 13996 Median : 15652
## Mean : 72982 Mean : 75921 Mean : 73280 Mean : 76044
## 3rd Qu.: 71772 3rd Qu.: 76750 3rd Qu.: 74446 3rd Qu.: 76829
## Max. :463475 Max. :456884 Max. :441012 Max. :432160
## NA's :12 NA's :13 NA's :13 NA's :14
new.ttc <- na.omit(ttc) #drops every row that contains a missing value
is.na(new.ttc) #check to see if it worked
## FARE MEDIA 2015 2014 2013 2012 2011 2010 2009 2008 2007
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1995 1994 1993 1992 1991 1990 1989 1988 1987 1986 1985
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
setdiff(ttc, new.ttc) #the rows that na.omit dropped
## # A tibble: 16 x 32
## `FARE MEDIA` `2015` `2014` `2013` `2012` `2011` `2010` `2009`
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 ADULT NA NA NA NA NA NA NA
## 2 TICKETS NA NA NA NA NA 1298 8807
## 3 TWO-FARE NA NA NA NA NA NA NA
## 4 PRESTO 13323 9862 8194 4399 1139 0 0
## 5 POST-SECONDARY PASS 48396 42855 38426 35019 32091 9200 NA
## 6 TWIN-GO PASS NA NA NA NA NA NA NA
## 7 WEEKLY PASS 8843 9361 9557 10185 9893 9237 8738
## 8 SENIOR/STUDENT NA NA NA NA NA NA NA
## 9 TWO-FARE NA NA NA NA NA NA NA
## 10 PRESTO 438 12 NA NA NA NA NA
## 11 CHILDREN NA NA NA NA NA NA NA
## 12 FREE RIDES 10939 NA NA NA NA NA NA
## 13 PRESTO 10 NA NA NA NA NA NA
## 14 PREMIUM EXPRESS 490 451 401 372 344 322 313
## 15 POSTAL CARRIERS NA NA NA NA NA NA NA
## 16 GTA PASS 5471 6087 5784 5388 5642 5667 5800
## # ... with 24 more variables: `2008` <int>, `2007` <int>, `2006` <int>,
## # `2005` <int>, `2004` <int>, `2003` <int>, `2002` <int>, `2001` <int>,
## # `2000` <int>, `1999` <int>, `1998` <int>, `1997` <int>, `1996` <int>,
## # `1995` <int>, `1994` <int>, `1993` <int>, `1992` <int>, `1991` <int>,
## # `1990` <int>, `1989` <int>, `1988` <int>, `1987` <int>, `1986` <int>,
## # `1985` <int>
#many other solutions exist, but I use these two frequently
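Two tidyverse alternatives worth knowing (a sketch with a toy tibble, not the TTC data): tidyr::drop_na() drops rows with an NA in any column, while filter() can target a single column.

```r
library(tidyverse)

fares <- tibble(fare = c("TOKENS", "PRESTO", "ADULT"),
                y2015 = c(110945, 13323, NA))

fares %>% drop_na()             # drops any row containing an NA
fares %>% filter(!is.na(y2015)) # drops NAs in one column only
```

With a single data column the two behave identically; with many columns, filter() is the surgical option.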
dplyr provides select (columns/vectors), filter (rows/observations), and now pull (returns a vector, not a dataframe or tibble). These are fantastic functions (and beasts) for wrangling out specifics or simplifying your dataframe.
survey <- read_csv("data/5081.survey.1.csv")
## Parsed with column specification:
## cols(
## r.experience = col_integer(),
## discipline = col_character(),
## research.data = col_character(),
## r.studio = col_character()
## )
survey
## # A tibble: 18 x 4
## r.experience discipline research.data r.studio
## <int> <chr> <chr> <chr>
## 1 2 physiology quantitative No
## 2 1 genetics quantitative Yes
## 3 2 physiology quantitative Yes
## 4 3 physiology quantitative Yes
## 5 1 ecology quantitative No
## 6 1 physiology quantitative No
## 7 3 ecology quantitative Yes
## 8 1 ecology qualitative No
## 9 1 ecology quantitative Yes
## 10 3 genetics quantitative No
## 11 3 ecology quantitative Yes
## 12 1 ecology quantitative No
## 13 1 physiology quantitative No
## 14 2 physiology quantitative Yes
## 15 1 environmental science quantitative Yes
## 16 1 physiology quantitative No
## 17 1 physiology quantitative No
## 18 2 physiology quantitative Yes
#I want just a simple tibble with experience by discipline for students new to rstats in a graduate-level course.
experience <- survey %>% select(discipline, r.experience)
experience
## # A tibble: 18 x 2
## discipline r.experience
## <chr> <int>
## 1 physiology 2
## 2 genetics 1
## 3 physiology 2
## 4 physiology 3
## 5 ecology 1
## 6 physiology 1
## 7 ecology 3
## 8 ecology 1
## 9 ecology 1
## 10 genetics 3
## 11 ecology 3
## 12 ecology 1
## 13 physiology 1
## 14 physiology 2
## 15 environmental science 1
## 16 physiology 1
## 17 physiology 1
## 18 physiology 2
#Now I just want to select the physiology folks
physiologists <- experience %>% filter(discipline == "physiology")
physiologists
## # A tibble: 9 x 2
## discipline r.experience
## <chr> <int>
## 1 physiology 2
## 2 physiology 2
## 3 physiology 3
## 4 physiology 1
## 5 physiology 1
## 6 physiology 2
## 7 physiology 1
## 8 physiology 1
## 9 physiology 2
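pull() is worth a quick demonstration too: unlike select(), it returns a bare vector, which is handy when a downstream function such as mean() expects one. A sketch with made-up scores mirroring the physiologists above, not the actual survey data:

```r
library(dplyr)

phys <- tibble(discipline = "physiology",
               r.experience = c(2, 2, 3, 1, 1, 2, 1, 1, 2))

phys %>% select(r.experience) # still a one-column tibble
phys %>% pull(r.experience)   # a bare numeric vector, ready for mean() etc.
```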
#Selections also often include a summary by level, or you want to make a new column with some calculations. Think about what you have likely done in Excel.
#use pipes to make a nice summary table
experience <- survey %>%
  group_by(discipline) %>%
  summarise(count = n(),
            exp = mean(r.experience))
Next, explore the Star Wars dataset shipped with the updated dplyr 0.7.0 package within the tidyverse and make a new column. The release is described in full on its own blog post.
library(tidyverse)
data <- starwars #setup as tibble
data #quick look
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color
## <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## 4 Darth Vader 202 136 none white yellow
## 5 Leia Organa 150 49 brown light brown
## 6 Owen Lars 178 120 brown, grey light blue
## 7 Beru Whitesun lars 165 75 brown light blue
## 8 R5-D4 97 32 <NA> white, red red
## 9 Biggs Darklighter 183 84 black light brown
## 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
#simplify species
data <- data %>% mutate(taxa = ifelse(species == "Droid", "machine", "biological"))
data <- data[!is.na(data$taxa), ] #remove NAs from one vector only
#note: nested ifelse calls get messy and long-winded quickly. The dplyr 0.7.0 release notes suggest the case_when function instead.
#count up by gender
counts <- data %>% group_by(gender) %>% count()
counts <- counts[!is.na(counts$gender), ] #remove NAs from one vector only
counts$gender <- factor(counts$gender, levels = counts$gender[order(counts$n)])
#I love the count and tally functions! There is also now a new set of similar functions: add_count() and add_tally() for adding an n column within groups
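To make the case_when suggestion concrete, here is the mutate from above rewritten with case_when (a sketch on a toy tibble with made-up rows, so it runs without the full starwars data):

```r
library(dplyr)

sw <- tibble(name = c("Luke Skywalker", "C-3PO", "Jabba Desilijic Tiure"),
             species = c("Human", "Droid", "Hutt"))

sw %>% mutate(taxa = case_when(
  is.na(species)     ~ NA_character_,  # keep unknowns as NA explicitly
  species == "Droid" ~ "machine",
  TRUE               ~ "biological"    # everything else
))
```

Conditions are checked top to bottom and the first match wins, so the NA case goes first; each branch reads as one line instead of another layer of nesting.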
Data viz
#library(plotly)
p <- ggplot(data, aes(height, mass, color = taxa)) + geom_point(size = 3, alpha = 0.5) +
scale_color_brewer(palette = "Dark2")
#ggplotly(p)
p
p <- ggplot(counts, aes(gender, n)) + geom_bar(stat = "identity", fill = "forestgreen", width = 0.25) + coord_flip()
p
Data are likely to be distributed amongst different tables and files. The updated dplyr package (and its tidyverse friends) provides an excellent set of verbs to address this challenge.
Take a moment and review the brief blog post here describing the two categories of tools. Now, on your own, examine two datasets using the three families of tools (mutating joins, filtering joins, and set operations). Select ONE of the following challenges.
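As a warm-up before the challenges, here are the join families on two toy tibbles (the values are made up, echoing the SEAMAP column names but not its data):

```r
library(dplyr)

catches <- tibble(CRUISEID = c(1, 1, 2, 4),
                  GENUS_BGS = c("ANCHOA", "CARANX", "SPHYRNA", "SARDINE"))
cruises <- tibble(CRUISEID = c(1, 2, 3),
                  YR = c(1982, 1982, 1983))

left_join(catches, cruises, by = "CRUISEID") # mutating join: adds YR to every catch
semi_join(catches, cruises, by = "CRUISEID") # filtering join: catches with a matching cruise
anti_join(catches, cruises, by = "CRUISEID") # filtering join: catches without a match
```

Mutating joins add columns from the second table; filtering joins only keep or drop rows of the first table, never adding columns, which makes anti_join a great tool for spotting orphaned records.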
SEAMAP Challenge
seamap <- read_csv("data/BGSREC.csv") #all meta-data listed in the PDF in directory
seamap
## # A tibble: 756,831 x 20
## BGSID CRUISEID STATIONID VESSEL CRUISE_NO P_STA_NO CATEGORY GENUS_BGS
## <int> <int> <int> <int> <int> <chr> <int> <chr>
## 1 1 581 4 4 256 00004 3 RHIZOPR
## 2 2 581 4 4 256 00004 3 SPHYRNA
## 3 3 581 4 4 256 00004 3 HARENGU
## 4 4 581 4 4 256 00004 3 OPISTHO
## 5 5 581 4 4 256 00004 3 SARDINE
## 6 6 581 4 4 256 00004 3 ANCHOA
## 7 7 581 4 4 256 00004 3 SYNODUS
## 8 8 581 4 4 256 00004 3 SPHYRAE
## 9 9 581 4 4 256 00004 3 ALECTIS
## 10 10 581 4 4 256 00004 3 CARANX
## # ... with 756,821 more rows, and 12 more variables: SPEC_BGS <chr>,
## # BGSCODE <chr>, CNT <int>, CNTEXP <int>, SAMPLE_BGS <dbl>,
## # SELECT_BGS <dbl>, BIO_BGS <int>, NODC_BGS <int>, IS_SAMPLE <chr>,
## # TAXONID <chr>, INVRECID <chr>, X20 <chr>
cruises <- read_csv("data/CRUISES.csv")
cruises
## # A tibble: 941 x 12
## CRUISEID YR SOURCE VESSEL CRUISE_NO STARTCRU ENDCRU
## <int> <int> <chr> <chr> <chr> <date> <date>
## 1 1 1982 US 04 125 1982-02-24 1982-03-31
## 2 2 1982 US 04 126 1982-04-15 1982-05-25
## 3 3 1982 FL 09 821 1982-05-16 1982-05-16
## 4 4 1982 LA 25 821 1982-06-01 1982-07-13
## 5 5 1982 US 04 127 1982-06-01 1982-07-14
## 6 6 1982 MS 17 821 1982-06-06 1982-06-10
## 7 7 1982 AL 23 821 1982-06-08 1982-06-03
## 8 8 1982 FL 09 822 1982-06-08 1982-06-13
## 9 9 1982 US 20 823 1982-06-16 1982-07-06
## 10 10 1982 FL 09 823 1982-06-20 1982-06-22
## # ... with 931 more rows, and 5 more variables: TITLE <chr>, NOTE <int>,
## # INGEST_SOURCE <chr>, INGEST_PROGRAM_VER <chr>, X12 <chr>
Whole-brain Challenge
brains <- read_csv("data/brains.csv")
brains
## # A tibble: 14 x 8
## ID analytical practical relational experimental hair handedness
## <int> <int> <int> <int> <int> <chr> <chr>
## 1 1 5 4 4 7 brown right
## 2 2 6 5 5 4 brown right
## 3 3 2 4 9 5 other right
## 4 4 6 5 6 3 blonde right
## 5 5 6 2 5 7 blonde right
## 6 6 6 3 3 5 blonde right
## 7 7 4 3 4 6 blonde right
## 8 8 8 6 4 2 other right
## 9 9 6 7 3 4 brown right
## 10 10 4 6 4 6 red right
## 11 11 5 4 5 6 brown right
## 12 12 5 5 5 5 brown right
## 13 13 7 3 5 5 other right
## 14 14 3 5 5 7 brown left
## # ... with 1 more variables: languages <int>
ninja.scores <- read_csv("data/ninja.scores.csv")
ninja.scores
## # A tibble: 25 x 2
## ID mean.ninja
## <int> <dbl>
## 1 1 3.142857
## 2 2 2.428571
## 3 3 3.285714
## 4 4 3.428571
## 5 5 2.142857
## 6 6 2.857143
## 7 7 3.000000
## 8 8 3.142857
## 9 9 2.142857
## 10 10 3.571429
## # ... with 15 more rows