Introduction

Getting data is a critical step in most research, yet it can be one of the most difficult and time-consuming. This is especially true in synthesis research, which may incorporate hundreds of thousands of datasets in the analysis.

I just ran across this last week:

The first report of the Open Research Data Task Force has found that two of the greatest challenges to effectively using open research data are that: even when it is notionally accessible researchers often simply cannot find that data, and if they do find it they cannot use it because of frustrating format variabilities and other compatibility issues.

From: http://www2.warwick.ac.uk/newsandevents/pressreleases/task_force_finds/

Learning Outcomes

Open data

Data can come from many sources. On a continuum from least good to most good, we might have:

A really great list of R packages for getting at open data can be found here:

So what is open data? Open data is data that are:

What is rOpenSci?

From https://ropensci.org/:

At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists.

Package categories:

Full list of packages: https://ropensci.org/packages/

Many of these are on CRAN and can be installed via install.packages(), but some are not. rOpenSci addresses the issues raised in the quote at the top of this page: finding open data, and coping with inconsistent formats once you have it.
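As a rough sketch of the install step, the snippet below checks which of the packages loaded later in this lesson are missing from your library. The helper name missing_pkgs() is made up for illustration, and the install line is left commented out so that nothing is downloaded unless you choose to.

```r
# Packages loaded in the examples later in this lesson
lesson_pkgs <- c("mregions", "leaflet", "rplos", "rnaturalearth", "sp",
                 "ggplot2", "rfishbase", "dplyr", "taxize", "rnoaa")

# Hypothetical helper (not part of any package): return the names of
# packages that cannot be found in the current library
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(lesson_pkgs)
to_install

# Uncomment to install whatever is missing from CRAN. Packages that are
# not on CRAN need another route, so check each package's own README.
# install.packages(to_install)
```

Keeping the check separate from the install makes it easy to see what is missing before anything is changed on your machine.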

Overview of some of the interesting packages rOpenSci provides

Let’s go through a few of the packages sponsored by rOpenSci to demonstrate the power of open data + APIs + R.

mregions: Tools to get marine regions data from www.marineregions.org

library(mregions)
library(leaflet)

res2 <- mr_shp(key = "MarineRegions:eez_iho_union_v2", maxFeatures = 5)

leaflet() %>%
  addProviderTiles(provider = 'OpenStreetMap') %>%
  addPolygons(data = res2)

rplos: R client for the PLoS Journals API

library(rplos)
searchplos(q='everything:"gulf of mexico"', fl='title', fq='doc_type:full', limit=10)
## Warning in flatten_bindable(dots_values(...)): '.Random.seed' is not an
## integer vector but of type 'NULL', so ignored
## $meta
##   numFound start maxScore
## 1     1050     0       NA
## 
## $data
##                                                                                                                                                                             title
## 1                                                                         Correction: Fish Sound Production in the Presence of Harmful Algal Blooms in the Eastern Gulf of Mexico
## 2  Correction: Trophic Ecology of Atlantic Bluefin Tuna (Thunnusthynnus) Larvae from the Gulf of Mexico and NW Mediterranean Spawning Grounds: A Comparative Stable Isotope Study
## 3                             Correction: Horizontal Movements, Migration Patterns, and Population Structure of Whale Sharks in the Gulf of Mexico and Northwestern Caribbean Sea
## 4                                                                            Words Analysis of Online Chinese News Headlines about Trending Events: A Complex Network Perspective
## 5              Trophic Ecology of Atlantic Bluefin Tuna (Thunnusthynnus) Larvae from the Gulf of Mexico and NW Mediterranean Spawning Grounds: A Comparative Stable Isotope Study
## 6                                                               First Autonomous Bio-Optical Profiling Float in the Gulf of Mexico Reveals Dynamic Biogeochemistry in Deep Waters
## 7                                 Genetic Connectivity in Scleractinian Corals across the Northern Gulf of Mexico: Oil/Gas Platforms, and Relationship to the Flower Garden Banks
## 8                                                                                        Atlantic Bluefin Tuna: A Novel Multistock Spatial Model for Assessing Population Biomass
## 9                                                                                      Potential Connectivity of Coldwater Black Coral Communities in the Northern Gulf of Mexico
## 10                      Temporal and spatial comparisons of the reproductive biology of northern Gulf of Mexico (USA) red snapper (Lutjanus campechanus) collected a decade apart
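The printed result is a list with two parts, meta and data, so the usual list and data frame tools apply. The sketch below does not call the API. It rebuilds a small stand-in by hand from three of the titles printed above, purely to show how the pieces can be pulled apart. With a live result you would assign the output of searchplos() to res instead.

```r
# Hand-built stand-in that mimics the shape of the output printed above.
# Only three of the ten titles are copied here.
res <- list(
  meta = data.frame(numFound = 1050, start = 0, maxScore = NA),
  data = data.frame(
    title = c(
      "Correction: Fish Sound Production in the Presence of Harmful Algal Blooms in the Eastern Gulf of Mexico",
      "First Autonomous Bio-Optical Profiling Float in the Gulf of Mexico Reveals Dynamic Biogeochemistry in Deep Waters",
      "Atlantic Bluefin Tuna: A Novel Multistock Spatial Model for Assessing Population Biomass"
    ),
    stringsAsFactors = FALSE
  )
)

# Total number of matching documents reported by the search
res$meta$numFound

# Drop the "Correction:" notices and keep the original articles
corrections <- grepl("^Correction:", res$data$title)
originals <- res$data$title[!corrections]
originals
```

Note that numFound is the total number of matches, while data holds only as many rows as the limit argument asked for.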

rnaturalearth: an R package to hold and facilitate interaction with natural earth map data

library(rnaturalearth)
library(sp)
library(ggplot2)

# Plot the countries of the world
plot(ne_countries())

# Get the 110m coastline shapefile and make a plot of the Gulf of Mexico
coastline <- ne_download(scale = 110, type = 'coastline', category = 'physical')
## OGR data source with driver: ESRI Shapefile 
## Source: "/tmp/Rtmp1gIIsZ", layer: "ne_110m_coastline"
## with 134 features
## It has 2 fields
## Integer64 fields read as strings:  scalerank
ggplot(coastline, aes(long, lat, group = group)) + 
  geom_path() +
  xlim(-120, -50) +
  ylim(0, 40)
## Warning: Removed 161 rows containing missing values (geom_path).

rfishbase: R interface to the fishbase.org database

library(rfishbase)

fish <- common_to_sci("grouper")

species_list <- species(fish)

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(species_list, aes(Length, Weight)) + geom_point()
## Warning: Removed 36 rows containing missing values (geom_point).

species_list %>%
  group_by(Genus) %>% 
  summarize(MeanVulnerability = mean(Vulnerability)) %>% 
  ggplot() + 
  geom_col(aes(Genus, MeanVulnerability)) + 
  coord_flip()
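One thing to watch for in a summary like this: mean() returns NA for any group that contains a missing value. Whether the Vulnerability column actually has gaps is not something the output above tells us, so the sketch below uses a made-up toy table (the genus names and numbers are invented) to show the difference that na.rm = TRUE makes.

```r
library(dplyr)

# Invented toy data, not FishBase records
toy <- data.frame(
  Genus = c("GenusA", "GenusA", "GenusB", "GenusB"),
  Vulnerability = c(40, 60, 30, NA)
)

# Default behaviour: a single NA makes the whole group mean NA
strict <- toy %>%
  group_by(Genus) %>%
  summarize(MeanVulnerability = mean(Vulnerability))

# With na.rm = TRUE the mean is taken over the values that are present
lenient <- toy %>%
  group_by(Genus) %>%
  summarize(MeanVulnerability = mean(Vulnerability, na.rm = TRUE))

strict
lenient
```

If genera go missing from the bar chart, adding na.rm = TRUE to the mean() call is the first thing to try.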

taxize: A taxonomic toolbelt for R

library(taxize)
## 
## Attaching package: 'taxize'
## The following object is masked from 'package:rfishbase':
## 
##     synonyms
classification("Chironomus riparius", db = "itis")
## 
## Retrieving data for taxon 'Chironomus riparius'
## $`Chironomus riparius`
##                   name         rank     id
## 1             Animalia      kingdom 202423
## 2            Bilateria   subkingdom 914154
## 3          Protostomia infrakingdom 914155
## 4            Ecdysozoa  superphylum 914158
## 5           Arthropoda       phylum  82696
## 6             Hexapoda    subphylum 563886
## 7              Insecta        class  99208
## 8            Pterygota     subclass 100500
## 9             Neoptera   infraclass 563890
## 10        Holometabola   superorder 914213
## 11             Diptera        order 118831
## 12          Nematocera     suborder 118832
## 13        Culicomorpha   infraorder 125808
## 14        Chironomidae       family 127917
## 15        Chironominae    subfamily 129228
## 16         Chironomini        tribe 129229
## 17          Chironomus        genus 129254
## 18 Chironomus riparius      species 129313
## 
## attr(,"class")
## [1] "classification"
## attr(,"db")
## [1] "itis"
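As the printout shows, the result is a list named by taxon, and each element is a table with name, rank and id columns. That makes it easy to pull out a single rank. The sketch below avoids a network call by copying a few of the rows printed above into a stand-in object. With a live result you would use the output of classification() in place of cls.

```r
# Hand-copied subset of the rows printed above (a live call returns all 18).
# The ids are stored as text here; that choice does not affect the lookup.
cls <- list(
  "Chironomus riparius" = data.frame(
    name = c("Animalia", "Insecta", "Diptera", "Chironomidae", "Chironomus"),
    rank = c("kingdom", "class", "order", "family", "genus"),
    id   = c("202423", "99208", "118831", "127917", "129254"),
    stringsAsFactors = FALSE
  )
)

# Pick out the table for one taxon, then filter it by rank
lineage <- cls[["Chironomus riparius"]]
family <- lineage$name[lineage$rank == "family"]
family
```

The same pattern works for any rank in the table, which is handy when you need to group a long species list by family or order.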

rnoaa: R interface to many NOAA data APIs

Access data like:

  • Air temps
  • Sea ice extent
  • Buoy data
  • Tons more!

library(rnoaa)

# Go here: http://www.ndbc.noaa.gov/
# Find a station ID, like http://www.ndbc.noaa.gov/station_page.php?station=42039
bd <- buoy(dataset = "cwind",  buoyid = 42039, datatype = "cc")
## Using cc2008.nc
plot(bd$data$wind_spd)

Summary

Resources