Getting data is a critical step in most research, yet it can be one of the most difficult and time-consuming. This is especially true in synthesis research, which may incorporate hundreds of thousands of datasets in the analysis.
I just ran across this last week:
The first report of the Open Research Data Task Force has found that two of the greatest challenges to effectively using open research data are that: even when it is notionally accessible researchers often simply cannot find that data, and if they do find it they cannot use it because of frustrating format variabilities and other compatibility issues.
From: http://www2.warwick.ac.uk/newsandevents/pressreleases/task_force_finds/
Data can come from many sources. On a continuum from worst to best, we might have:
A really great list of R packages for getting at open data can be found here:
So what is open data? Open data is data that are:
From https://ropensci.org/:
At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists.
Package categories:
Full list of packages: https://ropensci.org/packages/
Many of these are on CRAN and can be installed via install.packages(), but some are not. rOpenSci addresses both issues raised in the Task Force quote above: finding the data and getting it into a usable format.
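Before running the examples, it can help to check which packages are already installed. The following is a minimal sketch, not an official rOpenSci workflow: `missing_packages()` is a hypothetical helper, and the `owner/repo` string in the last comment is a placeholder rather than a real repository.

```r
# Return the subset of `pkgs` that cannot be found in the current library paths.
# system.file() returns "" for a package that is not installed.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!found]
}

# The packages used in the examples on this page.
wanted <- c("mregions", "rplos", "rnaturalearth", "rfishbase", "taxize", "rnoaa")
to_install <- missing_packages(wanted)
to_install

# Installation is left commented out so the sketch has no side effects.
# install.packages(to_install)            # for packages that are on CRAN
# remotes::install_github("owner/repo")   # placeholder for a package that is not
```

Packages that are not on CRAN are usually installed from their source repository, which is why the last line uses the remotes package.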
Let’s go through a few of the packages sponsored by rOpenSci to demonstrate the power of open data + APIs + R.
mregions
: Tools to get marine regions data from www.marineregions.org
library(mregions)
library(leaflet)
res2 <- mr_shp(key = "MarineRegions:eez_iho_union_v2", maxFeatures = 5)
leaflet() %>%
  addProviderTiles(provider = 'OpenStreetMap') %>%
  addPolygons(data = res2)
rplos
: R client for the PLoS Journals API
library(rplos)
searchplos(q='everything:"gulf of mexico"', fl='title', fq='doc_type:full', limit=10)
## Warning in flatten_bindable(dots_values(...)): '.Random.seed' is not an
## integer vector but of type 'NULL', so ignored
## $meta
## numFound start maxScore
## 1 1050 0 NA
##
## $data
## title
## 1 Correction: Fish Sound Production in the Presence of Harmful Algal Blooms in the Eastern Gulf of Mexico
## 2 Correction: Trophic Ecology of Atlantic Bluefin Tuna (Thunnusthynnus) Larvae from the Gulf of Mexico and NW Mediterranean Spawning Grounds: A Comparative Stable Isotope Study
## 3 Correction: Horizontal Movements, Migration Patterns, and Population Structure of Whale Sharks in the Gulf of Mexico and Northwestern Caribbean Sea
## 4 Words Analysis of Online Chinese News Headlines about Trending Events: A Complex Network Perspective
## 5 Trophic Ecology of Atlantic Bluefin Tuna (Thunnusthynnus) Larvae from the Gulf of Mexico and NW Mediterranean Spawning Grounds: A Comparative Stable Isotope Study
## 6 First Autonomous Bio-Optical Profiling Float in the Gulf of Mexico Reveals Dynamic Biogeochemistry in Deep Waters
## 7 Genetic Connectivity in Scleractinian Corals across the Northern Gulf of Mexico: Oil/Gas Platforms, and Relationship to the Flower Garden Banks
## 8 Atlantic Bluefin Tuna: A Novel Multistock Spatial Model for Assessing Population Biomass
## 9 Potential Connectivity of Coldwater Black Coral Communities in the Northern Gulf of Mexico
## 10 Temporal and spatial comparisons of the reproductive biology of northern Gulf of Mexico (USA) red snapper (Lutjanus campechanus) collected a decade apart
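The printed result is a list with a `meta` element and a `data` element, and several of the returned titles are corrections rather than original articles. Here is a minimal sketch of working with that structure. It uses a small mock list built from three of the titles above, so it runs without network access. `drop_corrections()` is a hypothetical helper, not part of rplos.

```r
# Mock of the list shape printed above. With rplos installed, `res` would
# instead be the value returned by searchplos(...).
res <- list(
  meta = data.frame(numFound = 1050, start = 0),
  data = data.frame(
    title = c(
      "Correction: Fish Sound Production in the Presence of Harmful Algal Blooms in the Eastern Gulf of Mexico",
      "First Autonomous Bio-Optical Profiling Float in the Gulf of Mexico Reveals Dynamic Biogeochemistry in Deep Waters",
      "Potential Connectivity of Coldwater Black Coral Communities in the Northern Gulf of Mexico"
    ),
    stringsAsFactors = FALSE
  )
)

# Keep only titles that do not start with "Correction:".
drop_corrections <- function(titles) {
  titles[!grepl("^Correction:", titles)]
}

res$meta$numFound                      # total number of matching records
kept <- drop_corrections(res$data$title)
length(kept)                           # number of titles left after filtering
```

Note that `numFound` counts every matching record, while `limit` controls how many are actually returned in `data`.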
rnaturalearth
: an R package to hold and facilitate interaction with natural earth map data
library(rnaturalearth)
library(sp)
library(ggplot2)
# Plot the countries of the world
plot(ne_countries())
# Get the 110m coastline shapefile and make a plot of the Gulf of Mexico
coastline <- ne_download(scale = 110, type = 'coastline', category = 'physical')
## OGR data source with driver: ESRI Shapefile
## Source: "/tmp/Rtmp1gIIsZ", layer: "ne_110m_coastline"
## with 134 features
## It has 2 fields
## Integer64 fields read as strings: scalerank
ggplot(coastline, aes(long, lat, group = group)) +
  geom_path() +
  xlim(-120, -50) +
  ylim(0, 40)
## Warning: Removed 161 rows containing missing values (geom_path).
rfishbase
: R interface to the fishbase.org database
library(rfishbase)
fish <- common_to_sci("grouper")
species_list <- species(fish)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(species_list, aes(Length, Weight)) + geom_point()
## Warning: Removed 36 rows containing missing values (geom_point).
species_list %>%
  group_by(Genus) %>%
  summarize(MeanVulnerability = mean(Vulnerability)) %>%
  ggplot() +
  geom_col(aes(Genus, MeanVulnerability)) +
  coord_flip()
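One thing to watch for in a grouped mean like this: if any species in a genus has a missing Vulnerability value, `mean()` returns `NA` for the whole genus. Whether that happens in the FishBase data is an assumption here, since the output above only confirms missing values in Length or Weight. The sketch below uses made-up numbers and base R's `tapply()` to show the effect of `na.rm = TRUE`.

```r
# Toy values for illustration only; these are not FishBase data.
toy <- data.frame(
  Genus = c("A", "A", "B", "B"),
  Vulnerability = c(40, 60, 30, NA)
)

# Without na.rm, a single NA makes the mean for genus B missing.
with_na <- tapply(toy$Vulnerability, toy$Genus, mean)

# With na.rm = TRUE, the mean is taken over the non-missing values only.
without_na <- tapply(toy$Vulnerability, toy$Genus, mean, na.rm = TRUE)

with_na
without_na
```

The same argument can be passed inside `summarize()`, as `mean(Vulnerability, na.rm = TRUE)`.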
taxize
: A taxonomic toolbelt for R
library(taxize)
##
## Attaching package: 'taxize'
## The following object is masked from 'package:rfishbase':
##
## synonyms
classification("Chironomus riparius", db = "itis")
##
## Retrieving data for taxon 'Chironomus riparius'
## $`Chironomus riparius`
## name rank id
## 1 Animalia kingdom 202423
## 2 Bilateria subkingdom 914154
## 3 Protostomia infrakingdom 914155
## 4 Ecdysozoa superphylum 914158
## 5 Arthropoda phylum 82696
## 6 Hexapoda subphylum 563886
## 7 Insecta class 99208
## 8 Pterygota subclass 100500
## 9 Neoptera infraclass 563890
## 10 Holometabola superorder 914213
## 11 Diptera order 118831
## 12 Nematocera suborder 118832
## 13 Culicomorpha infraorder 125808
## 14 Chironomidae family 127917
## 15 Chironominae subfamily 129228
## 16 Chironomini tribe 129229
## 17 Chironomus genus 129254
## 18 Chironomus riparius species 129313
##
## attr(,"class")
## [1] "classification"
## attr(,"db")
## [1] "itis"
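The output above is a named list with one data frame per taxon, holding `name`, `rank`, and `id` columns. Here is a minimal sketch of pulling a single rank out of that structure. It uses a mock list containing a subset of the rows shown above, so it runs without taxize or network access. `name_at_rank()` is a hypothetical helper, not a taxize function, and storing the ids as character strings is an assumption.

```r
# Mock of part of the output above. With taxize installed, `cls` would be the
# value returned by classification("Chironomus riparius", db = "itis").
cls <- list(
  "Chironomus riparius" = data.frame(
    name = c("Animalia", "Arthropoda", "Insecta", "Diptera",
             "Chironomidae", "Chironomus", "Chironomus riparius"),
    rank = c("kingdom", "phylum", "class", "order",
             "family", "genus", "species"),
    id   = c("202423", "82696", "99208", "118831",
             "127917", "129254", "129313"),
    stringsAsFactors = FALSE
  )
)

# For every taxon in the list, return the name recorded at `rank`,
# or NA when the entry is not a data frame or the rank is absent.
name_at_rank <- function(cls, rank) {
  vapply(cls, function(df) {
    if (!is.data.frame(df)) return(NA_character_)
    hit <- df$name[df$rank == rank]
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1))
}

name_at_rank(cls, "family")
```

Because the helper is vectorised over the list, the same call would work for a result covering several species.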
rnoaa
: R interface to many NOAA data APIs
Access data like:
library(rnoaa)
# Go here: http://www.ndbc.noaa.gov/
# Find a station ID, like http://www.ndbc.noaa.gov/station_page.php?station=42039
bd <- buoy(dataset = "cwind", buoyid = 42039, datatype = "cc")
## Using cc2008.nc
plot(bd$data$wind_spd)