Data Management and Data Repositories

The Need for Data Management: Big Data

The Need for Data Management: Data Deluge

The Need for Data Management: Data Entropy


The Need for Data Management: Public Perception



“The climate scientists at the centre of a media storm over leaked emails were yesterday cleared of accusations that they fudged their results and silenced critics, but a review found they had failed to be open enough about their work.”

Why Manage Data? Researcher Perspective

  • Keep yourself organized – be able to find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc.)
  • Track your science processes for reproducibility – be able to match up your outputs with exact inputs and transformations that produced them
  • Better control versions of data – easily identify versions that can be periodically purged
  • Quality control your data more efficiently
  • Avoid data loss (e.g., by making backups)
  • Format your data for re-use (by yourself or others)
  • Be prepared: Document your data for your own recollection, accountability, and re-use (by yourself or others)
  • Gain credibility and recognition for your science efforts through data sharing!
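The points above about matching outputs to exact inputs and guarding against data loss can be sketched with a checksum manifest. This is a minimal illustration, not a prescribed workflow: the function names (`sha256_of`, `build_manifest`, `changed_files`) and the file names in the demo are hypothetical, chosen only for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified between two manifests."""
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )


if __name__ == "__main__":
    # Demo in a throwaway directory; the file names are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        (project / "inputs.csv").write_text("site,temp\nA,12.1\n")
        before = build_manifest(project)
        (project / "inputs.csv").write_text("site,temp\nA,12.4\n")
        after = build_manifest(project)
        print(changed_files(before, after))
```

Saving a manifest alongside each analysis run gives a record of which inputs produced which outputs, and comparing a backup's manifest against the working copy shows whether anything was lost or silently altered.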

Why Manage Data? Advancement of Science

  • Data is a valuable asset – it is expensive and time consuming to collect
  • Data should be managed to:
    • maximize the effective use and value of data and information assets
    • continually improve data quality, including accuracy, integrity, integration, timeliness of data capture and presentation, relevance, and usefulness
    • ensure appropriate use of data and information
    • facilitate data sharing
    • ensure long-term sustainability and accessibility for re-use in science

The Data Life Cycle







Barriers to Synthesis

  • Data not preserved
    • Only a tiny proportion of ecological data are readily available
  • Dispersed, isolated repositories
    • Each community has its own; disconnected; underutilized
  • Lack of software interoperability
    • Metacat, DSpace, Mercury, iRODS, XMCat, OPeNDAP, …
  • Heterogeneous data
    • Many data formats, metadata formats, and varying semantics

Solutions

  • Preserve data
  • Adopt standards
  • Create networks
  • Create interoperable software

Synthesis Channels



Data Repositories


  • Arctic Data Center
  • Knowledge Network for Biocomplexity (KNB)
  • DRYAD
  • LTER


Activity

Search for a data repository



Resources

re3data.org

Metadata and Data Heterogeneity

Every community has …

  • many data schemas
    • one for each project and person
  • many data formats
    • ASCII, NetCDF, HDF, GeoTiff, …
  • many metadata schemas
    • Biological Data Profile, Darwin Core, Dublin Core, Ecological Metadata Language (EML), Open GIS schemas, ISO Schemas, …

Accepting this heterogeneity is critical

KNB and the MetaCat Data Server

Diverse Contributors

  • Individual investigators
  • Field stations and networks
  • Agencies, Non-profit partnerships
  • Scientific Societies, Centers

KNB Data Types

  • Ecological
  • Environmental
  • Demographic
  • Social/Legal/Economic

MetaCat

  • Data and metadata management
  • Store, search, and document data
  • Customizable web-based search interface
  • Web metadata entry tool
  • Replication capabilities
  • DOI Support

Making Data Count



What is Metadata?

Who

What

When

Where

Why

How


What are Metadata Good For?

  • Captures Information
  • Enables Discovery
  • Enables Exchange
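As a rough illustration of these three uses, the sketch below captures the who/what/when/where/why/how questions in a single record, searches it for discovery, and serializes it to JSON for exchange. The `MetadataRecord` class, the `matches` helper, and every field value are hypothetical placeholders, not part of any metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MetadataRecord:
    """One field per question that metadata should answer."""
    who: str
    what: str
    when: str
    where: str
    why: str
    how: str


def matches(record: MetadataRecord, term: str) -> bool:
    """Discovery: case-insensitive search across every field of a record."""
    needle = term.lower()
    return any(needle in value.lower() for value in asdict(record).values())


# Capture: all values below are invented for illustration.
record = MetadataRecord(
    who="J. Doe (hypothetical investigator)",
    what="Water temperature measurements from a pond",
    when="2010-06-01/2010-08-31",
    where="Example Pond (placeholder location)",
    why="Track seasonal temperature change",
    how="Hand-held thermometer, daily readings",
)

# Exchange: a plain-text form that other people and tools can read.
exchange_form = json.dumps(asdict(record), indent=2)
```

A free-form record like this works for one project, but each project would pick its own field names, which is the problem that shared metadata standards address.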

Metadata Standards

A Standard provides a structure to describe data with:

  • Common terms to allow consistency between records
  • Common definitions for easier interpretation
  • Common language for ease of communication
  • Common structure to quickly locate information

In search and retrieval, standards provide:

  • Documentation structure in a reliable and predictable format for computer interpretation
  • A uniform summary description of the dataset

Many standards exist

  • Biological Data Profile, Darwin Core, Dublin Core, Ecological Metadata Language (EML), Open GIS schemas, ISO Schemas, …
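To make "common terms" and "common structure" concrete, here is a small sketch that writes a record using a few Dublin Core element names (`title`, `creator`, `date`, `description`, `format`) under the Dublin Core elements namespace. The wrapping `metadata` element, the `dublin_core_record` helper, and all field values are assumptions made for this example; real repositories typically require a fuller schema such as EML.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> ET.Element:
    """Build an XML record with one Dublin Core element per field."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Placeholder values, invented for illustration only.
record = dublin_core_record({
    "title": "Pond water temperature, summer 2010",
    "creator": "J. Doe (hypothetical)",
    "date": "2010",
    "description": "Daily temperature readings from a single pond.",
    "format": "text/csv",
})

xml_text = ET.tostring(record, encoding="unicode")
```

Because the element names come from a shared vocabulary, a search tool can look for `dc:title` or `dc:creator` in any record without knowing how a particular project named its fields.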

Synthesis Channels






Federated Network of Repositories


Diverse Federation: Resilience

  • Failover for temporary outages
  • Insurance against project/institutional failure
  • Avoid correlated failures

Diverse Federation: Scalability

  • Storage increases with Repositories (Member Nodes)
  • Each Member Node bears only an incremental cost to replicate
  • Distributes sustainability costs

Communication across repositories is managed by Coordinating Nodes
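The resilience idea can be sketched as a toy model in which a coordinating node records where replicas live and falls back to another member node when one is offline. This is not the DataONE API: the `MemberNode` and `CoordinatingNode` classes only borrow the slide's terminology, and the replica count, node names, and identifier are assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """A repository that holds copies of data objects."""
    name: str
    online: bool = True
    objects: dict[str, bytes] = field(default_factory=dict)


class CoordinatingNode:
    """Tracks which member nodes hold each object and routes requests."""

    def __init__(self, members: list[MemberNode], replicas: int = 2) -> None:
        self.members = members
        self.replicas = replicas
        self.locations: dict[str, list[str]] = {}

    def store(self, identifier: str, data: bytes) -> list[str]:
        """Copy the object to the first `replicas` online member nodes."""
        targets = [m for m in self.members if m.online][: self.replicas]
        for node in targets:
            node.objects[identifier] = data
        self.locations[identifier] = [node.name for node in targets]
        return self.locations[identifier]

    def retrieve(self, identifier: str) -> bytes:
        """Return the object from the first replica holder that is online."""
        by_name = {m.name: m for m in self.members}
        for name in self.locations.get(identifier, []):
            node = by_name[name]
            if node.online:
                return node.objects[identifier]
        raise LookupError(f"no online replica for {identifier}")


# Usage: node names and the identifier are placeholders.
nodes = [MemberNode("A"), MemberNode("B"), MemberNode("C")]
coordinator = CoordinatingNode(nodes, replicas=2)
holders = coordinator.store("example-id", b"site,temp\n")
nodes[0].online = False  # simulate a temporary outage at node A
recovered = coordinator.retrieve("example-id")
```

Placing replicas at independent institutions is what keeps failures uncorrelated; in this toy model that would correspond to choosing targets deliberately rather than taking the first nodes in the list.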

DataONE Coordination

DataONE Search


Activity

Search for data



Resources

search.dataone.org

Activity

Upload data to KNB



Resources

dev.nceas.ucsb.edu

~oss/~oss-lessons/publishing-data/oss_Pond2010_Metadata.pdf

~oss/~oss-lessons/publishing-data/pond2010.csv