Data Management and Repositories

The Need for Data Management: Big Data

Data Deluge

Why Manage Data? Advancement of Science

  • Data is a valuable asset – it is expensive and time-consuming to collect
  • Data should be managed to:
    • maximize the effective use and value of data and information assets
    • continually improve data quality, including accuracy, integrity, integration, and timeliness of capture
    • ensure appropriate use of data and information
    • ensure sustainability and accessibility in the long term for re-use in science

The Need for Data Management: Public Perception



“The climate scientists at the centre of a media storm over leaked emails were yesterday cleared of accusations that they fudged their results and silenced critics, but a review found they had failed to be open enough about their work.”

Why Manage Data? Researcher Perspective

  • Keep yourself organized – be able to find your files
  • Track your science processes for reproducibility
  • Quality control your data more efficiently
  • Avoid data loss (e.g. by making backups)
  • Gain credibility and recognition for your science efforts through data sharing!
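
One way to act on the backup and quality-control points above is to record a checksum for every data file and compare the original against its backup copy. The following is a minimal sketch, not a prescribed workflow: the function names (sha256_of, build_manifest, changed_files) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(original: dict[str, str], backup: dict[str, str]) -> list[str]:
    """List files that are missing from the backup or whose contents differ."""
    return sorted(
        name for name, checksum in original.items()
        if backup.get(name) != checksum
    )
```

Calling build_manifest on the working directory and on the backup, then passing both results to changed_files, gives the list of files that need to be copied again or investigated. Saving the manifest alongside the data also leaves a record that supports reproducibility.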

The Need for Data Management: Data Entropy

Michener et al. 1997; Vines et al. 2014

The Data Life Cycle

Data Reuse

Barriers to Data Reuse

  • Data not preserved
    • Only a tiny proportion of ecological data are readily available
  • Dispersed, isolated repositories
    • Each community has its own; disconnected; underutilized
  • Lack of software interoperability
    • Metacat, DSpace, Mercury, iRODS, XMCat, OPeNDAP, …
  • Heterogeneous data
    • Many data formats, metadata formats, and varying semantics

Solutions

  • Preserve data
  • Adopt standards
  • Create networks
  • Create interoperable software

Preserving Data

  • Datasets are preserved with long-term commitment
  • Datasets are versioned and citable
  • Datasets are searchable and discoverable

What is a data repository?

System             Long-term   Versioned   Citable   Discoverable
Google Drive       maybe       maybe       no        no
GitHub             yes         yes         no        no
University Server  maybe       no          no        maybe
KNB                yes         yes         yes       yes

Data Repositories
