Managing your data

Data is a valuable asset!

Data should be managed to:

  • Maximize the effective use and value of data
  • Continually improve the quality (accuracy, integrity, integration,…)
  • Ensure appropriate use of data and information
  • Facilitate data sharing
  • Ensure sustainability and accessibility in the long term for re-use in science

Data Life Cycle

The data life cycle provides a high-level overview of the stages involved in the successful management and preservation of data for use and reuse.

Data Management Plans (DMP)

A data management plan describes how you will manage your data during the lifetime of a research project. The process of creating your DMP will force you to think about potential issues related to the project's data that could affect timeline, costs and personnel needed.

Goal: Answer these 4 questions:

  1. How much data will be collected and how will it be treated?
  2. How much time is needed to manage the data and who will be responsible for doing so?
  3. How long should the data be preserved and where is the best location to do so?
  4. Are there any legal constraints associated with acquiring, using and sharing project data?

from Recknagel and Michener. "Ecological Informatics", 2017

Don't lose your data

Accidents happen!

Document: The Sooner the Better

Document and preserve your data while you are actively analyzing them!

Recknagel and Michener, 2017

Document soon: it is also for yourself

If you document early, you will not have to remember:

  • The name of that file
  • The directory where you put it
  • The units those measurements were taken in
  • Which sample site was which
  • Whether it is the cleaned version of the data set used for publication

=> Easier to share with others, good for collaborations!
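
One way to record these details is a small data dictionary stored next to the data file. The sketch below is only an illustration: the file name, directory, column names, units and site labels are all hypothetical, and JSON is just one possible format for such a description.

```python
import json

# Hypothetical data dictionary; every name and value here is made up for illustration.
data_dictionary = {
    "file": "soil_temperature_clean.csv",
    "location": "data/processed/",
    "status": "cleaned version used for publication",
    "columns": {
        "site_id": {"description": "sample site code", "units": None},
        "date": {"description": "sampling date, ISO 8601", "units": None},
        "soil_temp": {
            "description": "soil temperature at 10 cm depth",
            "units": "degrees Celsius",
        },
    },
    "sites": {"A": "north plot", "B": "south plot"},
}


def describe_column(dictionary, column):
    """Return a one-line, human-readable description of a column."""
    entry = dictionary["columns"][column]
    units = entry["units"]
    suffix = f" [{units}]" if units else ""
    return f"{column}: {entry['description']}{suffix}"


# Serialize the dictionary so it can be saved alongside the data file.
sidecar_text = json.dumps(data_dictionary, indent=2)

print(describe_column(data_dictionary, "soil_temp"))
```

Each entry answers one of the questions above (file name, directory, units, sites, version), so a collaborator can read the description without asking you.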

Not only Data

We have mainly been talking about data, but these rules apply to all the scientific processes and products generated by a research project, including:

  • Your scientific workflow
  • Your scripts developed to manipulate and analyse the data
  • The models and tools you used or developed

Data Modeling for Data Reuse

Learning Objectives

  • Understand basics of data models
  • Learn how to design and create effective data tables

Benefits of normalized or "tidy" data

Analytical:

  • Powerful search and filtering
  • Handle large and/or complex data sets
  • Help to enforce data integrity
  • Easier to handle data updates

=> Easier to conduct your analysis, and even more so for others!

Preservation:

  • Easier to describe
  • Easier to automate metadata creation
  • Easier to implement quality checks
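
To make these benefits concrete, here is a minimal sketch of reshaping a "wide" table into a tidy one, using only the Python standard library. The table, the column names and the plausibility range are hypothetical, and `to_tidy` is an illustrative helper, not a function from any library.

```python
# Hypothetical "wide" table: one row per site, one column per year.
wide_rows = [
    {"site": "A", "2015": 12.1, "2016": 13.4},
    {"site": "B", "2015": 9.8, "2016": 10.2},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "site", "year", "temperature")

# Filtering becomes a simple condition on column values.
rows_2016 = [r for r in tidy_rows if r["year"] == "2016"]

# A quality check can be applied uniformly to every observation
# (the range -50 to 60 is an assumed plausibility limit).
out_of_range = [r for r in tidy_rows if not -50 <= r["temperature"] <= 60]
```

In the tidy form, adding a new year means adding rows rather than a new column, so the filter and the quality check keep working unchanged.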

Data Heterogeneity

Data are heterogeneous in:

  • Structure (schema): Logical model of the data (e.g., tables, hierarchical trees, raster images, etc.)
  • Semantics: Specific meaning of the data (e.g., nature and types of measurements, importance of contextual information, interpretation of record structure, etc.): documentation
  • Syntax: Digital format of the data (e.g., csv, “R data frame”, NetCDF, Excel XLSX, DBF, etc.)
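
The difference between structure and syntax can be illustrated with a small sketch: the same hypothetical two-row table is written once as CSV and once as JSON, then read back into identical records. The site labels and values are made up for illustration.

```python
import csv
import io
import json

# The same hypothetical measurements, written in two different syntaxes.
csv_text = "site,temperature\nA,12.1\nB,9.8\n"
json_text = '[{"site": "A", "temperature": 12.1}, {"site": "B", "temperature": 9.8}]'


def read_csv_records(text):
    """Parse CSV text; the csv module returns every value as a string."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({"site": row["site"], "temperature": float(row["temperature"])})
    return records


def read_json_records(text):
    """Parse JSON text; numbers are already typed by the format."""
    return json.loads(text)


csv_records = read_csv_records(csv_text)
json_records = read_json_records(json_text)

# Same structure (a table of site/temperature records) despite different syntax.
same_structure = csv_records == json_records
```

Neither file states what "temperature" means or which units it uses. That is the semantics, and it has to come from documentation.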

Why Tabular Data?

Spreadsheets are (still) the primary data entry tool of the digital age!