Managing and Structuring your Data for Preservation

Managing your data

Data is a valuable asset!

Data should be managed to:

Maximize the effective use and value of data
Continually improve the quality (accuracy, integrity, integration,…)
Ensure appropriate use of data and information
Facilitate data sharing
Ensure sustainability and accessibility in the long term for re-use in science

Data Life Cycle

The data life cycle provides a high-level overview of the stages involved in the successful management and preservation of data for use and reuse.

Data Management Plans (DMP)

A data management plan describes how you will manage your data during the lifetime of a research project. The process of creating your DMP will force you to think about potential issues related to the project's data that could affect timeline, costs and personnel needed.

Data Management Plans (DMP)

Goal: Answer these 4 questions:

How much data will be collected and how will it be treated?
How much time is needed to manage the data and who will be responsible for doing so?
How long should the data be preserved and where is the best location to do so?
Are there any legal constraints associated with acquiring, using and sharing project data?

from Recknagel and Michener. "Ecological Informatics", 2017

Don't lose your data

Accidents happen!

Document: The Sooner the Better

Document and preserve your data while you are actively analyzing them!

Recknagel and Michener, 2017

Document soon: it is also for yourself

You would not have to remember:

The name of that file?
The directory where you put it?
The units those measurements were taken in?
Which sample site was which?
Is it the cleaned version of the data set used for publication?

=> Easier to share with others, good for collaborations!

Not only Data

We have mainly been talking about data, but these rules apply to all the scientific processes and products generated by a research project, including:

Your scientific workflow
Your scripts developed to manipulate and analyse the data
The models and tools you used or developed

Data Modeling for Data Reuse

Learning Objectives

Understand basics of data models
Learn how to design and create effective data tables

Benefits of normalized or "tidy" data

Analytical:

Powerful search and filtering
Handle large and/or complex data sets
Help to enforce data integrity
Easier to handle data updates

=> Easier to conduct your analysis, and even more so for others!

Benefits of normalized or "tidy" data

Preservation:

Easier to describe
Easier to automate metadata creation
Easier to implement quality checks

Data Heterogeneity

Data are heterogeneous in:

Structure (schema): Logical model of the data (e.g., tables, hierarchical trees, raster images, etc.)
Semantics: Specific meaning of the data (e.g., nature and types of measurements, importance of contextual information, interpretation of record structure, etc.): documentation
Syntax: Digital format of the data (e.g., csv, “R data frame”, NetCDF, Excel XLSX, DBF, etc.)

Why Tabular Data?

Spreadsheets are (still) the primary data entry tool of the digital age!

Spreadsheets

the Good:

Quick on the draw (clickety-click and you’re ready for action)
Always there (on most everyone’s computer)
Smarter than he lets on (stats, pivot tables, VB scripts)
Cleans up real pretty (graphics, fonts, colors, borders)

Spreadsheets

the Bad:

Also a fast shooter (click&fire)
No scruples (delete row, click, ctrl-x/ctrl-c, re-sort, save)
Talks a good story, but unreliable (e.g. http://www.practicalstats.com/xlsstats/excelstats.html)

Spreadsheets

the Ugly:

Ill-mannered: takes your data prisoner; conflates raw data with summary data
Gaudy: Use of visual formatting to indicate critical metadata or other semantic tidbits
Shifty: Cross-linking of worksheets sets up “invisible” dependencies
The more complicated your Spreadsheet, the UGLIER it gets in terms of using it with other software

Spreadsheets

=> Encourages you to mix your data and your analysis

Data Organization

Multiple tables

Inconsistent observations

Inconsistent variables

Marginal sums and statistics

=> A spreadsheet is not a table!

Good enough data modeling

Terminological Soup

Table = Relation = Data set (~ Worksheet)

Column = Variable = Attribute = Characteristic

Row = Record = Tuple <> Observation

Keys are used to Join or Merge

Cell = Value = Measurement

Data Model = Schema

Denormalized data (aka non-tidy)

Observations about different entities combined

Tabular data

Observations. A better way to model data is to organize the observations about each type of entity in its own table. This results in:

Separate tables for each type of entity measured
Each row represents a single observed entity
Observations (rows) are all unique

This is normalized data (aka tidy data)
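
As a minimal sketch of this idea, the Python below splits a small denormalized table into one table per entity type. The column names (site_id, site_name, plot, species, count) and the sample values are hypothetical and only serve the illustration.

```python
# Hypothetical denormalized rows: site attributes are repeated on every observation.
denormalized = [
    {"site_id": "S1", "site_name": "North Meadow", "plot": 1, "species": "Poa", "count": 12},
    {"site_id": "S1", "site_name": "North Meadow", "plot": 2, "species": "Poa", "count": 7},
    {"site_id": "S2", "site_name": "River Bank", "plot": 1, "species": "Carex", "count": 3},
]


def normalize(rows):
    """Split combined rows into a sites table and an observations table."""
    sites = {}  # keyed by site_id so that each site is stored exactly once
    observations = []
    for row in rows:
        sites[row["site_id"]] = {
            "site_id": row["site_id"],
            "site_name": row["site_name"],
        }
        # The observation keeps only the site identifier, not the site attributes.
        observations.append({
            "site_id": row["site_id"],
            "plot": row["plot"],
            "species": row["species"],
            "count": row["count"],
        })
    return list(sites.values()), observations


sites, observations = normalize(denormalized)
```

After the split, the site name is stored once per site, so correcting it means changing a single row rather than every observation.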

Tabular data

Variables. In addition, for normalized data, we expect the variables to be organized such that:

All values in a column are of the same type
All columns pertain to the same observed entity (e.g., row)
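
One way to check the first expectation is to list the value types found in each column. The sketch below assumes the table is already loaded as a list of dictionaries; the column names and values are hypothetical.

```python
def column_types(rows):
    """Map each column name to the set of Python type names found in it."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set()).add(type(value).__name__)
    return types


def mixed_type_columns(rows):
    """Return the sorted names of columns that hold more than one type."""
    return sorted(name for name, found in column_types(rows).items() if len(found) > 1)


# Hypothetical table where a text note was typed into a numeric column.
rows = [
    {"plot": 1, "height_cm": 12.5},
    {"plot": 2, "height_cm": "tall"},
    {"plot": 3, "height_cm": 9.0},
]
problems = mixed_type_columns(rows)
```

Here the height_cm column mixes numbers and text, so it is the only one reported.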

Tabular data

How to relate tables?

With normalized data, we often use unique identifiers to reference particular observations, which allows us to link across tables. Two types of identifiers are common within relational data:

Primary Key: unique identifier for each observed entity, one per row
Foreign Key: reference to a primary key in another table (linkage)
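
The sketch below illustrates how the two kinds of keys work together when joining tables. It is a plain-Python illustration, not a database implementation; the tables, column names and the join_on_key helper are hypothetical.

```python
# Hypothetical tables: site_id is the primary key of sites and a foreign key in plot_obs.
sites = [
    {"site_id": "S1", "site_name": "North Meadow"},
    {"site_id": "S2", "site_name": "River Bank"},
]
plot_obs = [
    {"obs_id": 1, "site_id": "S1", "count": 12},
    {"obs_id": 2, "site_id": "S1", "count": 7},
    {"obs_id": 3, "site_id": "S2", "count": 3},
]


def join_on_key(child_rows, parent_rows, key):
    """Attach the matching parent row to each child row using a shared key."""
    parents = {}
    for parent in parent_rows:
        if parent[key] in parents:
            # A primary key must identify exactly one row.
            raise ValueError(f"duplicate primary key: {parent[key]!r}")
        parents[parent[key]] = parent
    joined = []
    for child in child_rows:
        parent = parents[child[key]]  # a KeyError here means a dangling foreign key
        joined.append({**parent, **child})
    return joined


joined = join_on_key(plot_obs, sites, "site_id")
```

Each joined row carries the observation together with the attributes of its site, without those attributes having been stored in the observation table.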

How to relate tables?

Entity-Relationship Model (ER)

An Entity-Relationship model allows us to compactly draw the structure of the tables in a relational database, including the primary and foreign keys in the tables.

In the above model, one can see that each site in the SITES table must have one or more observations in the PLOTOBS table, whereas each PLOTOBS has one and only one SITE.
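
A relationship like this one could be declared in a relational database roughly as follows. This is a sketch using Python's built-in sqlite3 module; the table names SITES and PLOTOBS come from the model, while the column names and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE sites (
    site_id   TEXT PRIMARY KEY,
    site_name TEXT NOT NULL
);
CREATE TABLE plotobs (
    obs_id        INTEGER PRIMARY KEY,
    -- NOT NULL plus REFERENCES: each observation has one and only one site
    site_id       TEXT NOT NULL REFERENCES sites(site_id),
    n_individuals INTEGER
);
""")

conn.execute("INSERT INTO sites VALUES ('S1', 'North Meadow')")
conn.execute("INSERT INTO sites VALUES ('S2', 'River Bank')")
conn.execute("INSERT INTO plotobs VALUES (1, 'S1', 12)")

# An observation that points to an unknown site is rejected by the foreign key.
try:
    conn.execute("INSERT INTO plotobs VALUES (2, 'S9', 4)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# "Each site has one or more observations" is not a simple column constraint,
# so this sketch checks it with a query instead.
sites_without_obs = [
    row[0]
    for row in conn.execute(
        "SELECT s.site_id FROM sites s "
        "LEFT JOIN plotobs p ON p.site_id = s.site_id "
        "WHERE p.obs_id IS NULL"
    )
]
```

The foreign key enforces the "one and only one SITE" side of the relationship, and the query reports any site that still lacks observations.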

Simple Guidelines for Effective Data

Design to add rows, not columns
Each column one type
Header line
Non-proprietary formats
Descriptive names
No spaces

Semantic Ambiguity

Column headers:
Avoid cryptic names
Concise, but still meaningful
Units (kg or g?)
Color coding:
avoid using formatting (implicit)
add a column to store this information with a flag
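
As a small illustration of the last two points, the sketch below writes a CSV file in which the unit is part of the column name and a former cell color becomes an explicit flag column. It assumes a hypothetical spreadsheet where suspect values were highlighted; the column names and flag values are made up for the example.

```python
import csv
import io

# Hypothetical records: in the spreadsheet, suspect values were highlighted in red.
records = [
    {"sample_id": "A1", "mass_g": 10.2, "highlighted": False},
    {"sample_id": "A2", "mass_g": 98.7, "highlighted": True},
]


def to_csv_with_flag(rows):
    """Write the rows as CSV text, storing the highlight as an explicit qc_flag column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["sample_id", "mass_g", "qc_flag"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "sample_id": row["sample_id"],
            "mass_g": row["mass_g"],  # the unit (grams) is stated in the column name
            "qc_flag": "suspect" if row["highlighted"] else "ok",
        })
    return buffer.getvalue()


csv_text = to_csv_with_flag(records)
```

The flag survives export to any plain-text format, whereas cell formatting is lost as soon as the data leave the spreadsheet.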

Resources used

Recknagel, F., Michener, W.K., 2018. Ecological informatics: data management and knowledge discovery, 3rd ed. Springer, Cham.
DataONE, Data Life Cycle: https://www.dataone.org/data-life-cycle
DataONE data management guide: https://www.dataone.org/sites/all/documents/DataONE-PPSR-DataManagementGuide.pdf
ESIP, Data Management Plans: http://commons.esipfed.org/datamanagementshortcourse (benefits slides were adapted from this material)
Borer, Elizabeth T., Eric W. Seabloom, Matthew B. Jones, and Mark Schildhauer. (2009) "Some Simple Guidelines for Effective Data Management." The Bulletin of the Ecological Society of America 90, no. 2: 205-14. https://doi.org/10.1890/0012-9623-90.2.205.
Michener, W. K. (2015). Ten Simple Rules for Creating a Good Data Management Plan. PLoS Comput Biol, 11(10). https://doi.org/10.1371/journal.pcbi.1004525
Software Carpentry SQL tutorial
Tidy Data