Managing your data

Data is a valuable asset!

Data should be managed to:

  • Maximize the effective use and value of data
  • Continually improve the quality (accuracy, integrity, integration,…)
  • Ensure appropriate use of data and information
  • Facilitate data sharing
  • Ensure sustainability and accessibility in the long term for re-use in science

Data Life Cycle

The data life cycle provides a high-level overview of the stages involved in the successful management and preservation of data for use and reuse.

Data Management Plans (DMP)

A data management plan describes how you will manage your data during the lifetime of a research project. The process of creating your DMP will force you to think about potential issues related to the project's data that could affect the timeline, costs, and personnel needed.

Goal: Answer these 4 questions:

  1. How much data will be collected and how will it be treated?
  2. How much time is needed to manage the data and who will be responsible for doing so?
  3. How long should the data be preserved and where is the best location to do so?
  4. Are there any legal constraints associated with acquiring, using and sharing project data?

from Recknagel and Michener. "Ecological Informatics", 2017

Don't lose your data

Accidents happen!

Document: The Sooner the Better

Document and preserve your data while you are actively analyzing them!

Recknagel and Michener, 2017

Document Soon: It Is Also for Yourself

You will not have to remember:

  • The name of that file?
  • The directory where you put it?
  • The units those measurements were taken in?
  • Which sample site was which?
  • Is it the cleaned version of the data set used for publication?

=> Easier to share with others, good for collaborations!

Not only Data

We have mainly been talking about data, but these rules apply to all the scientific processes and products generated by a research project, including:

  • Your scientific workflow
  • The scripts you developed to manipulate and analyse the data
  • The models and tools you used or developed

Data Modeling for Data Reuse

Learning Objectives

  • Understand basics of data models
  • Learn how to design and create effective data tables

Benefits of normalized or "tidy" data

Analytical:

  • Powerful search and filtering
  • Handle large and/or complex data sets
  • Help to enforce data integrity
  • Easier to handle data updates

=> Easier to conduct your analysis, and even more so for others!

Preservation:

  • Easier to describe
  • Easier to automate metadata creation
  • Easier to implement quality checks

Data Heterogeneity

Data are heterogeneous in:

  • Structure (schema): Logical model of the data (e.g., tables, hierarchical trees, raster images, etc.)
  • Semantics: Specific meaning of the data (e.g., nature and types of measurements, importance of contextual information, interpretation of record structure, etc.): documentation
  • Syntax: Digital format of the data (e.g., csv, “R data frame”, NetCDF, Excel XLSX, DBF, etc.)

Why Tabular Data?

Spreadsheets are (still) the primary data entry tool of the digital age!

Spreadsheets

the Good:

  • Quick on the draw (clickety-click and you’re ready for action)
  • Always there (on most everyone’s computer)
  • Smarter than he lets on (stats, pivot tables, VB scripts)
  • Cleans up real pretty (graphics, fonts, colors, borders)

the Ugly:

  • Ill-mannered: takes your data prisoner; conflates raw data with summary data
  • Gaudy: Use of visual formatting to indicate critical metadata or other semantic tidbits
  • Shifty: Cross-linking of worksheets sets up “invisible” dependencies
  • The more complicated your Spreadsheet, the UGLIER it gets in terms of using it with other software

=> Spreadsheets encourage you to mix your data and your analysis

Data Organization

Multiple tables

Inconsistent observations

Inconsistent variables

Marginal sums and statistics

=> A spreadsheet is not a table!
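
One way to keep marginal sums out of the data table is to store only the raw observations and compute the summaries on demand. The sketch below assumes a hypothetical table with `site_id` and `count` columns; the names and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical raw observations: no "Total" row, no summary column.
observations = [
    {"site_id": "S1", "count": 12},
    {"site_id": "S1", "count": 3},
    {"site_id": "S2", "count": 7},
]

def totals_by(rows, key, value):
    """Sum the `value` column for each distinct entry of the `key` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

# The summary is derived from the data; it is never stored alongside it.
site_totals = totals_by(observations, "site_id", "count")
print(site_totals)  # {'S1': 15, 'S2': 7}
```

Because the totals are recomputed from the rows, they cannot drift out of sync when an observation is corrected or added.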

Good enough data modeling

Terminological Soup

Table = Relation = Data set (~ Worksheet)

Column = Variable = Attribute = Characteristic

Row = Record = Tuple <> Observation

Keys are used to Join or Merge

Cell = Value = Measurement

Data Model = Schema

Denormalized data (aka non-tidy)

Observations about different entities combined

Tabular data

Observations. A better way to model data is to organize the observations about each type of entity in its own table. This results in:

  • Separate tables for each type of entity measured
  • Each row represents a single observed entity
  • Observations (rows) are all unique

This is normalized data (aka tidy data)
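
As a rough illustration of this split, the sketch below separates a denormalized table into one table per entity type. The column names (`site_id`, `site_name`, `obs_id`, `species`, `count`) and all values are hypothetical.

```python
# Hypothetical denormalized rows: site attributes repeat on every observation.
denormalized = [
    {"site_id": "S1", "site_name": "North Meadow", "obs_id": 1, "species": "Poa pratensis", "count": 12},
    {"site_id": "S1", "site_name": "North Meadow", "obs_id": 2, "species": "Carex nigra", "count": 3},
    {"site_id": "S2", "site_name": "River Bank", "obs_id": 3, "species": "Poa pratensis", "count": 7},
]

def normalize(rows):
    """Split rows into one table per entity type: sites and observations."""
    sites = {}
    observations = []
    for row in rows:
        # One row per site, keyed by its identifier (repeats collapse).
        sites[row["site_id"]] = {"site_id": row["site_id"], "site_name": row["site_name"]}
        # One row per observation; only the site identifier is kept as a link.
        observations.append({
            "obs_id": row["obs_id"],
            "site_id": row["site_id"],
            "species": row["species"],
            "count": row["count"],
        })
    return list(sites.values()), observations

sites, observations = normalize(denormalized)
print(len(sites), "sites;", len(observations), "observations")  # 2 sites; 3 observations
```

After the split, a site name is stored once, so correcting it means changing a single row.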

Variables. In addition, for normalized data, we expect the variables to be organized such that:

  • All values in a column are of the same type
  • All columns pertain to the same observed entity (e.g., row)
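
The first expectation can be checked mechanically. The sketch below is one possible check, not a standard tool: it reports the columns that hold more than one Python type. The example table is hypothetical.

```python
def column_types(rows):
    """Map each column name to the set of Python type names found in it."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set()).add(type(value).__name__)
    return types

def mixed_columns(rows):
    """Return the sorted names of columns holding more than one type."""
    return sorted(name for name, found in column_types(rows).items() if len(found) > 1)

# Hypothetical example: "count" mixes integers with a free-text note.
rows = [
    {"plot": "A", "count": 12},
    {"plot": "B", "count": "about 10"},
    {"plot": "C", "count": 7},
]
print(mixed_columns(rows))  # ['count']
```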

How to relate tables?

With normalized data, we often use unique identifiers to reference particular observations, which allows us to link across tables. Two types of identifiers are common within relational data:

  • Primary Key: unique identifier for each observed entity, one per row
  • Foreign Key: reference to a primary key in another table (linkage)
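
A minimal sketch of how the two kinds of key work together in a join, written in plain Python. The tables, the `site_id` key, and the helper `join_on_key` are hypothetical names chosen for illustration.

```python
# Hypothetical tables: site_id is the primary key of sites
# and a foreign key in observations.
sites = [
    {"site_id": "S1", "site_name": "North Meadow"},
    {"site_id": "S2", "site_name": "River Bank"},
]
observations = [
    {"obs_id": 1, "site_id": "S1", "count": 12},
    {"obs_id": 2, "site_id": "S2", "count": 7},
    {"obs_id": 3, "site_id": "S1", "count": 3},
]

def join_on_key(left, right, key):
    """Attach to each row of `left` the row of `right` that shares its `key`."""
    index = {}
    for row in right:
        if row[key] in index:
            # A primary key must identify exactly one row.
            raise ValueError(f"duplicate primary key: {row[key]!r}")
        index[row[key]] = row
    # A KeyError here means a foreign key points to no existing row.
    return [{**index[row[key]], **row} for row in left]

joined = join_on_key(observations, sites, "site_id")
print(joined[0]["site_name"], joined[0]["count"])  # North Meadow 12
```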

Entity-Relationship Model (ER)

An Entity-Relationship model allows us to compactly draw the structure of the tables in a relational database, including the primary and foreign keys in the tables.

In the above model, one can see that each site in the SITES table must have one or more observations in the PLOTOBS table, whereas each PLOTOBS has one and only one SITE.
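
The SITES and PLOTOBS relationship can be sketched as a relational schema, here with Python's built-in `sqlite3` module. Only the two table names come from the model; the column names (`site_id`, `site_name`, `obs_id`, `n_individuals`) and the values are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless asked to.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sites (
    site_id   TEXT PRIMARY KEY,
    site_name TEXT NOT NULL
);
CREATE TABLE plotobs (
    obs_id        INTEGER PRIMARY KEY,
    -- NOT NULL + REFERENCES: each observation has one and only one site.
    site_id       TEXT NOT NULL REFERENCES sites(site_id),
    n_individuals INTEGER
);
""")

conn.execute("INSERT INTO sites VALUES ('S1', 'North Meadow')")
conn.execute("INSERT INTO plotobs VALUES (1, 'S1', 12)")

try:
    # 'S9' is not in sites, so the foreign key constraint rejects this row.
    conn.execute("INSERT INTO plotobs VALUES (2, 'S9', 5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("orphan observation rejected:", rejected)
```

The schema enforces the "one and only one site" side of the relationship. The "one or more observations per site" side is not expressed by a plain foreign key and would need a separate check.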

Simple Guidelines for Effective Data

  • Design to add rows, not columns
  • Each column one type
  • Header line
  • Non-proprietary formats
  • Descriptive names
  • No spaces
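
Two of these guidelines lend themselves to a short sketch: designing to add rows rather than columns, and avoiding spaces in names. The helper names `wide_to_long` and `clean_header`, and the example table, are hypothetical.

```python
def wide_to_long(rows, id_column, variable_name, value_name):
    """Turn one column per measurement date into one row per measurement."""
    long_table = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_table.append({id_column: row[id_column], variable_name: column, value_name: value})
    return long_table

def clean_header(name):
    """Lower-case a header and replace runs of whitespace with underscores."""
    return "_".join(name.strip().lower().split())

# Hypothetical wide table: every new sampling date would need a new column.
wide = [
    {"site_id": "S1", "2017-06-01": 12, "2017-07-01": 15},
    {"site_id": "S2", "2017-06-01": 7, "2017-07-01": 9},
]
long_table = wide_to_long(wide, "site_id", "sample_date", "count")
print(len(long_table))          # 4
print(clean_header("Plot ID"))  # plot_id
```

In the long layout, a new sampling date adds rows and leaves the column structure untouched.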

Semantic Ambiguity

  • Column headers:
      • Avoid cryptic names
      • Avoid names that are concise but not meaningful
      • State the units (kg or g?)
  • Color coding:
      • Avoid using formatting to carry information (it stays implicit)
      • Add a column to store this information with a flag
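
A small sketch of the flag-column idea: information that a spreadsheet would carry as a cell color is written as an explicit column, and the unit is part of the column name. The column names `mass_g` and `qc_flag`, the flag values, and the data are hypothetical.

```python
import csv
import io

# Hypothetical rows: in the spreadsheet, the second row was highlighted
# to mark a suspect measurement. Here that information lives in a column.
rows = [
    {"obs_id": 1, "mass_g": 12.4, "qc_flag": "ok"},
    {"obs_id": 2, "mass_g": 124.0, "qc_flag": "suspect"},
    {"obs_id": 3, "mass_g": 11.9, "qc_flag": "ok"},
]

# Write to CSV (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["obs_id", "mass_g", "qc_flag"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the CSV back: the flag survives the export and can drive a filter.
kept = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["qc_flag"] == "ok"]
print([r["obs_id"] for r in kept])  # ['1', '3']
```

Cell colors are lost when a worksheet is exported to CSV, whereas the flag column travels with the data.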

Resources used