Introduction

Science metadata is:

Science metadata underpins data repositories:

The R EML package aims to help us generate EML science metadata from within R. The learning process here is two-fold:

Both of these are relatively hard!

Today, I’ll show a bit of EML and a bit of the EML R package in the hopes that when it comes time to create your own science metadata you’ll know where to look for guidance.

Learning outcomes:

Upon completing this module, students will

Lesson

The EML standard

EML covers a lot of ground. Most importantly:

  • Proper citation of your dataset
  • Who is involved with the dataset and how
  • Coverage (temporal, spatial, taxonomic)
  • Methodological information
  • Documentation on files and their formats

Examples:

Generating an EML record from scratch

This is modeled after this vignette inside the EML package: https://github.com/ropensci/EML/blob/master/vignettes/creating-EML.Rmd

The dataset

As an example, let’s create an EML record for the iris dataset that ships with R in the built-in datasets package.

data("iris") # part of the datasets package, which R loads by default
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The metadata

First we load the EML package:

library(EML)

The easiest way to create an EML record from scratch is to get the information into R first, then create the EML record with that information.

So let’s start with the title and abstract:

title <- "Edgar Anderson's Iris Data"

The easiest way to set an abstract is to create a separate Markdown file which lets us use rich formatting:

abstract <- as(set_TextType("./abstract.md"), "abstract")
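
The contents of abstract.md are not shown in this lesson. As a minimal sketch, assuming placeholder text (the wording below is invented for illustration and is not the real abstract), the file could be created from R before running the line above:

```r
# Placeholder abstract text (invented for illustration only).
abstract_lines <- c(
  "This dataset gives sepal and petal measurements for 50 flowers",
  "from each of three species of *Iris*.",
  "",
  "The three species are *Iris setosa*, *I. versicolor*, and *I. virginica*."
)

# Write the Markdown file where the lesson expects to find it.
# Note: this overwrites any existing abstract.md in the working directory.
writeLines(abstract_lines, "abstract.md")

# Confirm the file exists and preview its contents.
file.exists("abstract.md")
readLines("abstract.md")
```

Keeping the abstract in its own file means it can be edited and reviewed as ordinary Markdown without touching the R script.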

Though not required, including licensing information is a crucial step in metadata authoring. Let’s use the Creative Commons Attribution license, which is a permissive license.

intellectualRights <- "This work is licensed under a Creative Commons Attribution 4.0 International License."

Every dataset should have a publication date. I can guess at the most appropriate publication date from the ?iris help page.

pubDate <- "1935"

Keywords

Search systems often take advantage of keywords to make it easier to find what you’re looking for and find related datasets.

keywordSet <-
  c(new("keywordSet",
        keyword = c("iris",
                    "ra fisher",
                    "setosa",
                    "virginica",
                    "versicolor")))
keywordSet
## An object of class "ListOfkeywordSet"
## [[1]]
## <keywordSet>
##   <keyword>iris</keyword>
##   <keyword>ra fisher</keyword>
##   <keyword>setosa</keyword>
##   <keyword>virginica</keyword>
##   <keyword>versicolor</keyword>
## </keywordSet>

Parties

Every EML record needs to have a creator and a contact set. The creator is the party or parties (e.g., person, organization) that should be cited when giving credit for the dataset.

edgar <- as.person("Edgar Anderson <edgaranderson@iris.net>") # Fake email
creator <- as(edgar, "creator")
## Warning: Person Edgar Anderson <edgaranderson@iris.net> was not given any
## role.
contact <- as(edgar, "contact")
## Warning: Person Edgar Anderson <edgaranderson@iris.net> was not given any
## role.
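
The warnings above say the person was not given any role. One possible way to address that, sketched here with base R's person() function, is to attach a role when the person is built. Whether this silences the warning in your installed version of the EML package is an assumption and not something confirmed here. The variable name edgar_with_role is made up for this example.

```r
# Build the person with an explicit role. "cre" is R's standard role
# code for a creator. The email address is still fake.
edgar_with_role <- person(given = "Edgar",
                          family = "Anderson",
                          email = "edgaranderson@iris.net",
                          role = "cre")

# Inspect the fields that were set.
edgar_with_role$role
format(edgar_with_role, include = c("given", "family", "role"))
```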

Methods

We don’t have detailed methods for this dataset so we’ll make something up. The easiest way to get methods into an EML record is to create a separate Markdown file which lets us get rich formatting.

methods <- set_methods("methods.md")
methods
## <methods>
##   <methodStep>
##     <description>
##       <section>
##         <title>First step</title>
##         <para>
##     First we set up the experiment
##   </para>
##       </section>
##       <section>
##         <title>Second step</title>
##         <para>
##     Then we observed the experiment
##   </para>
##       </section>
##       <section>
##         <title>Third step</title>
##         <para>
##     Finally we analyzed the results
##   </para>
##       </section>
##     </description>
##   </methodStep>
## </methods>
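
The methods.md file itself is not shown. Working backwards from the output above, a file along the following lines would plausibly produce those three sections. The use of level-1 Markdown headings for the section titles is an assumption.

```r
# Markdown matching the three sections shown in the output above.
# Using "#" (level-1) headings for the section titles is an assumption.
methods_lines <- c(
  "# First step",
  "",
  "First we set up the experiment",
  "",
  "# Second step",
  "",
  "Then we observed the experiment",
  "",
  "# Third step",
  "",
  "Finally we analyzed the results"
)

# Note: this overwrites any existing methods.md in the working directory.
writeLines(methods_lines, "methods.md")

# Count the headings to check the file has the three expected steps.
sum(grepl("^# ", readLines("methods.md")))
```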

Coverage

We also don’t have detailed coverage information but we can fill some things in from a bit of research.

coverage <- 
  set_coverage(beginDate = '1936-01-01', 
               endDate = '1936-12-31', # Fake temporal information
               sci_names = c("Iris setosa", "Iris versicolor", "Iris virginica"),
               geographicDescription = "Gaspé Peninsula", # Approximated spatial coverage
               westBoundingCoordinate = -65.75, 
               eastBoundingCoordinate = -65.75,
               northBoundingCoordinate = 48.66, 
               southBoundingCoordinate = 48.66)
coverage
## <coverage system="uuid">
##   <geographicCoverage>
##     <geographicDescription>Gaspé Peninsula</geographicDescription>
##     <boundingCoordinates>
##       <westBoundingCoordinate>-65.75</westBoundingCoordinate>
##       <eastBoundingCoordinate>-65.75</eastBoundingCoordinate>
##       <northBoundingCoordinate>48.66</northBoundingCoordinate>
##       <southBoundingCoordinate>48.66</southBoundingCoordinate>
##     </boundingCoordinates>
##   </geographicCoverage>
##   <temporalCoverage>
##     <rangeOfDates>
##       <beginDate>
##         <calendarDate>1936-01-01</calendarDate>
##       </beginDate>
##       <endDate>
##         <calendarDate>1936-12-31</calendarDate>
##       </endDate>
##     </rangeOfDates>
##   </temporalCoverage>
##   <taxonomicCoverage>
##     <taxonomicClassification>
##       <taxonRankName>genus</taxonRankName>
##       <taxonRankValue>Iris</taxonRankValue>
##       <taxonomicClassification>
##         <taxonRankName>species</taxonRankName>
##         <taxonRankValue>Iris setosa</taxonRankValue>
##       </taxonomicClassification>
##     </taxonomicClassification>
##     <taxonomicClassification>
##       <taxonRankName>genus</taxonRankName>
##       <taxonRankValue>Iris</taxonRankValue>
##       <taxonomicClassification>
##         <taxonRankName>species</taxonRankName>
##         <taxonRankValue>Iris versicolor</taxonRankValue>
##       </taxonomicClassification>
##     </taxonomicClassification>
##     <taxonomicClassification>
##       <taxonRankName>genus</taxonRankName>
##       <taxonRankValue>Iris</taxonRankValue>
##       <taxonomicClassification>
##         <taxonRankName>species</taxonRankName>
##         <taxonRankValue>Iris virginica</taxonRankValue>
##       </taxonomicClassification>
##     </taxonomicClassification>
##   </taxonomicCoverage>
## </coverage>

Attributes

Attributes are one of the more powerful parts of EML. We can describe, in very specific detail, the meaning of the tabular data we’re documenting. A lot of information is required to sufficiently describe datasets so we’ll have to enter in a fair bit of information. The easiest way to do that is to create a separate CSV file with a set of columns that the EML package is looking for and bring it in as a data.frame.

attributes <- read.csv("attributes.csv")

# For the Species column, we need to define the values as codes and we need
# to tell EML what they mean
species_codes <- c("setosa" = "Iris setosa",
                   "virginica" = "Iris virginica",
                   "versicolor" = "Iris versicolor")

factors <- data.frame(attributeName = "Species",
                      code = names(species_codes),
                      definition = species_codes)

attributeList <- set_attributes(attributes, 
                                factors,
                                col_classes = c("numeric",
                                                "numeric",
                                                "numeric",
                                                "numeric",
                                                "factor"))
attributeList
## <attributeList>
##   <attribute>
##     <attributeName>Sepal.Length</attributeName>
##     <attributeDefinition>Length of the sepal as measured using methodology X</attributeDefinition>
##     <storageType>float</storageType>
##     <measurementScale>
##       <ratio>
##         <unit>
##           <standardUnit>millimeter</standardUnit>
##         </unit>
##         <numericDomain>
##           <numberType>real</numberType>
##         </numericDomain>
##       </ratio>
##     </measurementScale>
##   </attribute>
##   <attribute>
##     <attributeName>Sepal.Width</attributeName>
##     <attributeDefinition>Width of the sepal as measured using methodology X</attributeDefinition>
##     <storageType>float</storageType>
##     <measurementScale>
##       <ratio>
##         <unit>
##           <standardUnit>millimeter</standardUnit>
##         </unit>
##         <numericDomain>
##           <numberType>real</numberType>
##         </numericDomain>
##       </ratio>
##     </measurementScale>
##   </attribute>
##   <attribute>
##     <attributeName>Petal.Length</attributeName>
##     <attributeDefinition>Length of the petal as measured using methodology X</attributeDefinition>
##     <storageType>float</storageType>
##     <measurementScale>
##       <ratio>
##         <unit>
##           <standardUnit>millimeter</standardUnit>
##         </unit>
##         <numericDomain>
##           <numberType>real</numberType>
##         </numericDomain>
##       </ratio>
##     </measurementScale>
##   </attribute>
##   <attribute>
##     <attributeName>Petal.Width</attributeName>
##     <attributeDefinition>Width of the petal as measured using methodology X</attributeDefinition>
##     <storageType>float</storageType>
##     <measurementScale>
##       <ratio>
##         <unit>
##           <standardUnit>millimeter</standardUnit>
##         </unit>
##         <numericDomain>
##           <numberType>real</numberType>
##         </numericDomain>
##       </ratio>
##     </measurementScale>
##   </attribute>
##   <attribute>
##     <attributeName>Species</attributeName>
##     <attributeDefinition>Species measured</attributeDefinition>
##     <storageType>string</storageType>
##     <measurementScale>
##       <nominal>
##         <nonNumericDomain>
##           <enumeratedDomain>
##             <codeDefinition>
##               <code>setosa</code>
##               <definition>Iris setosa</definition>
##             </codeDefinition>
##             <codeDefinition>
##               <code>virginica</code>
##               <definition>Iris virginica</definition>
##             </codeDefinition>
##             <codeDefinition>
##               <code>versicolor</code>
##               <definition>Iris versicolor</definition>
##             </codeDefinition>
##           </enumeratedDomain>
##         </nonNumericDomain>
##       </nominal>
##     </measurementScale>
##   </attribute>
## </attributeList>
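
The attributes.csv file read earlier is not shown in this lesson. The sketch below rebuilds a plausible version of it from the values in the attributeList output above. The column names (attributeName, attributeDefinition, unit, numberType) are assumed to be the ones set_attributes() looks for, and the object name attributes_sketch is made up for this example.

```r
# One row per column of iris. The values are copied from the
# attributeList output above; the column names are an assumption about
# what set_attributes() expects.
attributes_sketch <- data.frame(
  attributeName = c("Sepal.Length", "Sepal.Width",
                    "Petal.Length", "Petal.Width", "Species"),
  attributeDefinition = c(
    "Length of the sepal as measured using methodology X",
    "Width of the sepal as measured using methodology X",
    "Length of the petal as measured using methodology X",
    "Width of the petal as measured using methodology X",
    "Species measured"),
  unit = c(rep("millimeter", 4), NA),   # Species has no unit
  numberType = c(rep("real", 4), NA),   # Species is not numeric
  stringsAsFactors = FALSE
)

# Note: this overwrites any existing attributes.csv in the working directory.
write.csv(attributes_sketch, "attributes.csv", row.names = FALSE)

# Reading it back should give five rows, one per iris column.
nrow(read.csv("attributes.csv"))
```

Keeping this table in a CSV file makes it easy to fill in with a spreadsheet program and to review alongside the data.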

Entities

We’ve described the attributes (columns) for iris.csv but we haven’t described iris.csv itself. In EML, files like this are called entities, and entities contain information about their file formats and more.

write.csv(iris, "iris.csv", row.names = FALSE)
physical <- set_physical("iris.csv", 
                         size = as.character(file.size("iris.csv")),
                         authentication = digest::digest("iris.csv", algo = "md5", file = TRUE),
                         authMethod = "MD5")
physical
## <physical system="uuid">
##   <objectName>iris.csv</objectName>
##   <size unit="bytes">4026</size>
##   <authentication method="MD5">5fe92fe6a2c1928ef5a67b8939fdaf8d</authentication>
##   <dataFormat>
##     <textFormat>
##       <recordDelimiter>\n\r</recordDelimiter>
##       <attributeOrientation>column</attributeOrientation>
##       <simpleDelimited>
##         <fieldDelimiter>,</fieldDelimiter>
##       </simpleDelimited>
##     </textFormat>
##   </dataFormat>
## </physical>
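
The size and checksum recorded in the physical element can be checked independently of the EML package. Here is a minimal sketch using only base R, where tools::md5sum() stands in for digest::digest() (a substitution made for this example, not the lesson's method) and a temporary file keeps the working directory untouched. The exact size and checksum depend on the platform's line endings, so no expected values are shown.

```r
# Write iris to a temporary CSV so nothing in the working directory changes.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# File size in bytes, as a character string like the one passed to `size`.
size_bytes <- as.character(file.size(csv_path))

# MD5 checksum of the file contents; unname() drops the file-path label.
md5 <- unname(tools::md5sum(csv_path))

size_bytes
md5
```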

Because iris.csv is tabular, we create an entity of type dataTable:

dataTable <- new("dataTable",
                 entityName = "iris.csv",
                 entityDescription = "Edgar Anderson's Iris data exported from R",
                 physical = physical,
                 attributeList = attributeList)

Note that the attributeList we created before gets entered directly into the dataTable entity.

Create the eml object

Now that we have everything entered into R, we can create the dataset object and then wrap it in an eml object:

dataset <- new("dataset",
               title = title,
               creator = edgar,
               pubDate = pubDate,
               intellectualRights = intellectualRights,
               abstract = abstract,
               keywordSet = keywordSet,
               coverage = coverage,
               contact = contact,
               methods = methods,
               dataTable = dataTable)
## Warning: Person Edgar Anderson <edgaranderson@iris.net> was not given any
## role.
eml <- new("eml",
           packageId = paste0("urn:uuid:", uuid::UUIDgenerate()),
           system = "uuid",
           dataset = dataset)
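
A packageId written as a UUID URN has the form urn:uuid: followed by the UUID, with a colon on both sides of the word uuid. The sketch below checks that form with a regular expression. The UUID is a fixed, made-up value so the example runs without the uuid package, and the names example_uuid, package_id, and urn_pattern are hypothetical.

```r
# A fixed, made-up UUID so the example runs without the uuid package.
example_uuid <- "123e4567-e89b-12d3-a456-426614174000"
package_id <- paste0("urn:uuid:", example_uuid)

# A UUID URN is "urn:uuid:" followed by 8-4-4-4-12 hexadecimal digits.
urn_pattern <- paste0("^urn:uuid:",
                      "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-",
                      "[0-9a-f]{4}-[0-9a-f]{12}$")

# TRUE when the identifier is well formed.
grepl(urn_pattern, package_id)
```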

Save and validate

Now that our eml object is created, we can save it:

write_eml(eml, "eml.xml")

We should also validate the file:

eml_validate("eml.xml")
## [1] TRUE
## attr(,"errors")
## character(0)

Summary

Resources