Science metadata is:
Science metadata underpins data repositories:
The R EML package aims to help us generate EML science metadata from within R. The learning process here is two-fold: learning EML itself and learning the EML R package.
Both of these are relatively hard!
Today, I’ll show a bit of EML and a bit of the EML R package in the hopes that when it comes time to create your own science metadata you’ll know where to look for guidance.
Upon completing this module, students will
EML covers lots of stuff, importantly:
Examples:
This is modeled after this vignette inside the EML package: https://github.com/ropensci/EML/blob/master/vignettes/creating-EML.Rmd
As an example, let’s create an EML record for the iris dataset that ships with R in the built-in datasets package.
data("iris") # from the built-in datasets package; no extra library needed
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
First we load the EML package:
library(EML)
The easiest way to create an EML record from scratch is to get the information into R first, then create the EML record with that information.
So let’s start with the title and abstract:
title <- "Edgar Anderson's Iris Data"
The easiest way to set an abstract is to create a separate Markdown file which lets us use rich formatting:
abstract <- as(set_TextType("./abstract.md"), "abstract")
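The call above assumes an abstract.md file already exists in the working directory. A minimal sketch of creating one from R follows; the wording of the abstract is illustrative only and paraphrases the ?iris help page.

```r
# Hypothetical abstract text; replace with a real description of your dataset.
abstract_md <- c(
  "This dataset gives measurements of sepal length, sepal width,",
  "petal length, and petal width for 50 flowers from each of three",
  "species of *Iris*: *Iris setosa*, *I. versicolor*, and *I. virginica*."
)

# Write the Markdown file that set_TextType() is pointed at.
writeLines(abstract_md, "abstract.md")

# Read it back to confirm what was written.
readLines("abstract.md")
```

Keeping the abstract in its own file also makes it easy to edit without touching the R script.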
Though not required, including licensing information is a crucial step in metadata authoring. Let’s use the Creative Commons Attribution license, which is a permissive license.
intellectualRights <- "This work is licensed under a Creative Commons Attribution 4.0 International License."
Every dataset should have a publication date. I can guess at the most appropriate publication date from the ?iris help page.
pubDate <- "1935"
Search systems often take advantage of keywords to make it easier to find what you’re looking for and find related datasets.
keywordSet <-
c(new("keywordSet",
keyword = c("iris",
"ra fisher",
"setosa",
"virginica",
"versicolor")))
keywordSet
## An object of class "ListOfkeywordSet"
## [[1]]
## <keywordSet>
## <keyword>iris</keyword>
## <keyword>ra fisher</keyword>
## <keyword>setosa</keyword>
## <keyword>virginica</keyword>
## <keyword>versicolor</keyword>
## </keywordSet>
Every EML record needs to have a creator and a contact set. The creator is the party or parties (e.g., person, organization) that should be cited when giving credit for the dataset.
edgar <- as.person("Edgar Anderson <edgaranderson@iris.net>") # Fake email
creator <- as(edgar, "creator")
## Warning: Person Edgar Anderson <edgaranderson@iris.net> was not given any
## role.
contact <- as(edgar, "contact")
## Warning: Person Edgar Anderson <edgaranderson@iris.net> was not given any
## role.
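The warnings above say the person object carries no role. One thing to try is building the person with base R’s person() and an explicit role. Whether this silences the warning during coercion is an assumption I have not verified against the EML package, so treat this as a sketch.

```r
# Build the person explicitly with utils::person() instead of parsing a string.
# "cre" is a standard role code in R's person class; whether EML's coercion
# accepts it as the missing role is an assumption to verify.
edgar <- person(given = "Edgar",
                family = "Anderson",
                email = "edgaranderson@iris.net", # Fake email
                role = "cre")

# Inspect the fields that were set.
edgar$role
format(edgar, include = c("given", "family", "email"))
```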
We don’t have detailed methods for this dataset so we’ll make something up. The easiest way to get methods into an EML record is to create a separate Markdown file which lets us get rich formatting.
methods <- set_methods("methods.md")
methods
## <methods>
## <methodStep>
## <description>
## <section>
## <title>First step</title>
## <para>
## First we set up the experiment
## </para>
## </section>
## <section>
## <title>Second step</title>
## <para>
## Then we observed the experiment
## </para>
## </section>
## <section>
## <title>Third step</title>
## <para>
## Finally we analyzed the results
## </para>
## </section>
## </description>
## </methodStep>
## </methods>
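The methods.md file itself is not shown. The three titled sections in the output above are consistent with a Markdown file that has three headings, so here is a sketch of such a file, assuming top-level headings are what end up as section titles.

```r
# Assumed layout of methods.md: one "#" heading per step, followed by a
# one-line paragraph. The text is taken from the output printed above.
methods_md <- c(
  "# First step", "",
  "First we set up the experiment", "",
  "# Second step", "",
  "Then we observed the experiment", "",
  "# Third step", "",
  "Finally we analyzed the results"
)

writeLines(methods_md, "methods.md")
```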
We also don’t have detailed coverage information but we can fill some things in from a bit of research.
coverage <-
  set_coverage(beginDate = '1936-01-01',
               endDate = '1936-12-31', # Fake temporal information
               sci_names = c("Iris setosa", "Iris versicolor", "Iris virginica"),
               geographicDescription = "Gaspé Peninsula", # Approximated spatial coverage
               westBoundingCoordinate = -65.75,
               eastBoundingCoordinate = -65.75,
               northBoundingCoordinate = 48.66,
               southBoundingCoordinate = 48.66)
coverage
## <coverage system="uuid">
## <geographicCoverage>
## <geographicDescription>Gaspé Peninsula</geographicDescription>
## <boundingCoordinates>
## <westBoundingCoordinate>-65.75</westBoundingCoordinate>
## <eastBoundingCoordinate>-65.75</eastBoundingCoordinate>
## <northBoundingCoordinate>48.66</northBoundingCoordinate>
## <southBoundingCoordinate>48.66</southBoundingCoordinate>
## </boundingCoordinates>
## </geographicCoverage>
## <temporalCoverage>
## <rangeOfDates>
## <beginDate>
## <calendarDate>1936-01-01</calendarDate>
## </beginDate>
## <endDate>
## <calendarDate>1936-12-31</calendarDate>
## </endDate>
## </rangeOfDates>
## </temporalCoverage>
## <taxonomicCoverage>
## <taxonomicClassification>
## <taxonRankName>genus</taxonRankName>
## <taxonRankValue>Iris</taxonRankValue>
## <taxonomicClassification>
## <taxonRankName>species</taxonRankName>
## <taxonRankValue>Iris setosa</taxonRankValue>
## </taxonomicClassification>
## </taxonomicClassification>
## <taxonomicClassification>
## <taxonRankName>genus</taxonRankName>
## <taxonRankValue>Iris</taxonRankValue>
## <taxonomicClassification>
## <taxonRankName>species</taxonRankName>
## <taxonRankValue>Iris versicolor</taxonRankValue>
## </taxonomicClassification>
## </taxonomicClassification>
## <taxonomicClassification>
## <taxonRankName>genus</taxonRankName>
## <taxonRankValue>Iris</taxonRankValue>
## <taxonomicClassification>
## <taxonRankName>species</taxonRankName>
## <taxonRankValue>Iris virginica</taxonRankValue>
## </taxonomicClassification>
## </taxonomicClassification>
## </taxonomicCoverage>
## </coverage>
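In the record above the west/east and north/south values are identical, so the bounding box collapses to a single point. If point-level coordinates were available, the four bounding values could be derived with range(). The coordinates below are made up for illustration and are not part of Anderson’s data.

```r
# Hypothetical sampling locations in decimal degrees; not real survey points.
sites <- data.frame(
  lon = c(-65.90, -65.75, -65.40),
  lat = c(48.55, 48.66, 48.80)
)

lon_range <- range(sites$lon) # c(min, max)
lat_range <- range(sites$lat)

bounding_box <- list(
  westBoundingCoordinate  = lon_range[1], # smallest longitude
  eastBoundingCoordinate  = lon_range[2], # largest longitude
  northBoundingCoordinate = lat_range[2], # largest latitude
  southBoundingCoordinate = lat_range[1]  # smallest latitude
)

str(bounding_box)
```

This simple min/max approach does not handle study areas that cross the 180° meridian.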
Attributes are one of the more powerful parts of EML. We can describe, in very specific detail, the meaning of the tabular data we’re documenting. A lot of information is required to sufficiently describe datasets so we’ll have to enter in a fair bit of information. The easiest way to do that is to create a separate CSV file with a set of columns that the EML package is looking for and bring it in as a data.frame.
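The attributes.csv file itself is not shown in this lesson. Judging from the attributeList output printed further down, it needs at least a name, a definition, a unit, and a number type for each column. Here is a sketch that builds such a file; the column names are my understanding of what set_attributes looks for and should be checked against ?set_attributes.

```r
# Column names are an assumption about what set_attributes() expects.
# Values mirror the attributeList output shown in this lesson. Note that the
# ?iris help page reports centimeters, so check the unit for your own data.
attributes_table <- data.frame(
  attributeName = c("Sepal.Length", "Sepal.Width",
                    "Petal.Length", "Petal.Width", "Species"),
  attributeDefinition = c(
    "Length of the sepal as measured using methodology X",
    "Width of the sepal as measured using methodology X",
    "Length of the petal as measured using methodology X",
    "Width of the petal as measured using methodology X",
    "Species measured"
  ),
  unit = c(rep("millimeter", 4), NA),  # no unit for the factor column
  numberType = c(rep("real", 4), NA),  # no number type for the factor column
  stringsAsFactors = FALSE
)

write.csv(attributes_table, "attributes.csv", row.names = FALSE)
```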
attributes <- read.csv("attributes.csv")
# For the Species column, we need to define the values as codes and we need
# to tell EML what they mean
species_codes <- c("setosa" = "Iris setosa",
                   "virginica" = "Iris virginica",
                   "versicolor" = "Iris versicolor")
factors <- data.frame(attributeName = "Species",
                      code = names(species_codes),
                      definition = species_codes)
attributeList <- set_attributes(attributes,
                                factors,
                                col_classes = c("numeric",
                                                "numeric",
                                                "numeric",
                                                "numeric",
                                                "factor"))
attributeList
## <attributeList>
## <attribute>
## <attributeName>Sepal.Length</attributeName>
## <attributeDefinition>Length of the sepal as measured using methodology X</attributeDefinition>
## <storageType>float</storageType>
## <measurementScale>
## <ratio>
## <unit>
## <standardUnit>millimeter</standardUnit>
## </unit>
## <numericDomain>
## <numberType>real</numberType>
## </numericDomain>
## </ratio>
## </measurementScale>
## </attribute>
## <attribute>
## <attributeName>Sepal.Width</attributeName>
## <attributeDefinition>Width of the sepal as measured using methodology X</attributeDefinition>
## <storageType>float</storageType>
## <measurementScale>
## <ratio>
## <unit>
## <standardUnit>millimeter</standardUnit>
## </unit>
## <numericDomain>
## <numberType>real</numberType>
## </numericDomain>
## </ratio>
## </measurementScale>
## </attribute>
## <attribute>
## <attributeName>Petal.Length</attributeName>
## <attributeDefinition>Length of the petal as measured using methodology X</attributeDefinition>
## <storageType>float</storageType>
## <measurementScale>
## <ratio>
## <unit>
## <standardUnit>millimeter</standardUnit>
## </unit>
## <numericDomain>
## <numberType>real</numberType>
## </numericDomain>
## </ratio>
## </measurementScale>
## </attribute>
## <attribute>
## <attributeName>Petal.Width</attributeName>
## <attributeDefinition>Width of the petal as measured using methodology X</attributeDefinition>
## <storageType>float</storageType>
## <measurementScale>
## <ratio>
## <unit>
## <standardUnit>millimeter</standardUnit>
## </unit>
## <numericDomain>
## <numberType>real</numberType>
## </numericDomain>
## </ratio>
## </measurementScale>
## </attribute>
## <attribute>
## <attributeName>Species</attributeName>
## <attributeDefinition>Species measured</attributeDefinition>
## <storageType>string</storageType>
## <measurementScale>
## <nominal>
## <nonNumericDomain>
## <enumeratedDomain>
## <codeDefinition>
## <code>setosa</code>
## <definition>Iris setosa</definition>
## </codeDefinition>
## <codeDefinition>
## <code>virginica</code>
## <definition>Iris virginica</definition>
## </codeDefinition>
## <codeDefinition>
## <code>versicolor</code>
## <definition>Iris versicolor</definition>
## </codeDefinition>
## </enumeratedDomain>
## </nonNumericDomain>
## </nominal>
## </measurementScale>
## </attribute>
## </attributeList>
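Typing column classes and species codes by hand invites typos. As a sketch, both can be derived from the data itself with base R; the names col_classes and species_factors are just illustrative.

```r
data("iris")

# One class per column, in column order.
col_classes <- vapply(iris,
                      function(column) class(column)[1],
                      character(1),
                      USE.NAMES = FALSE)
col_classes

# Build the code/definition table from the factor levels of Species.
species_levels <- levels(iris$Species)
species_factors <- data.frame(attributeName = "Species",
                              code = species_levels,
                              definition = paste("Iris", species_levels),
                              stringsAsFactors = FALSE)
species_factors
```

The resulting objects have the same shape as the hand-written col_classes vector and factors data.frame used above.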
We’ve described the attributes (columns) for iris.csv but we haven’t described iris.csv itself. In EML, files like this are called entities, and entities contain information about their file formats and more.
write.csv(iris, "iris.csv", row.names = FALSE)
physical <- set_physical("iris.csv",
                         size = as.character(file.size("iris.csv")),
                         authentication = digest::digest("iris.csv", algo = "md5", file = TRUE),
                         authMethod = "MD5")
physical
## <physical system="uuid">
## <objectName>iris.csv</objectName>
## <size unit="bytes">4026</size>
## <authentication method="MD5">5fe92fe6a2c1928ef5a67b8939fdaf8d</authentication>
## <dataFormat>
## <textFormat>
## <recordDelimiter>\n\r</recordDelimiter>
## <attributeOrientation>column</attributeOrientation>
## <simpleDelimited>
## <fieldDelimiter>,</fieldDelimiter>
## </simpleDelimited>
## </textFormat>
## </dataFormat>
## </physical>
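The size and MD5 checksum recorded in the physical element let someone confirm later that they have the same file. A minimal sketch of that check using only base R (tools::md5sum instead of the digest package); verify_file is a hypothetical helper, not part of EML.

```r
# Write a copy of the data to a temporary location for the demonstration.
path <- file.path(tempdir(), "iris.csv")
write.csv(iris, path, row.names = FALSE)

# Values that would be recorded in the metadata.
recorded_size <- file.size(path)
recorded_md5 <- unname(tools::md5sum(path))

# Hypothetical helper: TRUE only if the file exists and matches both the
# recorded size in bytes and the recorded MD5 checksum.
verify_file <- function(path, size, md5) {
  file.exists(path) &&
    file.size(path) == size &&
    identical(unname(tools::md5sum(path)), md5)
}

verify_file(path, recorded_size, recorded_md5)
```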
Because iris.csv is tabular, we create an entity of type dataTable:
dataTable <- new("dataTable",
                 entityName = "iris.csv",
                 entityDescription = "Edgar Anderson's Iris data exported from R",
                 physical = physical,
                 attributeList = attributeList)
Note that the attributeList we created before gets entered directly into the dataTable entity.
Now that we have everything entered into R, we can create the eml object:
dataset <- new("dataset",
title = title,
creator = edgar,
pubDate = pubDate,
intellectualRights = intellectualRights,
abstract = abstract,
keywordSet = keywordSet,
coverage = coverage,
contact = contact,
methods = methods,
dataTable = dataTable)
## Warning: Person Edgar Anderson <edgaranderson@iris.net> was not given any
## role.
eml <- new("eml",
           packageId = paste0("urn:uuid:", uuid::UUIDgenerate()), # note the colon after "uuid"
           system = "uuid",
           dataset = dataset)
Now that our eml object is created, we can save it:
write_eml(eml, "eml.xml")
We should also validate the file:
eml_validate("eml.xml")
## [1] TRUE
## attr(,"errors")
## character(0)
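The value printed above is a logical with an "errors" attribute. As a sketch of how a script might act on it, the code below uses a stand-in object with the same shape rather than calling eml_validate; summarise_validation is a hypothetical helper.

```r
# Stand-in for the value returned by eml_validate("eml.xml") above.
result <- structure(TRUE, errors = character(0))

# Hypothetical helper: turn the validation result into a readable message.
summarise_validation <- function(result) {
  errors <- attr(result, "errors")
  if (isTRUE(as.logical(result)) && length(errors) == 0) {
    "Record is valid"
  } else {
    paste(c("Record is invalid:", errors), collapse = "\n")
  }
}

summarise_validation(result)

# A made-up failing result, to show the other branch.
failed <- structure(FALSE, errors = "hypothetical error message")
cat(summarise_validation(failed), "\n")
```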