Preserve folder structures
Sometimes, researchers will upload zip files that contain a nested file and folder structure that we would like to maintain. This reference section will walk you through how to re-upload the contents of the zip file to the Arctic Data Center such that the files and folders are preserved. Note that changing the locations of the files within the package can be tricky, so take these steps with care and check your work at each stage.
With that, here are the steps assuming that the PI has uploaded one zip file to their dataset that holds all their files organized in their desired file hierarchy. You may need to modify the steps for other scenarios, but if you are not sure, feel free to ask the data coordinator. In particular, tar files (.tgz) or rar files (.rar) are also compressed archives that might be better to unpack using command line tools.
Download the zip file to datateam
First, we will download the file, using R.
- Navigate to the dataset landing page on the Arctic Data Center
- Right click the “Download” button next to the zip file
- Select “Copy Link Address”
- Run the following two lines of code to set the URL variable, and extract the pid
Here is an example on the test site. Note that on production, you will need to change the URL that you are substituting in the second line of code.
library(magrittr)
url <- "https://test.arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A48c0e669-fd8a-4875-acfd-e8933bb350ed"
pid <- gsub("https://test.arcticdata.io/metacat/d1/mn/v2/object/", "", url) %>% gsub("%3A", ":", .)
- Download the file
Note that this will download the file to the location you specify as the second argument in writeBin. If you are organizing your scripts and data by submitter, it would look like the example below.
- Unzip the file into your submitter directory.
- Delete the zip file (example.zip)
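The download, unzip, and delete steps above can be sketched as follows. This is a minimal sketch rather than the definitive procedure: `pid_from_url` and `fetch_and_unzip` are hypothetical helper names, the paths follow the `~/submitter/data` example, and `d1c` is assumed to be a `D1Client` you have already created for the node that serves the URL.

```r
# Hypothetical helper: recover the PID from a copied download link.
# basename() keeps the last URL segment; URLdecode() turns %3A back into ":".
pid_from_url <- function(url) {
  utils::URLdecode(basename(url))
}

# Hypothetical helper: download the zip, unpack it, then delete the zip.
# Assumes `d1c` is an existing dataone D1Client for the node holding the object.
fetch_and_unzip <- function(d1c, pid, zip_path, exdir = dirname(zip_path)) {
  writeBin(dataone::getObject(d1c@mn, pid), zip_path)  # save the raw bytes to disk
  utils::unzip(zip_path, exdir = exdir)                # unpack, keeping the folders
  file.remove(zip_path)                                # the zip itself is no longer needed
  list.files(exdir, recursive = TRUE)                  # report what ended up on disk
}

url <- "https://test.arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A48c0e669-fd8a-4875-acfd-e8933bb350ed"
pid <- pid_from_url(url)

# Not run here, since it needs a live session on datateam:
# fetch_and_unzip(d1c, pid, "~/submitter/data/example.zip")
```

Deleting the zip right after unpacking keeps it out of the file listing used later to build the package.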
Now if you look at the directory, you should see the unzipped contents of the file in a sub-directory of ~/submitter/data. The name of the directory will be the name of the folder the PI created the archive from. In this case, that folder is titled final_image_set.
Right now, you should stop and examine each file in the directory closely (or each type of file). You may need to make some minor adjustments or ask for clarification from the PI. For example, we may still need to ask for CSV versions of Excel files, or you may need to re-zip certain directories (for example, a zip which contains 5 different sets of shapefiles should be turned into 5 different zips). Evaluate the contents of the directory alongside the data coordinator.
If the files are not already in Metacat, you can also upload the zip folder into the datateam server. The main point here is that we want to have the files on the datateam server so that we can use R to add them to the data package, and we want to make sure the files are organized in a way that makes sense before we do that.
Re-upload the contents to the Arctic Data Center
Once you have confirmed everything is all good, we can upload the files to the ADC while preserving the directory structure.
First, get the data package loaded into your R session as usual. I recommend not attempting any EML edits while you do these steps. This update of adding the files is best done on its own, and EML edits can be done on the next version.
dp <- getDataPackage(d1c, identifier = resourceMapId, lazyLoad = TRUE, quiet = FALSE) # Gather data package
Next, we will set up two types of paths describing each of the objects. The first will be an absolute path, so we can be sure the R function finds the files. The second will be a relative path, which will be what shows up on the landing page (or in the “Download All” result) of the data package.
If you don’t know the difference between an absolute and relative path, read on. It is SUPER IMPORTANT!
A path is a location of a file or folder on a computer. There are two types of paths in computing: absolute paths and relative paths.
Absolute paths always start with the root of your file
system and locate files from there. The absolute path to my example
submitter zip is: /home/jclark/submitter/data/example.zip.
The generic shortcut ~/ is often used to replace the
location of your home directory (/home/username) to save
typing, but your path is still an absolute path if it starts with
~/. Note that an absolute path will always
start with either ~/ or /.
Relative paths start from some location in your file system
that is below the root. Relative paths are combined with the path of
that location to locate files on your system. R (and some other
languages like MATLAB) refer to the location where the relative path
starts as our working directory. If our working directory is set to
~/submitter, the relative path to the zip would be just
data/example.zip. Note that a relative path will
never start with either ~/ or
/.
Getting these paths right is very important because we don’t want submitters to download a folder of data, and have the paths look like /home/internname/ticket_27341/important_folder/important_file.csv. The first part of that absolute path is particular to however the person processing the ticket organized the data, and is not how the submitter of the data intended to organize the data. Follow the steps below to make sure this does not happen.
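To make the distinction concrete, here is a small sketch using the example paths from this section. The helper name `is_absolute_path` is hypothetical, and the rule it encodes (a leading `/` or `~`) assumes a Unix-style file system like the one on datateam.

```r
# Hypothetical helper: a path is absolute if it starts at the root (/)
# or at the home-directory shortcut (~).
is_absolute_path <- function(path) {
  grepl("^(/|~)", path)
}

abs_zip  <- "/home/jclark/submitter/data/example.zip"
home_zip <- "~/submitter/data/example.zip"
rel_zip  <- "data/example.zip"

# TRUE for the first two paths, FALSE for the relative one
is_absolute_path(c(abs_zip, home_zip, rel_zip))

# A relative path only locates a file once it is combined with the
# working directory it is relative to.
working_dir <- "/home/jclark/submitter"
combined <- file.path(working_dir, rel_zip)
```

Here `combined` is the same string as `abs_zip`, which is exactly what R does behind the scenes when you hand it a relative path.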
- Get a list of absolute paths for each file in the directory. NOTE: The “PI_dir_name” here represents whatever directory you retrieved after running unzip in the previous step. The actual .zip file should not be in this directory. In our example, this “PI_dir_name” is final_image_set.
- Get a list of relative paths for each file in the directory. Note this is the same command, but with the argument full.names set to FALSE.
abs_paths <- list.files("~/submitter/data", full.names = TRUE, recursive = TRUE)
rel_paths <- list.files("~/submitter/data", full.names = FALSE, recursive = TRUE)
Make sure that these paths look correct! They should contain ONLY the
files that were unzipped. If you have other scripts or metadata files
you might want to rearrange your directories to get the correct paths.
The relative paths should start with the submitter’s directory
name. In this example, that submitter’s top-level directory is titled
final_image_set, so they will look like the names
below:
“final_image_set/level1.png” “final_image_set/photos/level2_1.png” “final_image_set/photos/level2_2.png”
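If you want to try the two `list.files` calls without touching real data, the sketch below recreates the example layout in a temporary directory and runs a few sanity checks. The temporary directory and the empty placeholder files are assumptions made purely for illustration.

```r
# Recreate the example layout in a throwaway directory.
root <- tempfile("submitter_data_")
dir.create(file.path(root, "final_image_set", "photos"), recursive = TRUE)
file.create(file.path(root, "final_image_set", "level1.png"))
file.create(file.path(root, "final_image_set", "photos", "level2_1.png"))
file.create(file.path(root, "final_image_set", "photos", "level2_2.png"))

# Same command twice; only full.names differs.
abs_paths <- list.files(root, full.names = TRUE, recursive = TRUE)
rel_paths <- list.files(root, full.names = FALSE, recursive = TRUE)

# Sanity checks worth running before building any DataObjects:
stopifnot(length(abs_paths) == length(rel_paths))   # one relative path per file
stopifnot(all(file.exists(abs_paths)))              # R can find every file
stopifnot(all(endsWith(abs_paths, rel_paths)))      # the two vectors line up
stopifnot(!any(grepl("^(/|~)", rel_paths)))         # nothing absolute leaks through
```

Because both vectors come from the same directory in the same order, element `i` of `abs_paths` and element `i` of `rel_paths` describe the same file.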
Now for each of these files, we can create a dataObject for them and add them to the package using a loop. Before running this, look at the values of your abs_paths and rel_paths and make sure they look correct based on what you know about both paths and the structure of the directory. Within this loop, we will also create otherEntities for each item, just putting in the bare minimum of information that will help us make sure that we know what files are what. If you’re more familiar with EML, you can add more information to the otherEntity section or even add them in as dataTables, spatialVectors, or spatialRasters, but the main point here is to make sure that the relationships are correct and the files show up on the landing page in a way that makes sense.
metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") # Get metadata PID
doc <- read_eml(getObject(d1c@mn, metadataId)) # Read in metadata EML file
oes <- list()
for (i in 1:length(abs_paths)) {
formatId <- arcticdatautils::guess_format_id(abs_paths[i])
id <- generateIdentifier(d1c@mn, scheme = "uuid")
dataObj <- new("DataObject", format = formatId, filename = abs_paths[i], targetPath = rel_paths[i], id = id)
dataObj@sysmeta@replicationAllowed <- FALSE # This is needed to prevent replication for the ADC
dp <- addMember(dp, dataObj, metadataId)
oes[[i]] <- eml$otherEntity(entityName = rel_paths[i], entityType = formatId, id = id) # Can add entityDescription in this command also
}
doc$dataset$otherEntity <- NULL # Removing all otherEntities in the data package, assuming there are only ZIP files in the dp
doc$dataset$otherEntity <- oes # Adding new otherEntities to section
eml_validate(doc)
write_eml(doc, "~/metadata.xml")
dp <- replaceMember(dp, metadataId, replacement="~/metadata.xml")
Once this is finished you can examine the relationships by running View(dp@relations$relations). If everything worked out well you should see rows that look like this:
urn:uuid:a398312d-2c87-4b19-8380-3f11d2a1922d | http://www.w3.org/ns/prov#atLocation | figure2/figure2.tif |
These columns represent the “subject”, “predicate”, and “object” of the relationship, respectively. The “subject” is the PID of the data object, the “predicate” is the type of relationship (in this case, it is saying that the data object is located at a certain path), and the “object” is the relative path to the file in the data package. Note that the “object” should be just the relative path to the file, and should not contain any absolute paths.
The targetPath (3rd column) should NOT look like
urn:uuid:a398312d-2c87-4b19-8380-3f11d2a1922d | http://www.w3.org/ns/prov#atLocation | /home/jclark/ArcticSupport/Zhao/dataset1/figure2/figure2.tif |
Make sure this looks correct, then update the data package with:
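A sketch of that check-then-update step, mirroring the upload call used in the Summary section. The helper names `absolute_target_paths` and `upload_if_paths_ok` are hypothetical, the column names `subject`, `predicate`, and `object` are assumed to match what `View(dp@relations$relations)` shows, and `d1c` and `dp` are assumed to already exist in your session.

```r
at_location <- "http://www.w3.org/ns/prov#atLocation"

# Hypothetical helper: return the atLocation rows whose path is absolute.
absolute_target_paths <- function(relations) {
  locs <- relations[relations$predicate %in% at_location, , drop = FALSE]
  locs[grepl("^(/|~)", locs$object), , drop = FALSE]
}

# Stand-in for dp@relations$relations, built from the two example rows.
relations <- data.frame(
  subject   = rep("urn:uuid:a398312d-2c87-4b19-8380-3f11d2a1922d", 2),
  predicate = rep(at_location, 2),
  object    = c("figure2/figure2.tif",
                "/home/jclark/ArcticSupport/Zhao/dataset1/figure2/figure2.tif"),
  stringsAsFactors = FALSE
)
bad <- absolute_target_paths(relations)  # only the second row is flagged

# Hypothetical wrapper: refuse to upload while any absolute path remains.
upload_if_paths_ok <- function(d1c, dp) {
  bad <- absolute_target_paths(dp@relations$relations)
  if (nrow(bad) > 0) {
    stop("Absolute target paths found: ", paste(bad$object, collapse = ", "))
  }
  myAccessRules <- data.frame(subject = "CN=arctic-data-admins,DC=dataone,DC=org",
                              permission = "changePermission")
  dataone::uploadDataPackage(d1c, dp, public = FALSE,
                             accessRules = myAccessRules, quiet = FALSE)
}

# Not run here, since it needs a live session:
# packageId <- upload_if_paths_ok(d1c, dp)
```

Stopping before the upload is deliberate: fixing a path locally is much easier than correcting it after a new version of the package has been published.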
Finally, check your work. Go to the Arctic Data Center and see if the package displays correctly. Edit the package using the user interface to remove the zip file. Continue with metadata updates as normal.
Summary
Here is the example code all put together. Make sure you change all of the relevant bits, and check your work carefully!!
### Setup
library(dataone)
library(datapack)
library(EML)
library(magrittr)
d1c <- dataone::D1Client("STAGING", "urn:node:mnTestARCTIC") # Setting the Member Node; it must be the node that serves the URL below
### Downloading ZIP file
url <- "https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A8fee5046-1a8f-4ccc-80f2-70c557a66338"
pid <- gsub("https://arcticdata.io/metacat/d1/mn/v2/object/", "", url) %>% gsub("%3A", ":", .)
writeBin(getObject(d1c@mn, pid), "~/submitter/data/example.zip")
unzip("~/submitter/data/example.zip", exdir = "~/submitter/data")
file.remove("~/submitter/data/example.zip") # Delete the zip so list.files() below does not pick it up
### Re-uploading contents to data package
dp <- getDataPackage(d1c, identifier = resourceMapId, lazyLoad = TRUE, quiet = FALSE) # Gather data package
##### Gather Metadata EML
metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") # Get metadata PID
doc <- read_eml(getObject(d1c@mn, metadataId)) # Read in metadata EML file
##### Get paths
abs_paths <- list.files("~/submitter/data", full.names = TRUE, recursive = TRUE)
rel_paths <- list.files("~/submitter/data", full.names = FALSE, recursive = TRUE)
oes <- list()
for (i in 1:length(abs_paths)) {
formatId <- arcticdatautils::guess_format_id(abs_paths[i])
id <- generateIdentifier(d1c@mn, scheme = "uuid")
dataObj <- new("DataObject", format = formatId, filename = abs_paths[i], targetPath = rel_paths[i], id = id)
dataObj@sysmeta@replicationAllowed <- FALSE # This is needed to prevent replication for the ADC
dp <- addMember(dp, dataObj, metadataId)
oes[[i]] <- eml$otherEntity(entityName = rel_paths[i], entityType = formatId, id = id) # Can add entityDescription in this command also
}
doc$dataset$otherEntity <- NULL # Removing otherEntity of zip file
doc$dataset$otherEntity <- oes # Adding otherEntities to section
### Validate and save EML
eml_validate(doc)
write_eml(doc, "~/metadata.xml")
### Upload Dataset
dp <- replaceMember(dp, metadataId, replacement="~/metadata.xml") # Replace metadata file
myAccessRules <- data.frame(subject="CN=arctic-data-admins,DC=dataone,DC=org", permission="changePermission")
packageId <- uploadDataPackage(d1c, dp, public=FALSE, accessRules=myAccessRules, quiet=FALSE)
One note: if you already have files in your data package you want to keep in your dataset along with the ZIP file contents you’re adding, you can run doc$dataset$otherEntity <- list(doc$dataset$otherEntity, oes) so that you’ll just be adding your new entities instead of replacing the older ones.
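When combining old and new entities, it is worth checking the shape of the result before validating. The sketch below uses plain R lists as stand-ins for EML otherEntity elements; `as_entity_list` and the file name `kept_file.csv` are hypothetical, and whether a nested list validates for your document is something to confirm with `eml_validate(doc)` rather than assume.

```r
# Plain lists standing in for otherEntity elements.
existing <- list(list(entityName = "kept_file.csv"))  # already in the package
oes <- list(list(entityName = "final_image_set/level1.png"),
            list(entityName = "final_image_set/photos/level2_1.png"))

nested <- list(existing, oes)  # two elements, each itself a list of entities
flat   <- c(existing, oes)     # three elements, one entity per element

# A single existing entity may be stored as one named list rather than a
# list of entities; this hypothetical helper wraps it so c() stays flat.
as_entity_list <- function(x) {
  if (is.null(names(x))) x else list(x)
}
single <- list(entityName = "kept_file.csv")
flat_from_single <- c(as_entity_list(single), oes)
```

Checking `length()` and the `entityName` of each element is a quick way to confirm that every file ended up as its own entity.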
Data packages with multiple ZIP files
If there are multiple zip files in the data package, you will just need to make sure that all the zip files are unzipped in the same directory, and that the list.files command is pointed to that directory. The loop will then run through all the files in all the unzipped folders and add them to the data package with their relative paths. Just make sure to check your work carefully, and make sure the paths look correct before you upload the package.
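One way to unpack several archives into a single directory is sketched below. The helper name `unzip_all` and both directory arguments are hypothetical; the sketch assumes the zip files have already been downloaded into `zip_dir`.

```r
# Hypothetical helper: unzip every .zip found in zip_dir into exdir.
# Keeping the archives in a separate directory from exdir means they
# will not show up when list.files() is later pointed at exdir.
unzip_all <- function(zip_dir, exdir) {
  zips <- list.files(zip_dir, pattern = "\\.zip$", full.names = TRUE,
                     ignore.case = TRUE)
  for (z in zips) {
    utils::unzip(z, exdir = exdir)  # each archive keeps its own top-level folder
  }
  invisible(zips)                   # return the archives that were processed
}

# Not run here; the paths are placeholders following the earlier example:
# unzip_all("~/submitter/zips", "~/submitter/data")
# abs_paths <- list.files("~/submitter/data", full.names = TRUE, recursive = TRUE)
# rel_paths <- list.files("~/submitter/data", full.names = FALSE, recursive = TRUE)
```

If two archives share a top-level folder name, their contents will be merged on disk, so check the resulting paths before building the package.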
Adjusting the folder hierarchy of files already in the data package
If you need to adjust the folder hierarchy of files that are already in the data package, you can do that by changing the dp@relations$relations data frame. As discussed above, the “object” column of the rows with the predicate http://www.w3.org/ns/prov#atLocation contains the relative paths to the files in the data package. You can add new rows to this data table and add these paths in the “object” column with the PID of the data object in the “subject” column to adjust the folder hierarchy. Just make sure to check your work carefully, and make sure the paths look correct before you upload the package.
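The edit described above can be sketched on a plain data frame before you touch a real package. Everything here is illustrative: `set_target_path` is a hypothetical helper, the PIDs are made up, and the data frame stands in for dp@relations$relations with only the three columns discussed in this section. A real relations table may carry additional columns, which is why a new row is copied from an existing atLocation row when one is available.

```r
at_location <- "http://www.w3.org/ns/prov#atLocation"

# Hypothetical helper: set the relative path for one PID, updating the
# existing atLocation row if there is one and adding a row otherwise.
set_target_path <- function(relations, pid, new_path) {
  stopifnot(!grepl("^(/|~)", new_path))          # only relative paths allowed
  is_loc <- relations$predicate %in% at_location
  hit <- is_loc & relations$subject %in% pid
  if (any(hit)) {
    relations$object[hit] <- new_path
    return(relations)
  }
  # Copy an existing atLocation row as a template so any extra columns
  # match; fall back to a row of NAs when none exists.
  template <- if (any(is_loc)) {
    relations[which(is_loc)[1], , drop = FALSE]
  } else {
    relations[NA_integer_, , drop = FALSE]
  }
  template$subject <- pid
  template$predicate <- at_location
  template$object <- new_path
  out <- rbind(relations, template)
  rownames(out) <- NULL
  out
}

# Made-up PIDs and paths for illustration only.
relations <- data.frame(
  subject   = c("urn:uuid:aaa", "urn:uuid:bbb"),
  predicate = c(at_location, at_location),
  object    = c("figure2/figure2.tif", "figure3.tif"),
  stringsAsFactors = FALSE
)
updated <- set_target_path(relations, "urn:uuid:bbb", "figure3/figure3.tif")  # move a file
updated <- set_target_path(updated, "urn:uuid:ccc", "figure4/figure4.tif")    # add a path
```

On a real package you would assign the result back to dp@relations$relations, as described above, and then inspect it with View() before uploading.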