Common Dataset Organization and Policy

This document describes the format and policies governing the shared datasets at the USDF. These datasets reside on filesystem space that is available on the login nodes and on all of the compute nodes of the Batch Systems.
Datasets covered by this policy include raws, calibration files, and refcats. “Refcats” refers to reference catalogs that are used for calibration (astrometric or photometric). Other types of catalogs may be used as references (e.g. DC2 truth tables) but will be referred to as external catalogs.
File Paths
The following file paths contain shared datasets:
- /sdf/group/rubin/datasets (henceforth /datasets for short) is a symlink to /sdf/data/rubin/shared/ncsa-datasets, containing datasets previously stored at NCSA under the now-defunct /datasets path (i.e. the /datasets symlink no longer exists at S3DF).
- /sdf/group/rubin/shared (henceforth /shared for short) is a symlink to /sdf/data/rubin/shared and is the preferred path for new shared datasets, as well as for migrating older datasets.
- /sdf/group/rubin/user is a symlink to /sdf/data/rubin/user and contains user home directories. Shared datasets may reside here temporarily for prototyping but should be moved to /shared once they start being used by multiple users.
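The alias mapping above can be sketched as a small lookup table. This is purely illustrative: the table hard-codes the symlink targets documented here rather than querying the filesystem, and physical_path is a hypothetical helper, not part of any USDF tooling.

```python
from pathlib import PurePosixPath

# Symlink targets as documented at S3DF (illustrative lookup, not a live query).
ALIASES = {
    "/sdf/group/rubin/datasets": "/sdf/data/rubin/shared/ncsa-datasets",
    "/sdf/group/rubin/shared": "/sdf/data/rubin/shared",
    "/sdf/group/rubin/user": "/sdf/data/rubin/user",
}

def physical_path(path: str) -> str:
    """Rewrite a /sdf/group/rubin alias to its documented physical location."""
    p = PurePosixPath(path)
    for link, target in ALIASES.items():
        link_p = PurePosixPath(link)
        if p == link_p or link_p in p.parents:
            # Re-root the part of the path below the symlink onto the target.
            return str(PurePosixPath(target) / p.relative_to(link_p))
    return path  # not under a known alias; return unchanged
```

For example, physical_path("/sdf/group/rubin/datasets/hsc") resolves to the ncsa-datasets area, while paths outside the three aliases pass through untouched.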
Policy

New shared datasets should be added to /shared.
Any additions or changes to datasets to be included in a shared butler and/or used in regular (re-)processing must have a corresponding RFC.
Other datasets must include an implementation ticket.
The RFC and/or implementation ticket should contain information about:

- Description of and reason for the addition/change/deletion
- Target top-level directory for the addition/change/deletion
- Organization of the data
- Required disk space
- Other necessary domain knowledge relating to the contents of the data, as identified by project members
External datasets not yet used in regular reprocessing should have a corresponding Jira ticket with similar information.
All newly added datasets, including external datasets, must follow the guidelines for supplying a README file. Updates to the README should be reviewed on subsequent Jira tickets.
Requests for new shared directories should be emailed to usdf-help@slac.stanford.edu. Members of the rubinmgr group will handle these requests, including having quotas applied. Requesting users are often given initial ownership of the shared directory and are responsible for setting appropriate permissions. If the shared dataset needs central curation, ownership may be set to rubinmgr after it is initially populated. More sophisticated options, such as granting temporary unlocks for modification or permanently allowing curation by a group of users, are available on request.
Format

Most data in /datasets adheres to the following Gen2 format conventions (caps are tokens):

/datasets/<camera>/[REPO|RERUN|PREPROCESSED|SIM|RAW|CALIB] | /datasets/REFCATS

where

- REPO = repo (butler root)
- SIM = <ticket>_<date>/ | <user>/<ticket>/
- CALIB = calib/<date> (ex. master20161025)
- RAW = raw/<survey-name>/ (where actual files live)
- REFCATS = refcats/<type>/<version>/<label> (ex. astrometry_net_data/sdss-dr8, htm/v1/gaia_DR1_v1)
The datasets still in use have been ingested via symlink to current Gen3 Butler repositories, and users generally will not need to interact with them.
Additional legacy datasets may reside under the RERUN and PREPROCESSED tags, as well as under /datasets/all-sky.
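As a rough illustration, the layout conventions above can be expressed as path patterns. The regexes below are a sketch covering only some of the tokens (REPO, CALIB, RAW, and the three-level REFCATS form); they are not an official validator, and the legacy two-level refcat form (e.g. astrometry_net_data/sdss-dr8) is deliberately not matched.

```python
import re

# Rough patterns for the documented Gen2 layout; camera, survey, and label
# spellings are unconstrained here, so this checks shape, not content.
GEN2_PATTERNS = [
    re.compile(r"^/datasets/[^/]+/repo(/|$)"),                 # REPO
    re.compile(r"^/datasets/[^/]+/calib/[^/]+(/|$)"),          # CALIB = calib/<date>
    re.compile(r"^/datasets/[^/]+/raw/[^/]+(/|$)"),            # RAW = raw/<survey-name>/
    re.compile(r"^/datasets/refcats/[^/]+/[^/]+/[^/]+(/|$)"),  # REFCATS = refcats/<type>/<version>/<label>
]

def matches_gen2_layout(path: str) -> bool:
    """Return True if the path follows one of the sketched Gen2 conventions."""
    return any(pattern.match(path) for pattern in GEN2_PATTERNS)
```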
Reference Catalogs

Gen2 reference catalogs in /datasets were ingested into a version subdirectory (e.g. v0/, v1/) matching the REFCAT_FORMAT_VERSION set by the refcat ingestion task. New refcats should follow the policies to be detailed in DM-31704.
Here is a template for what each refcat’s readme should contain:
Reference Catalog: Example
##########################
Sky coverage: full sky (or give ra/dec range)
Number of sources: 1,234,567
Magnitude range: 10 - 20 (G magnitude)
Disk space: 100 GB
Original data: https://www.example.com/DataRelease9000
Jira ticket or Epic: https://rubinobs.atlassian.net/browse/DM-Example
Jira acceptance RFC: https://rubinobs.atlassian.net/browse/RFC-Example
Contact: Example name, email@example.com, Slack: examplename
This is a brief paragraph summarizing this reference catalog.
Citations/acknowledgements
==========================
Users of this reference catalog should follow the citation and
acknowledgement instructions from this website:
https://www.example.com/citations
Catalog creation
================
This catalog was created by following the instructions on this page:
https://pipelines.lsst.io/modules/lsst.meas.algorithms/creating-a-reference-catalog.html
The configuration that was used to ingest the data is included in this
directory as `IngestIndexedReferenceTask.py`.
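One way to keep refcat READMEs uniform is to fill the template above programmatically. The helper below is hypothetical (nothing like it is mandated by this policy); its field names mirror the template, and the heading underline is sized to match the title line.

```python
# Hypothetical helper that fills the refcat README template from this page.
README_TEMPLATE = """\
Reference Catalog: {name}
{underline}

Sky coverage: {coverage}
Number of sources: {n_sources:,}
Magnitude range: {mag_range}
Disk space: {disk_space}
Original data: {data_url}
Jira ticket or Epic: {ticket_url}
Jira acceptance RFC: {rfc_url}
Contact: {contact}

{summary}
"""

def render_readme(name: str, **fields) -> str:
    """Render a refcat README; the '#' underline matches the title width."""
    underline = "#" * len(f"Reference Catalog: {name}")
    return README_TEMPLATE.format(name=name, underline=underline, **fields)
```

The ":," format spec inserts thousands separators into the source count, matching the "1,234,567" style shown in the template.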
Butler Ingest
Shared datasets to be ingested to shared Gen3 Butler repositories should follow established conventions (also to be clarified in DM-31704).
Existing repos generally contain instrument-specific datasets in a collection prefixed by the instrument name (e.g. HSC/raw). Instrument-agnostic datasets may be prefixed by a relevant name, e.g. injection for source injection datasets or pretrained_models. External datasets should be included with an external prefix, e.g. external/catalogs or external/imaging.
The RFC/ingestion ticket should determine whether external datasets need corresponding dimensions.
For example, a multi-band, multi-instrument catalog covering a small area like COSMOS needs no dimensions, whereas larger catalogs may benefit from HTM spatial sharding. Pre-processed images could benefit from instrument and filter dimensions; best practices for dataset type specification and spatial sharding are TBD.
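The collection-naming conventions in this section can be summarized in a small sketch. The prefixes (instrument names, "external") come from this document, but shared_collection itself is a hypothetical helper, not a Butler API.

```python
# Sketch of the collection-naming conventions described above.
def shared_collection(kind, name, instrument=None):
    """Build a collection name for a shared dataset.

    kind is one of:
      "instrument" - instrument-specific data, prefixed by the instrument name
      "agnostic"   - instrument-agnostic data, `name` is the relevant prefix
      "external"   - external datasets, prefixed by "external"
    """
    if kind == "instrument":
        if not instrument:
            raise ValueError("instrument-specific datasets need an instrument name")
        return f"{instrument}/{name}"
    if kind == "external":
        return f"external/{name}"
    if kind == "agnostic":
        return name
    raise ValueError(f"unknown dataset kind: {kind!r}")
```

For instance, shared_collection("instrument", "raw", instrument="HSC") yields the HSC/raw style used in existing repos, and shared_collection("external", "catalogs") yields external/catalogs.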
README Guidelines

The ticket creator is responsible for butler-ization of the dataset (or for delegating that responsibility). Responsibility for maintaining usable datasets is a DM-wide effort.
Regardless of the reason for the RFC (implementation or maintenance), as part of implementing the RFC, any relevant information from it should be transferred to a README.txt file at the root level of the dataset. There is no limit to how much information can be put in README.txt, but at a minimum it should contain:
- A description of the instrument and observatory that produced the data
- The intended purpose of the dataset
- At least a high-level summary of the selection criteria for the dataset
- The primary point of contact for questions about the dataset (name is sufficient, but email would be appreciated)
- If preprocessed, a description of the preprocessed data products available
- If a subset is preprocessed, a description of how the subset was created (and why)
For butler repository datasets, the root level is the directory just above the butler repository, e.g. /datasets/hsc/README.txt. For reference catalogs, there should be one README.txt for all reference catalogs of a particular type, e.g. /datasets/refcats/htm/README.txt, with a brief description of the available reference catalogs of that type. Separately, each reference catalog should also contain a README.txt with details about that reference catalog's contents.
See datasets_reference-catalogs_usdf for a template for the contents of those respective README files.