DMP-5: Provenance (data traceability)
The concept
Data will include provenance metadata indicating the origin and processing history of raw observations and derived products, to ensure full traceability of the product chain.
Related terms: Provenance, Traceability.
Category: Usability
Explanation of the principle
Provenance information is provided as part of the metadata of the data. Some provenance information can be captured automatically by the tools involved in each process step and accumulated in the metadata of the resulting dataset. Ideally, a process step will inherit the provenance of its data sources and add information about the current process step. Other elements of provenance can be captured manually, including the names of the parties that created, updated or maintained the dataset.
Provenance is considered a complement to data quality information. In the absence of quantitative information about the uncertainties of the data, expert users can infer data quality estimates from the uncertainties of the sources and from their confidence in the process steps applied. In addition, the reputation of the parties responsible for the source datasets and for the result can increase confidence in the dataset and give an indication of its uncertainties.
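The inheritance described above can be illustrated with a minimal Python sketch. The `Dataset` and `ProcessStep` classes, their fields, and the `derive` function are hypothetical names chosen for illustration; they do not come from the guidelines or from any metadata standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ProcessStep:
    """One entry in a processing history (hypothetical structure)."""
    description: str
    algorithm: str
    version: str
    responsible_party: str


@dataclass
class Dataset:
    """A dataset carrying its own provenance (hypothetical structure)."""
    identifier: str
    sources: List[str] = field(default_factory=list)          # identifiers of direct inputs
    lineage: List[ProcessStep] = field(default_factory=list)  # ordered processing history


def derive(identifier: str, inputs: List[Dataset], step: ProcessStep) -> Dataset:
    """Create a derived dataset that inherits the lineage of its inputs
    and appends the current process step."""
    inherited: List[ProcessStep] = []
    for dataset in inputs:
        for past_step in dataset.lineage:
            if past_step not in inherited:  # do not duplicate shared history
                inherited.append(past_step)
    return Dataset(
        identifier=identifier,
        sources=[dataset.identifier for dataset in inputs],
        lineage=inherited + [step],
    )
```

For instance, deriving a calibrated product from a raw dataset and then a mosaic from the calibrated product would leave the mosaic with two lineage entries, the inherited calibration step followed by the mosaicking step.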
The accessibility of the original data source’s metadata and processing algorithm descriptions is also a metric of the usability of the provenance information. If provenance describes sources and processing tools that are neither available nor documented, the information cannot be effective in the end.
Provenance can help users identify a problem in a basic dataset or improve it. Provenance information about other products can help to identify which products were derived from the affected dataset. Provenance can help to recreate (or reproduce) the dataset when the problem in the basic dataset is fixed or an improved version is available. Provenance information can also be used to assess the homogeneity of a dataset series where some members of the series originated from sources with different time extents or from different versions of the processing algorithms.
Provenance can be provided at different levels, such as dataset series, dataset, feature, attribute type, attribute, etc. For example, this is useful to determine the source of features or even attribute values when a dataset is the result of merging features from different sources. Provenance at the dataset level is usually stored in the dataset metadata (which, in the case of GEOSS, is accessible through the Discovery and Access Broker), while provenance at the feature and attribute level is usually stored in the dataset itself as additional properties of the feature, so data access is required to obtain it.
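Identifying which products were derived from an affected dataset, as mentioned above, amounts to a traversal of the lineage graph. The following is a minimal sketch, assuming the provenance of each product has already been reduced to a mapping from product identifier to the identifiers of its direct sources; the function name and the mapping format are illustrative assumptions.

```python
from collections import deque
from typing import Dict, List, Set


def affected_products(sources_of: Dict[str, List[str]], faulty: str) -> Set[str]:
    """Return every dataset derived, directly or indirectly, from `faulty`.

    `sources_of` maps a product identifier to the identifiers of its
    direct sources (an assumed, simplified view of lineage metadata).
    """
    # Invert "product -> sources" into "source -> products".
    derived_from: Dict[str, List[str]] = {}
    for product, sources in sources_of.items():
        for source in sources:
            derived_from.setdefault(source, []).append(product)

    # Breadth-first traversal from the faulty dataset.
    affected: Set[str] = set()
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for product in derived_from.get(current, []):
            if product not in affected:
                affected.add(product)
                queue.append(product)
    return affected
```

The same result set is the list of products to regenerate once the basic dataset is fixed or an improved version becomes available.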
Guidance on Implementation, with Examples
1. Automatic metadata creation: Tools that create and manipulate the data should also produce provenance documentation automatically, to avoid omitting steps or documenting metadata incorrectly. Tools need to inherit the provenance from previous sources. References to algorithms and their versions need to be added.
2. Provenance metadata presence and completeness: Datasets should be tested for the presence of metadata about provenance information, which should include a clear sequential description of all sources, processing steps, and responsible parties.
3. Provenance metadata correctness: Ensure that data sources are documented using universal identifiers (too often, only local file names are recorded), ideally pointing to accessible sources; that processing algorithms are well maintained and accessible; and that responsible party information is current and points to a party that can be contacted.
4. Provenance Visualization: Provenance information can sometimes be very complex. Tools for interpreting provenance and generating graphs can enhance understanding.
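The presence and correctness checks in items 2 and 3 can be partly automated. The sketch below assumes a provenance record simplified to a dictionary with the hypothetical keys `sources`, `process_steps` and `responsible_parties`. The test for a "universal identifier" is a heuristic chosen for illustration (http(s) URLs, URNs and DOIs are accepted, bare file names and paths are rejected), not a rule taken from the guidelines.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical keys of a simplified provenance record.
REQUIRED_KEYS = ("sources", "process_steps", "responsible_parties")


def looks_universal(identifier: str) -> bool:
    """Heuristic: accept http(s) URLs, URNs and DOIs; reject local file names."""
    if identifier.lower().startswith("doi:"):
        return True
    parsed = urlparse(identifier)
    if parsed.scheme in ("http", "https"):
        return bool(parsed.netloc)
    return parsed.scheme == "urn"


def check_provenance(record: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if not record.get(key):
            problems.append(f"missing or empty: {key}")
    for source in record.get("sources", []):
        if not looks_universal(source):
            problems.append(f"source is not a universal identifier: {source}")
    return problems
```

A check of this kind only confirms that the expected elements exist and are well formed; whether the described steps are accurate still requires manual review.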
Metrics to measure level of adherence to the principle
1. Presence of information about data sources, process steps, and responsible parties in the metadata distributed with the data. This can be verified by checking the sources and process steps documented in the lineage model of ISO 19115 and ISO 19115-2 ("Geographic Information – Metadata") that the Discovery and Access Broker provides for each GEOSS resource.
2. The accessibility of the original data source’s metadata and processing algorithm descriptions is a metric of the usability of the provenance. For sources, this can be assessed by checking each source URI and determining whether the source is available for download.
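The second metric can be approximated as the fraction of documented source URIs that respond to a request. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and treating a successful HTTP HEAD request as "available for download" is a simplifying assumption (some servers reject HEAD requests or require authentication).

```python
import urllib.request
from typing import Callable, Iterable


def is_reachable(uri: str, timeout: float = 10.0) -> bool:
    """Return True if an HTTP HEAD request to `uri` succeeds."""
    try:
        request = urllib.request.Request(uri, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers malformed URIs.
        return False


def accessibility_ratio(
    source_uris: Iterable[str],
    probe: Callable[[str], bool] = is_reachable,
) -> float:
    """Fraction of source URIs for which `probe` returns True (0.0 if none given)."""
    uris = list(source_uris)
    if not uris:
        return 0.0
    return sum(1 for uri in uris if probe(uri)) / len(uris)
```

The `probe` parameter allows the network check to be replaced, for example by a catalogue lookup or by a stub in automated tests.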
Resource Implications of Implementation
Provenance is part of the metadata process, and its costs can be absorbed into that process. There are two associated costs:
1. Implementing automatic metadata procedures in the processing tools and processing chain;
2. Complementing the automatic tools with manual editing and review.
Text extracted from the Data Management Principles Implementation Guidelines