DMP-9: Data review and reprocessing
The concept
Data will be managed to perform corrections and updates in accordance with reviews, and to enable reprocessing as appropriate; where applicable this shall follow established and agreed procedures.
Related terms: Data Curation, Data Reprocessing, File Format, Format Conversion.
Category: Curation
Explanation of the principle
Curation normally [4] implies most, if not all, of the activities of DMPs 1 to 10. Thus its meaning as one of the five foundational elements of the DMPs is narrower than its usual meaning, focusing exclusively on activities beyond the appraisal/selection of data and data preservation (DMPs 7 & 8) and other activities intended to ensure discoverability (DMPs 1 & 4), accessibility (DMP 2), and usability (DMPs 3 to 6). In particular, it focuses on the correction, updating, and reprocessing of data records (DMP 9) and the use of unique and persistent identifiers (DMP 10).
Most data management planning ends with the ingestion of data and the processing and interpretation of raw data. But because data processors, who preserve the integrity and authenticity of the data, are well versed in software developments, advances in computing technology, and processing algorithms, the practice of extracting more and more information from the available data has emerged as a natural development. This coincides with the key “social” and “scientific” goals of providing data to distinct communities: long-term data sets and their usability by multiple stakeholders and communities. Combining such technological processes with scientific knowledge has added new essential, value-adding elements to data records, namely a) review (leading to corrections and updating) and b) reanalysis (with or without reprocessing, i) when new technologies, including new presentation formats, emerge, or ii) when data are reviewed by other communities using different processing tools).
Updates and corrections have increasingly become a major purpose of databases, as they facilitate comparisons between different sets of data (e.g. among in situ observations across regions, times, techniques, and investigators, as well as between in situ and remotely sensed observations). Updating and correcting processed data can be time consuming and resource intensive, and is constrained by time and by the interpreter choices made to meet user needs.
Reprocessing can produce data of higher quality (in particular, higher-fidelity images from multiple datasets across different categories of Earth observations) than those created during initial processing. Data reprocessing is often necessary and can include, for example, updating the instrument calibration to take account of current knowledge about sensor degradation and radiometric performance, or applying new knowledge in data-correction and/or derived-product algorithms. Reprocessing can also change the output file format. Format conversion or reformatting may be an additional and common consequence, though it is not necessarily linked to reprocessing.
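To make the mechanics concrete, the following is a minimal sketch of such a reprocessing step in Python. It assumes a hypothetical CalibrationModel and reprocess function; none of the names or values correspond to a real archive's interface. It shows archived raw counts being re-run through an updated calibration, the point at which a new output format could also be chosen.

```python
# Hypothetical sketch: archived raw counts are re-run through an updated
# calibration. CalibrationModel, reprocess and the sample values are
# illustrative assumptions, not a real archive's API.
from dataclasses import dataclass

@dataclass
class CalibrationModel:
    gain: float     # updated radiometric gain
    offset: float   # updated offset reflecting known sensor degradation
    version: str    # calibration version, to be recorded in output metadata

def reprocess(raw_values, cal):
    """Apply the current calibration to archived raw counts."""
    return [cal.gain * v + cal.offset for v in raw_values]

# Example: counts archived years ago, reprocessed with a newer calibration.
raw = [101.0, 98.5, 102.3]
cal = CalibrationModel(gain=0.98, offset=-1.2, version="v2015.1")
print(reprocess(raw, cal), cal.version)
# The result would then be written in the archive's current file format,
# which is where format conversion enters as a by-product of reprocessing.
```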
Guidance on Implementation, with Examples
Updates and corrections to submitted data sets are encouraged. Records of updates and corrections should be maintained, summaries of updates should be posted in the database, and users should be notified. Whether it is the provider's or the data curator's responsibility to ensure that the current data in the archive are identical to the data used in the most recent publications or current research is open to debate, but such responsibilities should be stated in data provision arrangements and be transparent to users. Corrections may spark debate (e.g. the July 2015 NOAA dataset corrections that called into question the hiatus or slowdown in 21st-century global temperature rise), but that should not prevent correction policies, methodologies, and results from being implemented and open to the designated communities.
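As an illustration of what maintaining records of corrections and notifying users could look like in practice, here is a small, hypothetical Python sketch of a correction log; the field names and the notification mechanism are assumptions, not a standard schema.

```python
# Hypothetical correction log kept alongside a data set. Field names and
# the print-based "notification" are illustrative assumptions only.
import datetime

correction_log = []

def record_correction(dataset_id, description, corrected_by, notify):
    """Append a correction entry and notify registered users."""
    entry = {
        "dataset": dataset_id,
        "date": datetime.date.today().isoformat(),
        "description": description,
        # provider or curator, per the responsibility stated in the
        # data provision arrangement
        "corrected_by": corrected_by,
    }
    correction_log.append(entry)
    for user in notify:
        print(f"notify {user}: {dataset_id} corrected ({description})")

record_correction("sst_monthly_v3", "bias adjustment to buoy records",
                  corrected_by="curator", notify=["user@example.org"])
```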
Reprocessing should be strongly considered when: 1) the quality of the end product does not meet the objectives of the designated community and technology (whether new or from another community) is available to improve it; 2) the data were processed with different objectives, or with objectives appropriate only at the time of processing; 3) the acquisition of more data in adjoining areas or in the same area (with new parameters or of a new type) necessitates reanalysis; 4) new techniques and processing steps are better suited to the problem in the issue area; 5) new software is better suited for processing the data; and/or 6) new processing skills, experience, and knowledge offer improvements.
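One way to operationalize these six criteria is as an explicit checklist that a curator fills in when deciding whether to queue a collection for reprocessing. The sketch below is a hypothetical illustration; the criterion wordings are paraphrases of the list above.

```python
# Hypothetical checklist for the six reprocessing criteria above.
CRITERIA = [
    "quality below community objectives, and better technology exists",
    "original processing objectives no longer match current use",
    "new acquisitions in the same or adjoining areas require reanalysis",
    "new techniques better suited to the problem in the issue area",
    "new software better suited for processing the data",
    "new skills, experience or knowledge offer improvements",
]

def reprocessing_case(flags):
    """Return the criteria that apply; any one may justify reprocessing."""
    return [c for c, applies in zip(CRITERIA, flags) if applies]

# Example: criteria 1 and 5 apply to some hypothetical collection.
for reason in reprocessing_case([True, False, False, False, True, False]):
    print("-", reason)
```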
Reprocessing has limitations. It can strain resources, including time, personnel, and expertise, requiring more quality control, interpretation, and data handling, as well as additional computing resources. Dataset- or collection-specific limitations include software or hardware constraints (e.g. differences in processing systems and algorithms across data sets limit or enhance the ultimate quality), geographically or temporally bounded data sets whose data quality is too poor to be reprocessed with the new technologies, and cases where new reprocessing techniques cannot overcome errors made during acquisition. Ideally, reprocessing should deliver new data products that form part of a very long time series. At times, data reprocessing needs a preliminary proof-of-concept phase before it becomes a broader initiative or a consolidated policy. Communicating the strengths, limitations, and uncertainties of reprocessed observations and reanalysis data to the developer community and the wider research community, including new generations of researchers and decision-makers, is crucial for the further advancement of observational data records.
Metrics to measure level of adherence to the principle
Substantive metrics
Since usability is the main purpose of curation, metrics have traditionally been linked to citation metrics. Other metrics are also being considered (e.g. the US NAS analysis of indicators of STI activities in the US and abroad that NCSES should produce, and metrics on the socio-economic benefits of interdisciplinary data curation from the Use of Earth Observations [5]). Within GEO, CEOS has made an unprecedented effort to develop a roadmap with specificity, actionability, responsibility, and desired outcomes in terms of quantitative metrics for the ECVs, and there are ongoing exercises by the GEO BON Leipzig Center to provide metrics for the EBVs. Qualitative descriptions are also valuable and should not be abandoned; see, e.g., Conway et al., describing the impact of data curation on disasters, health, energy, climate, water, ecosystems, and agriculture [6]. Agreement on universal metrics may be difficult.
Process-based metrics
Does it make sense to create a metrics system (or scoreboard) based on whether institutional policies for updating, correction, and reprocessing are under study, in development, or already in place, similar to those in DMP 7? Or similar to the DCC data appraisal metrics?
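As a thought experiment only, such a scoreboard could be as simple as scoring each institutional policy on a staged scale. The stages and normalization in the Python sketch below are assumptions for illustration, not a proposed standard.

```python
# Hypothetical scoreboard: each policy is scored by its stage of maturity.
STAGES = {"none": 0, "under study": 1, "in development": 2, "in place": 3}

def adherence_score(policies):
    """Average policy stage, normalized to the 0..1 range."""
    return sum(STAGES[s] for s in policies.values()) / (3 * len(policies))

example = {"updating": "in place",
           "correction": "in development",
           "reprocessing": "under study"}
print(f"{adherence_score(example):.2f}")  # prints 0.67 for this example
```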
Resource Implications of Implementation
Both updating and correction, as well as reprocessing, are detailed, labor-intensive, time-consuming, and error-prone. Each reusable data set or collection requires reprocessing steps or techniques appropriate to that specific data set or group. Many variables affect the effectiveness of reprocessing, such as the reprocessing challenges at individual facilities (time, expertise, computing equipment, and the quality and completeness of reprocessing instructions) and change driven by technological evolution; because reprocessing requires precision, periodic retraining is needed to assure staff competence.
Reprocessing is still not considered strictly necessary in many areas. Climate-change-related observations are the paradigmatic data sets needing reprocessing, since a major difficulty in understanding past climate change is that most of the systems used to make the observations climate scientists now rely on were not designed with their needs in mind. Current observation-system requirements for climate monitoring and model validation, such as those specified by GCOS, which emphasize continuity and stability, are rarely aligned with the capabilities of historical observing systems. It is no surprise that the GEO 2009-2011 Work Plan has only one task specifically addressing reprocessing: CL-06-01a on Sustained Reprocessing and Reanalysis of Climate Data. Even in this area, the CEOS 2014-2016 Work Plan considers only data from the TOPEX/Poseidon mission, which ended in 2006 (VC-13), although it acknowledges (CMRS-3: Action Plan, first version) the need to create the conditions for delivering further climate data records from existing observational data by targeting processing gaps, shortfalls, and opportunities (e.g. cross-calibration, reprocessing).
Alternatives to reprocessing, such as OTFR (on-the-fly reprocessing), which generates new data products in real time, and other dynamic data processing techniques (as well as migration to intermediate XML for file-format conversions, or e-streaming technologies), are still in their initial research or development phases.
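The OTFR pattern can be sketched in a few lines: the archive stores only raw data plus the current processing configuration, and derives products at request time rather than in bulk reprocessing campaigns. Everything below is an illustrative assumption, not a description of any operational OTFR system.

```python
# Hypothetical on-the-fly reprocessing: products are derived per request,
# so each delivery reflects the latest calibration without a bulk campaign.
ARCHIVE = {"obs_001": [101.0, 98.5, 102.3]}   # raw counts only
CURRENT_CAL = {"gain": 0.98, "offset": -1.2}  # always the newest calibration

def get_product(obs_id):
    """Derive the data product at request time from raw data."""
    return [CURRENT_CAL["gain"] * v + CURRENT_CAL["offset"]
            for v in ARCHIVE[obs_id]]

print(get_product("obs_001"))
```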
Text extracted from the Data Management Principles Implementation Guidelines