To transform CERN information systems into Trustworthy Digital Repositories (auditable TDRs), a Digital Memory Platform was deployed in 2022. An important part of this effort is a service for conveniently archiving digital artifacts according to the OAIS specifications. The goal of the Digital Memory Platform is to give CERN services a simple, straightforward method of creating an OAIS-compliant dark archive.
The diagram below shows the workflow that takes a data record from its original information system to its preserved bag (AIP), which is available from a global registry.
It is important to note that using an OAIS-compliant archival system does not by itself make a service OAIS compliant. The archival system is only one part of a larger package of technical and administrative policies. To be more precise, the OAIS platform has the following goals and non-goals:
Goals:
- Create system agnostic AIPs.
- Provide a simple interface for creating AIPs.
- Archival of complex digital objects.
- Reconstruction of complex digital objects.
- Archival of versioned objects.
- Reconstruction of versioned objects.
- Storage of AIPs.
Non-goals:
- Select objects to archive.
- Provide metadata.
- Automatic versioning.
- Backups.
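As an illustration of the "system agnostic AIP" goal, here is a minimal sketch of packaging a record into a self-describing bag and later checking its integrity. It assumes that "bag" means a BagIt-style layout (RFC 8493), which this text does not confirm, and it uses only the Python standard library. The function names `make_bag` and `verify_bag`, the file names and the sample record are hypothetical and are not the platform's actual interface.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(source_files: dict[str, bytes], bag_dir: Path) -> Path:
    """Write payload files into a BagIt-style directory with a SHA-256 manifest.

    Hypothetical helper: an illustration only, not the platform's API.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Bag declaration as defined by BagIt 1.0 (RFC 8493).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    manifest_lines = []
    for name, content in sorted(source_files.items()):
        payload = data_dir / name
        payload.parent.mkdir(parents=True, exist_ok=True)
        payload.write_bytes(content)
        # One manifest line per payload file: "<checksum>  <relative path>".
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # tolerate blank lines (e.g. an empty bag)
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical record: a metadata file plus one attached file.
        bag = make_bag(
            {"record.json": b'{"title": "example"}', "files/photo.tif": b"\x00\x01"},
            Path(tmp) / "aip-0001",
        )
        print("bag is intact:", verify_bag(bag))
```

Because the manifest travels inside the package, the integrity check needs nothing from the originating information system, which is the sense in which such a package can be called system agnostic.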
A first test-mode release of the platform has been available since 17 Oct. 2022 at http://preserve-qa.web.cern.ch, from inside CERN only, supporting the workflow below.
Preservation concerns a selection of scientific data and heritage data.
- Physics data
Physics data is directly derived from particle collisions. Producing this unique data with detectors has a significant cost, and the acquired data is not easily reproducible in the long term. The challenge of preserving it comes mainly from its complexity and volume. Four levels of data accumulate: raw data, derived (or reconstructed) data, simplified data (for educational purposes) and publication data that gives rise to analysis articles. Raw data is becoming more and more massive with new experiments: for example, the LEP experiment produced about 100 TB per detector, the LHC produces a few hundred PB per experiment and the HL-LHC is planned to acquire some tens of EB.
The Data Preservation in High Energy Physics (dphep.org) project is considering how best to combine ongoing initiatives of the CERN Open Data (cod) and the closed Analysis Preservation (cap) portals with preservation of data following FAIR principles. These two information systems are running on top of the open source Invenio digital repository software.
A pre-certification exercise against the ISO 16363 standard, run in 2018, showed that one of the missing components is the creation of proper Archival Information Packages as defined by the OAIS reference model.
- Heritage data
In parallel with scientific data, CERN, like most institutions, is committed to preserving its heritage in digital formats. In addition to textual documents, such as publications, preprints, studies or internal notes, CERN also maintains large collections of multimedia objects that are considered an important part of its heritage. The complete corpus of 20th-century audiovisual productions (~6’000 tapes) and still images (~450’000) is being digitized by the CERN Digital Memory project and should be merged with the recent born-digital assets.
Both new and past content is managed through the CERN Document and Multimedia Servers (cds), which use the same storage layer as physics data: the CERN Data Centre. Another point in common with the Open Data and Analysis Preservation services is the underlying software used to capture and provide access to the content, Invenio.
When considering compliance with ISO 16363, the efforts of the digital memory and data preservation projects naturally converged, as the same layers are shared at both the filesystem and software levels. Moreover, Zenodo, the OpenAIRE long-tail-of-science data repository service, is also based on Invenio and has joined the effort toward a common process for creating AIPs stored on the CERN Cloud.