Digital preservation at CERN

In order to transform CERN information systems into Trustworthy Digital Repositories (auditable TDRs), a Digital Memory platform was deployed in 2022. An important part of this effort is a service for conveniently archiving digital artifacts according to the OAIS specifications. The goal of the Digital Memory Platform is to provide CERN services with a simple, straightforward method of creating an OAIS-compliant dark archive.

Below is the diagram showing the workflow of a data record from its original information system into its preserved bag (AIP), available from a global registry.

It is important to note that using an OAIS-compliant archival system does not, by itself, make a service OAIS compliant. Such a system is only one part of a larger package of technical and administrative policies. To be more precise, the OAIS platform has the following goals and non-goals:

Goals:

  1. Create system agnostic AIPs.
  2. Provide a simple interface for creating AIPs.
  3. Archival of complex digital objects.
  4. Reconstruction of complex digital objects.
  5. Archival of versioned objects.
  6. Reconstruction of versioned objects.
  7. Storage of AIPs.

Non-goals:

  1. Select objects to archive.
  2. Provide metadata.
  3. Automatic versioning.
  4. Backups.
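
To make the first goals more concrete, here is a minimal sketch of what packaging a record into a system-agnostic "bag" could look like. It assumes a BagIt-style layout (a data/ payload directory, a bagit.txt declaration and a SHA-256 manifest); the text above only speaks of a "preserved bag (AIP)", so this layout is an assumption. The function names make_bag and verify_bag are hypothetical and are not the platform's API; the platform's real AIPs will carry more metadata than shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy every file under source_dir into bag_dir/data and write a manifest.

    Hypothetical helper: bag_dir must not exist yet and must not be
    located inside source_dir.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dest = data_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        payload = src.read_bytes()
        dest.write_bytes(payload)
        # One manifest line per payload file: "<checksum>  <path in bag>"
        digest = hashlib.sha256(payload).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel.as_posix()}\n")

    # Bag declaration (assumed BagIt 1.0 style)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each checksum listed in the manifest and compare."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


if __name__ == "__main__":
    # Illustrative record with a single file, built in a temporary directory
    workdir = Path(tempfile.mkdtemp())
    record = workdir / "record"
    record.mkdir()
    (record / "note.txt").write_text("internal note", encoding="utf-8")

    bag = make_bag(record, workdir / "bag")
    print("bag is valid:", verify_bag(bag))
```

Because the bag is plain files plus checksums, it can be validated and unpacked without any knowledge of the information system that produced it, which is the sense of "system agnostic" assumed in this sketch.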

A first release of the platform, in test mode, has been available since 17 Oct. 2022 at http://preserve-qa.web.cern.ch (accessible from inside CERN only). It supports the workflow below.

Preserve platform workflow


Preservation concerns a selection of scientific data and heritage data.

  • Physics data

Physics data is directly derived from particle collisions. The production of this unique data by detectors has a significant cost, and the acquired data is not easily reproducible in the long term. The challenge of preserving it comes mainly from its complexity and volume. Four levels of data are piling up: raw data, derived (or reconstructed) data, simplified data (for educational purposes) and publication data that gives rise to analysis articles. Raw data is becoming more and more massive with new experiments: for example, the LEP experiments produced about 100 TB per detector, the LHC produces a few hundred PB per experiment and the HL-LHC is planned to acquire some tens of EB.


The Data Preservation in High Energy Physics (dphep.org) project is considering how best to combine ongoing initiatives of the CERN Open Data (cod) and the closed Analysis Preservation (cap) portals with preservation of data following FAIR principles. These two information systems run on top of the open source Invenio digital repository software.

A pre-certification process against the ISO 16363 standard, run in 2018, showed that one of the missing components is the creation of proper Archival Information Packages (AIPs) as defined by the OAIS reference model.

  • Heritage data 

In parallel to scientific data, CERN, like most institutions, is committed to the preservation of its patrimony in digital formats. In addition to textual documents such as publications, preprints, studies or internal notes, CERN also maintains large collections of multimedia objects that are considered an important part of its heritage. The complete corpus of 20th-century audiovisual production (~6’000 tapes) and still images (~450’000) is being digitized by the CERN Digital Memory project and should be merged with the more recent born-digital assets.

Both new and past content is managed through the CERN Document and Multimedia Servers (cds), which use the same storage layer as physics data: the CERN Data Center. Another common point with the Open Data and Analysis Preservation services is the underlying software used to capture and provide access to the content, Invenio.

When considering compliance with ISO 16363, the efforts of the digital memory and data preservation projects naturally converged, as the same layers are shared at both the filesystem and software levels. Moreover, Zenodo, the OpenAIRE long-tail-of-science data repository service, is also based on Invenio and has joined the effort towards a common process for creating AIPs stored on the CERN Cloud.