earkweb is an open-source archiving and digital preservation system based on the reference model for an Open Archival Information System (OAIS; Consultative Committee for Space Data Systems (CCSDS), ISO standard 14721:2012, https://public.ccsds.org/pubs/650x0m2.pdf). It provides functions for Ingest, Archival Storage, Access, and Management of Information Packages. Information Packages for Ingest, Archival Storage, and Access adhere to the eArchiving (E-ARK) information package specifications defined by the European Commission’s eArchiving initiative (https://digital-strategy.ec.europa.eu/en/activities/earchiving).
The life cycle of an Information Package starts with an E-ARK Submission Information Package (E-ARK SIP, https://dilcis.eu/specifications/sip) provided for ingest. The SIP can be created either by any external tool able to produce Information Packages conformant with the E-ARK SIP specification or with earkweb’s integrated SIP creator. During ingest, the system executes a series of workflow steps, including validation of the SIP against the requirements of the specification. If these steps succeed, ingest ends with the creation of the Archival Information Package (AIP). earkweb also supports creating Dissemination Information Packages (DIPs) and indexing them to enable access to, and full-text search in, Information Packages.
earkweb is a Python/Django-based web application with a MySQL database for storing information about data sets and a Celery (http://www.celeryproject.org/), RabbitMQ, and Redis backend for asynchronous and scalable task processing. This backend has the following beneficial properties:
- The task execution backend distributes tasks across multiple worker nodes, enabling parallel processing and efficient resource utilization.
- As the workload increases, additional workers can be added to handle higher volumes of tasks.
- The backend provides built-in mechanisms for handling task failures and retries, so that tasks can still complete when transient errors occur.
- Asynchronous task execution frees application resources for other work while long-running or resource-intensive tasks are processed in the background.
- Tasks can be prioritised by importance or urgency, so that critical tasks are processed promptly while less critical tasks are queued for later execution.
- The system integrates tools for monitoring task execution, tracking task progress, and managing worker nodes, which supports monitoring and optimization of task processing performance.
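To make the retry and prioritisation properties listed above more concrete, the following is a minimal sketch that imitates them with the Python standard library only. It is not earkweb code and does not use Celery; in earkweb these mechanisms are provided by the Celery backend. All names (`run_with_retries`, `process_queue`, `flaky_validation`) are hypothetical and chosen for illustration.

```python
import queue
import threading


def run_with_retries(func, args, max_retries=3):
    """Call func(*args); on an exception, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: report the failure to the caller


def process_queue(tasks, num_workers=2):
    """Run (priority, name, func, args) tuples on worker threads.

    Lower priority numbers are taken from the queue first.
    Returns a dict mapping task name to ("ok", result) or ("failed", message).
    """
    pending = queue.PriorityQueue()
    for index, (priority, name, func, args) in enumerate(tasks):
        # The index breaks ties so that functions are never compared.
        pending.put((priority, index, name, func, args))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                _, _, name, func, args = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                outcome = ("ok", run_with_retries(func, args))
            except Exception as exc:
                outcome = ("failed", str(exc))
            with lock:
                results[name] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical task that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_validation(package_id):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"{package_id} validated"


results = process_queue([
    (1, "validate", flaky_validation, ("pkg-001",)),
    (5, "index", lambda package_id: f"{package_id} indexed", ("pkg-001",)),
])
```

Adding more worker threads here plays the role that adding worker nodes plays in a distributed setup. A real deployment would rely on the task queue's own retry and routing features rather than hand-written loops like these.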
The ingest process is implemented as a set of modular and extensible backend tasks. Task execution can be monitored via the web frontend or a REST API. earkweb also offers a pre-defined batch-processing workflow that executes the full chain of tasks for fully automated ingest of large volumes of data.
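The idea of a modular task chain can be sketched as follows. This is an illustrative, standard-library-only example under stated assumptions, not earkweb's implementation: the task names, the `PackageState` class, and the status values are hypothetical, and the validation step is a placeholder that only checks for a root `METS.xml` file instead of performing real validation against the E-ARK SIP specification.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PackageState:
    """Hypothetical record passed from one workflow task to the next."""
    package_id: str
    files: List[str]
    status: str = "submitted"
    log: List[str] = field(default_factory=list)


Task = Callable[[PackageState], PackageState]


def validate_sip(state: PackageState) -> PackageState:
    # Placeholder check only; a real validator tests the full specification.
    if "METS.xml" not in state.files:
        raise ValueError("METS.xml missing")
    state.status = "validated"
    state.log.append("SIP validated")
    return state


def create_aip(state: PackageState) -> PackageState:
    state.status = "aip_created"
    state.log.append("AIP created")
    return state


def index_package(state: PackageState) -> PackageState:
    state.status = "indexed"
    state.log.append("package indexed")
    return state


# The chain is a plain list, so new steps can be inserted without
# changing the existing ones.
INGEST_TASKS: List[Task] = [validate_sip, create_aip, index_package]


def run_workflow(state: PackageState, tasks: List[Task]) -> PackageState:
    """Run tasks in order; stop at the first failure and record it."""
    for task in tasks:
        try:
            state = task(state)
        except Exception as exc:
            state.status = "failed"
            state.log.append(f"{task.__name__} failed: {exc}")
            break
    return state


def run_batch(packages: List[PackageState], tasks: List[Task]) -> List[PackageState]:
    """Apply the full task chain to every package in the batch."""
    return [run_workflow(package, tasks) for package in packages]


batch = run_batch(
    [
        PackageState("pkg-001", ["METS.xml", "representations/rep1/data/report.pdf"]),
        PackageState("pkg-002", ["report.pdf"]),
    ],
    INGEST_TASKS,
)
for package in batch:
    print(package.package_id, package.status, package.log)
```

Because each step only receives and returns the package state, a failed package does not stop the rest of the batch, and the recorded log gives a monitoring interface something to display per package.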
Links
- Demo environment: https://earkweb.sydarkivera.se/earkweb
- Source code: https://github.com/E-ARK-Software/earkweb
Further Information
- E-ARK specifications (managed by DILCIS board): https://dilcis.eu/specifications
- European Commission’s eArchiving Initiative: https://digital-strategy.ec.europa.eu/en/activities/earchiving