linux.conf.au 2021 | Presentation: Arkisto: an open-source, standards-based framework for digital preservation

Presented by

Mike Lynch
https://mikelynch.org/

Mike works in the eResearch Support Group at the University of Technology Sydney and probably writes more code than his job description allows for. He has been providing specialised data management and IT support to academic researchers across a wide range of disciplines for the past ten years, and these days does most of his coding in JavaScript and Python. His recent work has focused on the use of open source software and open standards to describe, publish and preserve research data. He's also interested in generative art, data visualisation and functional programming.

Abstract

Arkisto is a digital data preservation project which focuses on sustainability and reusability. It is based on open standards and takes a data-centric approach. The basis of an Arkisto repository is simply a file system, with digital artefacts laid out on disk or object storage according to the Oxford Common File Layout standard (OCFL), which provides efficient versioning and digital preservation features. Human- and machine-readable data descriptions are included using RO-Crate, a format for managing linked data using the JSON-LD and Schema.org standards. Datasets stored in this format can then be indexed using open-source tools such as Solr or Elasticsearch and made available for search, discovery and download. Arkisto is intended to address the issue of research data collections which are no longer being looked after because of institutional changes or a lack of funding or software support. Rather than locking collections into a specialised or monolithic repository or database, Arkisto encourages a modular philosophy in which the primary object, the data itself, is stored in long-lived formats which are easy to code against, and tools such as indices are relatively cheap to build, deploy and modify. I'll present a short explanation of the key standards and concepts behind Arkisto, and show a couple of its applications: as a data publication repository at UTS, and at PARADISEC, an extensive collection of cultural data around endangered languages.
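
To give a rough idea of what an RO-Crate data description looks like, here is a minimal Python sketch that builds the kind of JSON-LD document the RO-Crate 1.1 specification describes (a metadata descriptor plus a root dataset that lists its files). This is not Arkisto's own code: the function name, the dataset name and the file paths are all made up for illustration, and a real crate would normally carry much richer metadata.

```python
import json


def build_ro_crate_metadata(name, description, date_published, file_paths):
    """Return a minimal RO-Crate 1.1 metadata document as a dict.

    Hypothetical helper for illustration only, not part of Arkisto.
    """
    # One entity per data file, identified by its path relative to the crate root.
    file_entities = [
        {"@id": path, "@type": "File", "name": path} for path in file_paths
    ]

    # The root dataset describes the collection as a whole using Schema.org terms.
    root_dataset = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "hasPart": [{"@id": path} for path in file_paths],
    }

    # The metadata descriptor points at the root dataset and names the spec version.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }

    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root_dataset, *file_entities],
    }


# Hypothetical example values.
crate = build_ro_crate_metadata(
    name="Example field recordings",
    description="An illustrative dataset, not a real collection.",
    date_published="2021-01-23",
    file_paths=["recordings/session1.wav", "notes/session1.txt"],
)

# Serialised, this is what would be saved as ro-crate-metadata.json.
print(json.dumps(crate, indent=2))
```

Because the result is plain JSON stored alongside the data files, an indexer only needs to walk the repository and read each `ro-crate-metadata.json` to rebuild a search index, which is what keeps the index cheap to replace.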