The document summarizes the European Archival Records and Knowledge Preservation (E-ARK) project, which developed an OAIS-oriented system for fast creation, search, and access of archival information packages. It describes the key components and functionality of the E-ARK reference implementation, including tools for ingest, archival storage, data management, access, and data mining of archived content. Several national archives are piloting the E-ARK system for large-scale archiving of and access to records.
E-ARK-iPRES2016-Bern-October-2016
1. European Archival Records and Knowledge Preservation
#earkproject www.eark-project.eu @EARKProject
An OAIS-oriented System for Fast Package
Creation, Search, and Access
Sven Schlarb, Rainer Schmidt, Roman Karl, Mihai Bartha, Jan
Rörden, Janet Delve, Kuldar Aas
Presenter: Sven Schlarb <sven.schlarb@ait.ac.at>
AIT Austrian Institute of Technology
IPRES 2016
Bern, October 3, 2016
3. E-ARK has defined a basic structure and recommended metadata standards for
information packages.
E-ARK has created a reference implementation covering the functional entities for
ingest, archiving, and access according to the OAIS reference model.
The SME partners KEEP Solutions and ESS have adapted their archiving solutions.
RODA repository (KEEP)
ESS Preservation Platform (ESS)
AIT has developed an environment for processing information packages
(SIP, AIP, DIP).
Providing a graphical front-end called earkweb.
AIT has developed a scalable backend repository for storing, discovering, and
accessing data contained in information packages.
Initially based on the Lily repository project, now Cloudera Search.
Main outcomes
4. Modular package
transformation workflows
& metadata creation
Parallelize full-text
indexing
Fast random access
to individual files
Aggregating data
using facet queries
Data mining (Classification,
NER)
Faceted Search & Data Mining
Access
Full-text indexing & search
Package transformation and Ingest
Reference Implementation Functionality
5. Pre-Ingest (Producer)
Tasks: SIP Creation, Validation, Submission
E-ARK Tools: Database Preservation Toolkit, RODA/EPP Tools, earkweb
Ingest
Tasks: SIP Validation, Archival Processing, AIP Creation
E-ARK Tools: earkweb, RODA, EPP
Archival Storage
Tasks: Storage to Archival Repository
E-ARK Tools: Lily Repository (Cloudera Search), RODA, EPP
Data Management
Tasks: Discover, Select, and Manipulate Records
E-ARK Tools: Lily Repository, RODA, EPP
Access
Tasks: DIP Creation and activation (e.g. within an RDBMS)
E-ARK Tools: earkweb, RODA
E-ARK Archival Workflow
7. earkweb is based on Python and the Celery task
execution system.
Create archival workflows from predefined tasks which
can be executed in parallel on a computer cluster.
Examples are data validation, format migration, content
extraction, database transformation, packaging,
interfacing with storage systems.
earkweb provides a graphical interface and can be
used interactively as well as in batch mode.
earkweb
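The idea of composing a workflow from predefined tasks and running it over many packages in parallel can be sketched as follows. This is a minimal illustration only: it uses Python's standard-library thread pool as a stand-in for Celery, and the task functions (`validate`, `extract_content`) and the package dictionaries are hypothetical, not taken from earkweb.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task functions; in earkweb these would be Celery tasks.
def validate(package):
    """Mark a package as valid if it lists at least one file."""
    return {**package, "valid": len(package["files"]) > 0}

def extract_content(package):
    """Record the plain-text files (stand-in for content extraction)."""
    texts = [name for name in package["files"] if name.endswith(".txt")]
    return {**package, "texts": texts}

def run_workflow(package, tasks):
    """Apply the predefined tasks to one package, in the given order."""
    for task in tasks:
        package = task(package)
    return package

def run_batch(packages, tasks, workers=4):
    """Process several packages in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_workflow(p, tasks), packages))

# Illustrative input packages (assumed structure).
packages = [
    {"id": "pkg-1", "files": ["a.txt", "b.pdf"]},
    {"id": "pkg-2", "files": []},
]
results = run_batch(packages, [validate, extract_content])
```

Each package passes through the same ordered task list, while different packages are handled concurrently, which mirrors the batch mode described on the slide.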
8. The E-ARK Lily repository provides a scalable
backend for storing, discovering, and accessing AIPs
based on technologies like Solr, MapReduce, and
HBase.
The repository is entirely distributed, allowing us to
handle huge amounts of data.
It provides full-text search, browsing, random access to
data contained in IPs.
It provides APIs allowing one to carry out computations
(like data mining tasks) across the archived content.
E-ARK Lily/Hadoop Repository
9. [Diagram: four worker nodes and a staging/storage area (NAS). Labels: package transfer (decoupled), notification, search and retrieval, information package status, task results.]
Cluster Deployment Stack
11. Search & Access
Search within and across information packages
Full text index for office documents, PDF, MS Word, etc.
Search based on defined fields, e.g. size, mime-type, package, etc.
Results directly linked with the Lily content repository
Faceted queries that group search results into
different categories
Spatio-temporal search in geographical datasets
Filter search according to estimated text category
(machine learning/text classification)
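A field-based search with facets, as listed above, could be expressed as a Solr select request along these lines. This is a sketch under assumptions: the core URL and the field names (`content_type`, `package`) are hypothetical, while `q`, `fq`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr /select URL with facet counts and optional filter queries."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field whose value counts are wanted.
    params += [("facet.field", field) for field in facet_fields]
    # Filter queries narrow the result set without affecting scoring.
    params += [("fq", f"{name}:{value}") for name, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"

url = build_facet_query(
    "http://localhost:8983/solr/eark",  # hypothetical Solr core URL
    "content:archive",
    facet_fields=["content_type", "package"],  # hypothetical field names
    filters={"package": "pkg-1"},
)
query = parse_qs(urlsplit(url).query)
```

The response to such a request would contain, next to the matching documents, a count per distinct value of each facet field, which is what allows results to be grouped by category.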
14. Data Mining/NLP
Purpose:
Show by example how digital resources contained in
the archive can be analysed.
Selected use cases:
Location names occurring in texts.
Named entity recognition and incorporation of
geo-information
Text classification
15. Location names occurring in texts
StanfordNER for NER
nominatim (database behind
openstreetmap.org) for georeferencing
peripleo for visualization
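The georeferencing step could look roughly like the sketch below, which covers only the lookup of an already recognised location name. It assumes the public Nominatim search endpoint with its `q`, `format`, and `limit` parameters; the helper names are hypothetical, and the sample payload is an illustrative stand-in shaped like a Nominatim JSON result, not real output. No network request is made.

```python
import json
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def geocode_url(place_name, limit=1):
    """Build a Nominatim search URL for a location name found by NER."""
    query = urlencode({"q": place_name, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH}?{query}"

def first_coordinates(response_text):
    """Return (lat, lon) of the first hit, or None if there are no hits."""
    hits = json.loads(response_text)
    if not hits:
        return None
    # Nominatim encodes coordinates as strings in its JSON output.
    return float(hits[0]["lat"]), float(hits[0]["lon"])

# Illustrative payload only (values chosen for the example).
sample = '[{"display_name": "Bern, Switzerland", "lat": "46.95", "lon": "7.45"}]'
coords = first_coordinates(sample)
```

The resulting coordinates are what a visualization tool needs in order to place a text's location mentions on a map.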
17. Geographical/timeline search
Peripleo - PELAGIOS Project
Provided: GML data and TIFF images of
maps with metadata (coordinate system,
time, etc.)
Convert GML data to Peripleo RDF
Translate coordinate system if necessary
Use peripleo to search for and visualize
regions and filter by time
19. Text classification using
scikit-learn
Prepare data to train SVM classifier
Dump full-texts of the repository into
reusable packages
Apply text classification and update Solr
records accordingly
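The last step, writing predicted categories back to the index, could be sketched as below. It assumes Solr's atomic update syntax (`{"set": value}`) and builds only the JSON payload. The field name `category`, the record identifiers, and the labels are hypothetical, and the classifier itself is not shown.

```python
import json

def category_updates(predictions, field="category"):
    """Build Solr atomic-update documents that set one field per record."""
    return [
        {"id": doc_id, field: {"set": label}}
        for doc_id, label in predictions.items()
    ]

# Hypothetical classifier output: record id -> predicted category.
predictions = {"pkg-1/a.txt": "finance", "pkg-1/b.txt": "legal"}

# JSON body that could be posted to a Solr update handler.
payload = json.dumps(category_updates(predictions))
docs = json.loads(payload)
```

Using an atomic update means only the category field changes, so the full-text content does not need to be re-sent for every classified record.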
20. Database archiving, rebuilding
and analysis
source: wikipedia
SIARD
RDBMS data
(up to 80TB)
e.g. Postgres e.g. Oracle
Submit ... Archive ... Reconstruct ... Analyse.
21. National Archive of Hungary
Full-scale cluster deployment of earkweb and
Hadoop/Lily back-end.
Ingest, search, and access on large volumes of AIPs.
National Archive of Slovenia
earkweb and Peripleo installation for ingesting,
visualising, and searching geo-data.
Danish National Archives
earkweb standalone installation
Current Pilots
22. Want to try it out?
Single-machine deployment of the E-ARK
Reference Implementation available online:
http://earkdev.ait.ac.at/earkweb
Oracle VirtualBox VM (Standalone
Deployment!) available for download:
http://earkdev.ait.ac.at/eark/pilots/eark-pilot-vm.ova
General information about E-ARK:
http://www.eark-project.eu