
European Archival Records and Knowledge Preservation
#earkproject www.eark-project.eu @EARKProject
An OAIS-oriented System for Fast Package Creation, Search, and Access
Sven Schlarb, Rainer Schmidt, Roman Karl, Mihai Bartha, Jan Rörden, Janet Delve, Kuldar Aas
Presenter: Sven Schlarb <sven.schlarb@ait.ac.at>
AIT Austrian Institute of Technology
IPRES 2016
Bern, October 3, 2016

The E-ARK project is co-funded by the European Commission under the ICT-PSP programme.
www.eark-project.eu

Main outcomes
● E-ARK has defined a basic structure and recommended metadata standards for information packages.
● E-ARK has created a reference implementation covering the functional entities for ingest, archiving, and access according to the OAIS reference model.
● The SME partners KEEP Solutions and ESS have adapted their archiving solutions.
– RODA repository (KEEP)
– ESS Preservation Platform (ESS)
● AIT has developed an environment for processing information packages (SIP, AIP, DIP).
– Providing a graphical front-end called earkweb.
● AIT has developed a scalable backend repository for storing, discovering, and accessing data contained in information packages.
– Initially based on the Lily repository project, now Cloudera Search.

Reference Implementation Functionality
• Package transformation and Ingest
– Modular package transformation workflows & metadata creation
• Full-text indexing & search
– Parallelized full-text indexing
• Access
– Fast random access to individual files
• Faceted Search & Data Mining
– Aggregating data using facet queries
– Data mining (Classification, NER)

E-ARK Archival Workflow
• Pre-Ingest (Producer)
– Tasks: SIP Creation, Validation, Submission
– E-ARK Tools: Database Preservation Toolkit, RODA/EPP Tools, earkweb
• Ingest
– Tasks: SIP Validation, Archival Processing, AIP Creation
– E-ARK Tools: earkweb, RODA, EPP
• Archival Storage
– Tasks: Storage to Archival Repository
– E-ARK Tools: Lily Repository (Cloudera Search), RODA, EPP
• Data Management
– Tasks: Discover, Select, and Manipulate Records
– E-ARK Tools: Lily Repository, RODA, EPP
• Access
– Tasks: DIP Creation and activation (e.g. within an RDBMS)
– E-ARK Tools: earkweb, RODA

E-ARK Information Package (simplified)
• Package contents: representations; metadata; [schemas/documentation]
• Metadata types: structural, provenance, technical, descriptive
• Lifecycle: SIP, DIP; metadata edits, migrations, adding emulation info
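The simplified package layout can be illustrated with a short Python sketch. It only uses the folder names shown on the slide; treating `representations` and `metadata` as required and the other two as optional is an assumption made for the example, and both helper functions are hypothetical, not part of the E-ARK tools.

```python
from pathlib import Path
import tempfile

# Folder names come from the simplified package diagram. Which of them are
# mandatory is an assumption made for this sketch.
REQUIRED_DIRS = ("representations", "metadata")
OPTIONAL_DIRS = ("schemas", "documentation")


def create_package_skeleton(root):
    """Create an empty information package folder layout under root."""
    for name in REQUIRED_DIRS + OPTIONAL_DIRS:
        (Path(root) / name).mkdir(parents=True, exist_ok=True)
    return Path(root)


def missing_dirs(root):
    """Return the required folders that are absent from the package."""
    return [name for name in REQUIRED_DIRS if not (Path(root) / name).is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    package = create_package_skeleton(Path(tmp) / "example-sip")
    print(missing_dirs(package))  # prints []
```

A structural check of this kind would be one small part of SIP validation; the actual validation rules are defined by the E-ARK specifications.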

earkweb
• earkweb is based on Python and the Celery task execution system.
– Create archival workflows from predefined tasks which can be executed in parallel on a computer cluster.
– Examples are data validation, format migration, content extraction, database transformation, packaging, and interfacing with storage systems.
– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
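The idea of building a workflow from predefined tasks and running it in parallel over many packages can be sketched as follows. This is not earkweb code: the task functions are hypothetical stand-ins, and the standard-library `ThreadPoolExecutor` replaces Celery workers so that the sketch runs without a message broker.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical predefined tasks. Each one takes and returns a dict that
# describes a single information package.
def validate(package):
    package["valid"] = bool(package.get("files"))
    return package


def migrate_formats(package):
    # Pretend migration: record a migrated name for every file.
    package["migrated"] = [name + ".migrated" for name in package["files"]]
    return package


def build_aip(package):
    package["type"] = "AIP"
    return package


WORKFLOW = [validate, migrate_formats, build_aip]


def run_workflow(package):
    """Apply the workflow tasks to one package, in order."""
    for task in WORKFLOW:
        package = task(package)
    return package


def run_batch(packages, workers=4):
    """Process several packages in parallel, one workflow run per package."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_workflow, packages))


packages = [
    {"id": "sip-1", "files": ["a.doc"]},
    {"id": "sip-2", "files": ["b.doc", "c.doc"]},
]
results = run_batch(packages)
print([(p["id"], p["type"]) for p in results])
```

In a Celery-based setup each task would instead be registered with the task queue and executed by workers on the cluster; the ordering of tasks within one package and the parallelism across packages stay the same.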

E-ARK Lily/Hadoop Repository
• The E-ARK Lily repository provides a scalable backend for storing, discovering, and accessing AIPs based on technologies like Solr, MapReduce, and HBase.
– The repository is fully distributed, which allows it to handle large volumes of data.
– It provides full-text search, browsing, and random access to data contained in IPs.
– It provides APIs for carrying out computations (like data mining tasks) across the archived content.

Cluster Deployment Stack
[Figure labels: four workers; staging/storage area (NAS); decoupled <<package transfer>>; <<notification>>; <<search and retrieval>>; information package status; task results]

Standalone Deployment Stack
[Figure labels: four workers; staging/storage area (NAS); <<indexing>>; <<search and retrieval>>; information package status; task results]

Search & Access
• Search within and across information packages
– Full text index for office documents, PDF, MS Word, etc.
– Search based on defined fields, e.g. size, mime-type, package, etc.
– Results directly linked with the Lily content repository
• Faceted queries allow search results to be clustered into
different categories
• Spatio-temporal search in geographical datasets
• Filter search according to estimated text category
(machine learning/text classification)
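A faceted full-text query against a Solr index can be sketched as below. The request parameters (`q`, `fq`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr query parameters; the base URL, the core name `eark`, and the field names `content_type` and `package` are assumptions made for the example, and the function only builds the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr select URL with full-text query, filters, and facets."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field to aggregate on.
    params += [("facet.field", field) for field in facet_fields]
    # Filter queries restrict the result set; values are quoted so that
    # characters such as "/" in MIME types are treated literally.
    params += [("fq", f'{key}:"{value}"') for key, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr/eark",  # hypothetical core name
    "annual report",
    facet_fields=["content_type", "package"],  # hypothetical field names
    filters={"content_type": "application/pdf"},
)
print(parse_qs(urlsplit(url).query)["facet.field"])
```

The facet counts returned by such a query are what allows results to be grouped by category, for example by MIME type or by package.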


Data Mining/NLP
• Purpose:
● Show how to analyse digital resources contained in the archive in an exemplary manner.
• Selected use cases:
● Location names occurring in texts.
● Named entity recognition and incorporation of geo-information
● Text classification

Location names occurring in texts
● StanfordNER for NER
● Nominatim (geocoding service for OpenStreetMap data) for georeferencing
● Peripleo for visualization
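The georeferencing step can be sketched as building one Nominatim search request per recognised location name. The endpoint and the `q`, `format`, and `limit` parameters belong to the public Nominatim search API; the list of location names is a made-up stand-in for NER output, and no request is sent.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def geocode_url(place_name, limit=1):
    """Return the Nominatim search URL for a single place name."""
    query = urlencode({"q": place_name, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH}?{query}"


# Made-up example of location names as an NER step might return them.
locations = ["Bern", "Vienna"]
for name in locations:
    print(geocode_url(name))
```

Each JSON response contains candidate places with coordinates, which could then be attached to the indexed document for map-based visualisation.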

Location names occurring in texts
Peripleo - PELAGIOS Project

Geographical/timeline search
● Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)
● Convert GML data to Peripleo RDF
● Translate coordinate system if necessary
● Use Peripleo to search for and visualize regions and filter by time

Geographical/timeline search

Text classification using scikit-learn
● Prepare data to train SVM classifier
● Dump full-texts of the repository into reusable packages
● Apply text classification and update Solr records accordingly
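The three steps above can be sketched with scikit-learn as follows. The training texts and labels are a made-up toy corpus, the TF-IDF plus `LinearSVC` pipeline is one common way to train an SVM text classifier and not necessarily the configuration used in E-ARK, and the Solr field name `category` is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy training data; a real setup would use labelled full texts.
train_texts = [
    "tax invoice payment budget",
    "budget payment account invoice",
    "river map coordinates region",
    "map region mountain river",
]
train_labels = ["finance", "finance", "geography", "geography"]

# TF-IDF features followed by a linear SVM.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)


def classify(texts):
    """Return the predicted category for each text."""
    return [str(label) for label in classifier.predict(texts)]


def solr_update(doc_id, category):
    """Build a Solr atomic-update document that sets the category field."""
    return {"id": doc_id, "category": {"set": category}}


predicted = classify(["invoice payment"])[0]
print(solr_update("doc-1", predicted))
```

The update documents would be posted to the Solr update handler so that the estimated category becomes available as a search filter.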

Database archiving, rebuilding and analysis
[Figure labels: RDBMS data (up to 80 TB); SIARD; e.g. Postgres; e.g. Oracle. Image source: Wikipedia]
Submit ... Archive ... Reconstruct ... Analyse.

Current Pilots
• National Archive of Hungary
– Full-scale cluster deployment of earkweb and Hadoop/Lily back-end.
– Ingest, search, and access on large volumes of AIPs.
• National Archive of Slovenia
– earkweb and Peripleo installation for ingesting, visualising, and searching geo-data.
• Danish National Archives
– earkweb standalone installation

Want to try it out?
• Single-machine deployment of the E-ARK
Reference Implementation available online:
http://earkdev.ait.ac.at/earkweb
• Oracle VirtualBox VM (Standalone Deployment!) available for download:
http://earkdev.ait.ac.at/eark/pilots/eark-pilot-vm.ova
• General information about E-ARK:
http://www.eark-project.eu

