European Archival Records and Knowledge Preservation
#earkproject www.eark-project.eu @EARKProject
An OAIS-oriented System for Fast Package Creation, Search, and Access
Sven Schlarb, Rainer Schmidt, Roman Karl, Mihai Bartha, Jan Rörden, Janet Delve, Kuldar Aas
Presenter: Sven Schlarb <sven.schlarb@ait.ac.at>
AIT Austrian Institute of Technology
IPRES 2016
Bern, October 3, 2016
The E-ARK project is co-funded by the European Commission under the ICT-PSP programme.

Main outcomes
- E-ARK has defined a basic structure and recommended metadata standards for information packages.
- E-ARK has created a reference implementation covering the functional entities for ingest, archiving, and access according to the OAIS reference model.
- The SME partners KEEP Solutions and ESS have adapted their archiving solutions.
  - RODA repository (KEEP)
  - ESS Preservation Platform (ESS)
- AIT has developed an environment for processing information packages (SIP, AIP, DIP).
  - It provides a graphical front end called earkweb.
- AIT has developed a scalable backend repository for storing, discovering, and accessing data contained in information packages.
  - Initially based on the Lily repository project, now on Cloudera Search.

Reference Implementation Functionality
- Package transformation and ingest: modular package transformation workflows and metadata creation
- Full-text indexing & search: parallelized full-text indexing
- Access: fast random access to individual files
- Faceted search & data mining: aggregating data using facet queries; data mining (classification, NER)

E-ARK Archival Workflow
- Pre-Ingest (Producer)
  - Tasks: SIP Creation, Validation, Submission
  - E-ARK Tools: Database Preservation Toolkit, RODA/EPP Tools, earkweb
- Ingest
  - Tasks: SIP Validation, Archival Processing, AIP Creation
  - E-ARK Tools: earkweb, RODA, EPP
- Archival Storage
  - Tasks: Storage to Archival Repository
  - E-ARK Tools: Lily Repository (Cloudera Search), RODA, EPP
- Data Management
  - Tasks: Discover, Select, and Manipulate Records
  - E-ARK Tools: Lily Repository, RODA, EPP
- Access
  - Tasks: DIP Creation and activation (e.g. within an RDBMS)
  - E-ARK Tools: earkweb, RODA

E-ARK Information Package (simplified)
- Package contents: representations, metadata, [schemas/documentation]
- Metadata: structural, provenance, technical, descriptive
- Lifecycle: metadata edits, migrations, adding emulation info
- Package types shown in the diagram: SIP, DIP

earkweb
- earkweb is based on Python and the Celery task execution system.
- Archival workflows are created from predefined tasks, which can be executed in parallel on a computer cluster.
- Example tasks are data validation, format migration, content extraction, database transformation, packaging, and interfacing with storage systems.
- earkweb provides a graphical interface and can be used interactively as well as in batch mode.
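
The idea of building a workflow from predefined tasks and running it over many packages in parallel can be sketched as follows. This is not earkweb code: Python's standard `concurrent.futures` thread pool stands in for Celery workers, and the task names (`validate`, `extract_content`, `package`) and the dictionary used as a package are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical predefined tasks: each takes a package state and returns an updated copy.
def validate(pkg):
    return {**pkg, "valid": bool(pkg["files"])}

def extract_content(pkg):
    return {**pkg, "texts": [name for name in pkg["files"] if name.endswith(".txt")]}

def package(pkg):
    return {**pkg, "status": "AIP created" if pkg["valid"] else "rejected"}

def run_workflow(pkg, tasks):
    """Apply the tasks to one package in order and record the executed steps."""
    log = []
    for task in tasks:
        pkg = task(pkg)
        log.append(task.__name__)
    return {**pkg, "log": log}

def run_batch(packages, tasks, workers=4):
    """Run the same workflow over many packages in parallel (batch mode)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_workflow(p, tasks), packages))

WORKFLOW = [validate, extract_content, package]
packages = [
    {"id": "pkg-1", "files": ["report.txt", "scan.tiff"]},
    {"id": "pkg-2", "files": []},
]
results = run_batch(packages, WORKFLOW)
```

Keeping each task a plain function of the package state is what makes the workflow modular: tasks can be reordered or swapped without touching the runner, which is roughly the property a task queue relies on when distributing work.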

E-ARK Lily/Hadoop Repository
- The E-ARK Lily repository provides a scalable backend for storing, discovering, and accessing AIPs, based on technologies like Solr, MapReduce, and HBase.
- The repository is entirely distributed, allowing us to handle huge amounts of data.
- It provides full-text search, browsing, and random access to data contained in IPs.
- It provides APIs for carrying out computations (like data mining tasks) across the archived content.

Cluster Deployment Stack
[Diagram labels: Worker (x4); Staging/Storage Area (NAS); <<package transfer>>; decoupled; <<notification>>; <<search and retrieval>>; Information package status; Task results]

Standalone Deployment Stack
[Diagram labels: Worker (x4); Staging/Storage Area (NAS); <<indexing>>; <<search and retrieval>>; Information package status; Task results]

Search & Access
- Search within and across information packages
- Full-text index for office documents, PDF, MS Word, etc.
- Search based on defined fields, e.g. size, mime-type, package, etc.
- Results directly linked with the Lily content repository
- Faceted queries that group search results into different categories
- Spatio-temporal search in geographical datasets
- Filter search according to estimated text category (machine learning/text classification)
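
A faceted, filtered full-text query of the kind listed above could be expressed against Solr's standard select handler roughly as follows. The request parameters (`q`, `fq`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr parameters; the host, the collection name `eark`, and the field names `content`, `content_type`, `package`, and `category` are assumptions made for illustration, not the project's actual schema. The sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode

def facet_query_url(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr select URL that searches `text` and facets on the given fields."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field whose value counts should be returned.
    params += [("facet.field", name) for name in facet_fields]
    # Filter queries narrow the result set; values are quoted as exact phrases.
    params += [("fq", f'{field}:"{value}"') for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"

url = facet_query_url(
    "http://localhost:8983/solr/eark",  # hypothetical host and collection
    "content:archive",                  # hypothetical full-text field
    facet_fields=["content_type", "package"],
    filters={"category": "report"},
)
```

The facet counts returned for `content_type` and `package` are what a front end would use to group results into categories, while the `fq` filter corresponds to restricting the search to one estimated text category.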

Data Mining/NLP
- Purpose:
  - Show, by way of example, how digital resources contained in the archive can be analysed.
- Selected use cases:
  - Location names occurring in texts
  - Named entity recognition and incorporation of geo-information
  - Text classification

Location names occurring in texts
- Stanford NER for named entity recognition
- Nominatim (the geocoder behind the search on openstreetmap.org) for georeferencing
- Peripleo for visualization
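
The step from tagged text to georeferencing can be sketched as follows. The sketch assumes the tagger output is already available as (token, label) pairs with a `LOCATION` label, which is how Stanford NER's 3-class models label place names; the sample sentence is hand-written, and the helper names are hypothetical. The Nominatim `search` endpoint with `q`, `format`, and `limit` is part of its public API, but the code only builds the URLs and does not send requests.

```python
from urllib.parse import urlencode

def location_names(tagged_tokens):
    """Join consecutive LOCATION-tagged tokens into multi-word place names."""
    names, current = [], []
    for token, label in tagged_tokens:
        if label == "LOCATION":
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

def nominatim_search_url(place, base="https://nominatim.openstreetmap.org/search"):
    """Build a Nominatim search URL for one place name (no request is sent)."""
    return f"{base}?{urlencode({'q': place, 'format': 'json', 'limit': 1})}"

# Hand-written sample in the (token, label) shape described above.
tagged = [("The", "O"), ("archive", "O"), ("in", "O"), ("New", "LOCATION"),
          ("York", "LOCATION"), ("and", "O"), ("Bern", "LOCATION")]
places = location_names(tagged)
urls = [nominatim_search_url(p) for p in places]
```

Each response would carry candidate coordinates for the place name, which is the information a map-based viewer needs; a production setup would also have to respect the usage policy of the public Nominatim service or run its own instance.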
[Figure: location names occurring in texts, shown in Peripleo (PELAGIOS Project)]

Geographical/timeline search
Peripleo - PELAGIOS Project
- Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)
- Convert GML data to Peripleo RDF
- Translate coordinate system if necessary
- Use Peripleo to search for and visualize regions and filter by time
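
As a rough illustration of the first processing step, the coordinates of a region can be read from GML with Python's standard XML parser before any conversion takes place. The sample polygon, the function names, and the assumption of two-dimensional `gml:posList` coordinates in the GML 3.1 namespace are all made up for this sketch (GML 3.2 uses a different namespace URI). Translating between coordinate systems and producing Peripleo RDF are not shown; the former would need a projection library.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag name of gml:posList in the GML 3.1 namespace.
POS_LIST_TAG = "{http://www.opengis.net/gml}posList"

def pos_list_coordinates(gml_text):
    """Return coordinate pairs from every gml:posList, assuming 2D coordinates."""
    root = ET.fromstring(gml_text)
    pairs = []
    for pos_list in root.iter(POS_LIST_TAG):
        values = [float(v) for v in pos_list.text.split()]
        pairs.extend(zip(values[0::2], values[1::2]))
    return pairs

def bounding_box(pairs):
    """Return (min first, min second, max first, max second) over all pairs."""
    firsts = [a for a, _ in pairs]
    seconds = [b for _, b in pairs]
    return min(firsts), min(seconds), max(firsts), max(seconds)

# Made-up sample polygon; the axis order depends on the declared coordinate system.
SAMPLE = """<gml:Polygon xmlns:gml="http://www.opengis.net/gml" srsName="EPSG:4326">
  <gml:exterior><gml:LinearRing>
    <gml:posList>46.0 14.0 46.0 15.0 47.0 15.0 47.0 14.0 46.0 14.0</gml:posList>
  </gml:LinearRing></gml:exterior>
</gml:Polygon>"""

coords = pos_list_coordinates(SAMPLE)
bbox = bounding_box(coords)
```

A bounding box like this is a convenient intermediate form for region search, since a map viewer can test it cheaply against the visible area before looking at the full geometry.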

Text classification using scikit-learn
- Prepare data to train an SVM classifier
- Dump full texts of the repository into reusable packages
- Apply text classification and update Solr records accordingly
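
The last step, writing predicted categories back to the index, might look like the following sketch. It assumes the classifier output is already available as a mapping from document id to predicted label, and it uses Solr's atomic-update JSON form (`{"set": value}`) so that only one field of each record changes. The field name `category` and the document ids are hypothetical; the payload is only built here, not posted to an update handler.

```python
import json

def category_updates(predictions, field="category"):
    """Turn {document id: predicted label} into Solr atomic-update documents."""
    return [
        {"id": doc_id, field: {"set": label}}
        for doc_id, label in sorted(predictions.items())
    ]

# Hypothetical classifier output: document id -> predicted text category.
predictions = {
    "pkg-2/letter.txt": "correspondence",
    "pkg-1/report.txt": "finance",
}
payload = json.dumps(category_updates(predictions))
```

Using an atomic update rather than re-sending whole documents avoids re-extracting the full text, provided the index schema is configured to support atomic updates.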

Database archiving, rebuilding and analysis
[Diagram labels: RDBMS data (up to 80 TB); SIARD; e.g. Postgres; e.g. Oracle; image source: Wikipedia]
Submit ... Archive ... Reconstruct ... Analyse.

Current Pilots
- National Archive of Hungary
  - Full-scale cluster deployment of earkweb and the Hadoop/Lily back end.
  - Ingest, search, and access on large volumes of AIPs.
- National Archive of Slovenia
  - earkweb and Peripleo installation for ingesting, visualising, and searching geo-data.
- Danish National Archives
  - earkweb standalone installation

Want to try it out?
- Single-machine deployment of the E-ARK Reference Implementation available online: http://earkdev.ait.ac.at/earkweb
- Oracle VirtualBox VM (Standalone Deployment!) available for download: http://earkdev.ait.ac.at/eark/pilots/eark-pilot-vm.ova
- General information about E-ARK: http://www.eark-project.eu
