Text Analysis Seminar at the Göttingen Center of Digital Humanities. 04.07.2012
In this lecture I present an experiment comparing the efficacy of several Named Entity Extraction (NEE) tools at extracting entities directly from the output of an optical character recognition (OCR) workflow. The presentation discusses the creation of a test data set consisting of raw and manually corrected OCR output, and compares the precision and recall of the tools in extracting entities of type PERSON, LOCATION and ORGANIZATION against the manually annotated test data.
2. Outline
Context of the experiments at the EHRI project
Description of the experiment
Corpus data
Creation and composition of the corpus
Results of the NE extraction
Conclusions
GCDH Colloquium 11.07.2012
3. Context in the EHRI project
Archival institutions hold large amounts of non-digitized documents and
descriptions
EHRI will provide its partners with an OCR service that:
Extracts text from image files of the documents
Text can be used to index the documents and improve the quality of
the search
Indexes can be later validated and improved by collection and
archive specialists
What kind of indexes can be obtained from this noisy text?
Quality of OCR transcripts is very low for humans, but is it useful for
machines?
4. Experiment
Evaluation of four existing NE extraction tools:
Stanford NER
OpenCalais
OpenNLP
Alchemy
Extracted entity types: PER, LOC, ORG
Good coverage by the selected tools.
Highly relevant for Shoah research and contemporary historical
research in general.
5. Experiment
Different tools use different annotation tagsets.
Output has to be normalized
Stanford NER and OpenNLP use Person, Location and Organization as
annotation categories.
Direct mapping to PER, LOC and ORG
OpenCalais:
Country, City and NaturalFeature merged into LOC
Organization and Facility into ORG
Alchemy:
Organization, Facility and Company into ORG
City and Continent into LOC
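The normalization described above could be sketched as a lookup table. The category names are the ones listed on this slide; the names TAG_MAP and normalize_tag are hypothetical. The slide lists only the merged LOC and ORG categories for OpenCalais and Alchemy, so this sketch leaves every other type of those tools unmapped.

```python
# Tool-specific entity types mapped to the shared PER/LOC/ORG tagset.
# Category names are taken from the slide; TAG_MAP and normalize_tag are
# hypothetical names used only for this sketch.
TAG_MAP = {
    "stanford": {"Person": "PER", "Location": "LOC", "Organization": "ORG"},
    "opennlp": {"Person": "PER", "Location": "LOC", "Organization": "ORG"},
    "opencalais": {
        "Country": "LOC", "City": "LOC", "NaturalFeature": "LOC",
        "Organization": "ORG", "Facility": "ORG",
    },
    "alchemy": {
        "Organization": "ORG", "Facility": "ORG", "Company": "ORG",
        "City": "LOC", "Continent": "LOC",
    },
}


def normalize_tag(tool, tag):
    """Return PER, LOC or ORG for a tool-specific tag, or None if unmapped."""
    return TAG_MAP.get(tool, {}).get(tag)


print(normalize_tag("opencalais", "NaturalFeature"))
print(normalize_tag("alchemy", "Company"))
```

Returning None for unmapped types makes it easy to drop entity types that fall outside the three categories under evaluation.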
6. Corpus data
Two datasets of typewritten monospaced text
Wiener Library
17 pages of testimonies of Shoah survivors
OCR word accuracy 93%
King's College London's Serving Soldier Archive
33 newsletters written for the crew of the warship H.M.S. Kelly
OCR word accuracy 92.5%
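The slides do not say how word accuracy was measured. One common reading is the share of words in the corrected transcript that reappear, in order, in the OCR output. The sketch below follows that assumption; the function name word_accuracy and the use of difflib for the alignment are illustrative choices, not the method used in the experiment.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, ocr_output):
    """Share of reference words matched, in order, by the OCR output.

    Assumption: whitespace tokenization and exact, case-sensitive matching.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    # Sum the lengths of all aligned runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


corrected = "Contact was maintained all night"
raw = "C0ntact was maintained all night"
print(word_accuracy(corrected, raw))
```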
8. Corpus data (WL)
蔵3o
had been sold, and we dependedgxhe last night of our stay on the
friendliness of this neighbour. III!! The landlord Mr.and Mrs.
Wolkewitz, who had always gone out of their way to be kind to us,
had a collection arranged to us, and_wn finally left - on the
night of July 4-5, 1939 - all the tenqnts or the house had
assembled, and we all cried.
All people mentioned so for have either been friends or
acqndintanoes. There were others e.g. the grocer and the laundry
who refused payment before our departure, end there are two
indidente with German officials which I would like to tell:
10. Corpus data (KCL)
:_
損 I |- _
li; A 1 U g _:__ L, 贈g!g;' 損
K D. F. NEws.,p
No. 24,~ "Monday, 18th September, 1959.
KELLY at Sea. _ ' P
KINGSTQN at portsmouth, Remainder of "K" Flotilla building.
THE "K" D.E. NEwS IS NCT To EE TAKEN ASHCRE NCR ARE ANY or ITS
CONTENTS To EE CCNRUNICATED CUTSIEE THE SHIP UNTIL THE MAR IS
OVER, wHEN ARRANGEMENTS CAN EE MADE To SUPPLY BACE CCPIES PCR
THE PRICE CR THE PAPER oN WHICH THEY ARE PRINTED.
`________________________as--sauna-__-as-_un-_._-損_.__--.`蔵___.-_-
n__________..蔵.__
THE KELLY'S HUNT - SEPTENEER Ietn/Ivtn,
11. Corpus data (KCL)
Although the events of Saturday night and Sunday
morning are Weil known to the KELLY shipis Company. they are
included here as being of interest to the rest of the Flotilla. `
Shortly after dark information was received which enabled
Course to be altered to close a German submarine on the surface.
Before the KELLY could arrive the submarine had dived, but a
Pemarkably good contact was obtained, and an att
C0ntact was maintained all night in order that the final attack
Sh0uld be carried out by daylight- Unfortunately no Oil, wreckage
'OP Survivors came to the surface, but air bUbb1S appeared after the
1&St attack, which makes it possible, although by no means certain,
that the submarine was destroyed. - _
THE KINGSTONS PROGRAIME. ~ -
Today the KINGSTON will be inspected by the Commander-
in~Chief, Portsmouth, and will then proceed to sea for acceptance
12. Construction of the corpus
Generate two copies of each dataset
Manual correction of one of the copies
Used to evaluate the impact of OCR noise on NE extraction
Tokenization and POS tagging using TreeTagger
Conversion of the TreeTagger output into stand-off standard XML.
Import of the data into the MMAX2 annotation tool
Manual annotation of the named entities
Control of the reliability of the annotation using the Kappa coefficient
K = 0.93
K > 0.8 is considered reliable
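The slide does not specify which Kappa variant was used. Assuming Cohen's kappa over the labels of two annotators, a minimal sketch looks like this; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: one label per token, "O" marks tokens outside any entity.
annotator_1 = ["PER", "LOC", "O", "O"]
annotator_2 = ["PER", "O", "O", "O"]
print(cohen_kappa(annotator_1, annotator_2))
```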
13. Corpus data (KCL)
          Wiener Library          KCL
          Raw    Corrected    Raw      Corrected
Files      17       17          33        33
Words    4415     4398       16982     15693
PER        75       83          82        80
LOC        60       63         170       178
ORG        13       13          52        60
Total     148      159         305       319
15. Results of the NE extraction
                         Raw                 Corrected
                     P     R     F1       P     R     F1
AL (Alchemy)        0.61  0.38  0.47     0.63  0.38  0.48
OC (OpenCalais)     0.75  0.29  0.41     0.69  0.30  0.42
ON (OpenNLP)        0.42  0.12  0.19     0.53  0.13  0.21
ST (Stanford NER)   0.57  0.52  0.54     0.60  0.61  0.60
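For reference, precision (P), recall (R) and F1 can be computed from a gold annotation and a tool's output as sketched below. The matching criterion (exact match on start offset, end offset and type) and the function names are assumptions; the slides do not state how matches were counted.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities.

    Entities are (start, end, type) tuples; only exact matches count,
    which is an assumption of this sketch.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical annotations for one document.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG"), (12, 13, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "ORG"), (12, 13, "LOC")}
print(precision_recall_f1(gold, predicted))
```

Applied to the Stanford NER figures for raw text, f1_score(0.57, 0.52) rounds to the 0.54 reported in the table.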
16. Results of the NE extraction
Low performance of the tools on both corrected and raw text
Our data and data used for training and evaluation of tools are quite
different.
PER: non-standard forms such as
[Last name, First name]
Wa1ter, Klaus
Parentheses together with initials of the name
Captain (D)
Some cases can be resolved using simple heuristics in preprocessing
Names of persons and locations are used for other kinds of entities:
Warships have been annotated as PER
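Two such preprocessing heuristics could look like the sketch below: replacing digits that OCR put in the middle of a word, and reordering "Last name, First name" forms. The substitution table, the regular expressions and the function names are assumptions for illustration, not the rules used in the project.

```python
import re

# Assumed, non-exhaustive table of digit-for-letter OCR confusions.
OCR_DIGIT_FIXES = {"0": "o", "1": "l", "5": "s"}

# Only digits with a letter on both sides are touched, so dates and
# ordinals such as "18th" or "1939" stay unchanged.
_DIGIT_IN_WORD = re.compile(r"(?<=[A-Za-z])[015](?=[A-Za-z])")

# A whole string of the form "Last, First" (letters only).
_LAST_FIRST = re.compile(r"^\s*([^\W\d_]+),\s*([^\W\d_]+)\s*$")


def fix_digits(text):
    """Replace digits inside words by the letters they most likely stand for."""
    def repl(match):
        letter = OCR_DIGIT_FIXES[match.group()]
        before = match.string[match.start() - 1]
        after = match.string[match.end()]
        # Use upper case only when the digit sits between two capitals.
        return letter.upper() if before.isupper() and after.isupper() else letter
    return _DIGIT_IN_WORD.sub(repl, text)


def reorder_name(name):
    """Turn "Last, First" into "First Last"; leave other strings unchanged."""
    match = _LAST_FIRST.match(name)
    return f"{match.group(2)} {match.group(1)}" if match else name


print(reorder_name(fix_digits("Wa1ter, Klaus")))
print(fix_digits("C0ntact was maintained all night"))
```

The name reordering is meant for isolated name strings, such as index entries, since applying it to running text would also rewrite ordinary comma-separated words.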
17. Results of the NE extraction
Performance of extraction of entities of type ORG is very low
F1 between 0.11 and 0.32
Names of organizations appear in non-standard forms
Some of the organizations no longer exist and are not part of the
knowledge used to train the system.
SS and other relevant Nazi organizations have not been detected.
Spelling errors and typos in the original files:
OpenCalais used general knowledge to resolve this problem
Use of general knowledge may be problematic.
Klan, Walter → Ku Klux Klan
18. Conclusions
Manual correction of OCR output does not significantly improve
performance.
Raw output is enough to obtain provisional index candidates
Focus in the near term:
Identify the most frequent error patterns
Implement preprocessing pipeline using simple heuristics and
pattern matching tools
Focus in the longer term:
Use domain specific knowledge in form of authority files to validate
and correct the output of NE extraction tools.
Explore the possibility of combining different NE extraction tools
and selecting the output using a voting algorithm
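One simple form such a voting scheme could take is majority voting over the normalized output of the tools: an entity is kept only if enough tools propose it with the same type. The threshold min_votes, the (surface form, type) representation and the sample outputs below are hypothetical.

```python
from collections import Counter


def vote(annotations_by_tool, min_votes=2):
    """Keep entities proposed by at least min_votes tools.

    annotations_by_tool maps a tool name to its list of
    (surface form, type) pairs after tagset normalization.
    """
    votes = Counter()
    for entities in annotations_by_tool.values():
        votes.update(set(entities))  # each tool votes once per entity
    return {entity for entity, count in votes.items() if count >= min_votes}


# Hypothetical tool outputs for one sentence.
outputs = {
    "stanford": [("Portsmouth", "LOC"), ("KELLY", "PER")],
    "opencalais": [("Portsmouth", "LOC")],
    "alchemy": [("Portsmouth", "LOC"), ("KELLY", "ORG")],
}
print(vote(outputs))
```

In this example only the location survives, because the tools disagree on the type of the second entity.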