
Named entity extraction tools for raw OCR text
      Kepa J. Rodriguez

Outline

   Context of the experiments at the EHRI project
   Description of the experiment
   Corpus data
   Creation and composition of the corpus
   Results of the NE extraction
   Conclusions

                            GCDH Colloquium  11.07.2012
Context in the EHRI project

   Archival institutions hold large amounts of non-digitized documents and
    descriptions.
   EHRI will provide its partners with an OCR service that:
      Extracts text from image files of the documents
      Produces text that can be used to index the documents and improve
       the quality of the search
      Yields indexes that can later be validated and improved by
       collection and archive specialists
   What kind of indexes can be obtained from this noisy text?
   The quality of OCR transcripts is very low for human readers, but is it
    useful for machines?


Experiment

   Evaluation of four existing NE extraction tools:
      Stanford NER
      OpenCalais
      OpenNLP
      Alchemy

   Extracted entity types: PER, LOC, ORG
      Good coverage by the selected tools.
      Highly relevant for Shoah research and contemporary historical
        research in general.


Experiment

   Different tools use different annotation tagsets.
      Output has to be normalized.
   Stanford NER and OpenNLP use Person, Location and Organization as
    annotation categories.
      Direct mapping to PER, LOC and ORG
   OpenCalais:
      Country, City and NaturalFeature merged into LOC
      Organization and Facility merged into ORG
   Alchemy:
      Organization, Facility and Company merged into ORG
      City and Continent merged into LOC
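
The normalization described above can be sketched as a lookup table. This is only an illustration, not the code used in the experiment: the tool keys, the function name and the "Person" entries for OpenCalais and Alchemy are assumptions, since the slide lists only the LOC and ORG merges for those two tools.

```python
from typing import Optional

# Common mapping for tools that already use Person/Location/Organization.
_DIRECT = {"Person": "PER", "Location": "LOC", "Organization": "ORG"}

# Hypothetical normalization table; tool keys are illustrative names.
TAG_MAP = {
    "stanford": dict(_DIRECT),
    "opennlp": dict(_DIRECT),
    "opencalais": {
        "Person": "PER",  # assumption: not listed on the slide
        "Country": "LOC",
        "City": "LOC",
        "NaturalFeature": "LOC",
        "Organization": "ORG",
        "Facility": "ORG",
    },
    "alchemy": {
        "Person": "PER",  # assumption: not listed on the slide
        "Organization": "ORG",
        "Facility": "ORG",
        "Company": "ORG",
        "City": "LOC",
        "Continent": "LOC",
    },
}


def normalize(tool: str, tag: str) -> Optional[str]:
    """Map a tool-specific tag to PER, LOC or ORG; None if it is not covered."""
    return TAG_MAP[tool].get(tag)
```

Tags outside the table (for example any entity type other than person, location or organization) come back as None and can simply be dropped before evaluation.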

Corpus data

   Two datasets of typewritten, monospaced text
   Wiener Library
      17 pages of testimonies of Shoah survivors
      OCR word accuracy 93%
   King's College London's Serving Soldier Archive
      33 newsletters written for the crew of the warship H.M.S. Kelly
      OCR word accuracy 92.5%
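
The slides do not say how word accuracy was measured. One common definition is 1 minus the word-level edit distance between the OCR output and a corrected reference, divided by the number of reference words. The sketch below assumes that definition and whitespace tokenization; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # reference word missing from OCR
                cur[j - 1] + 1,          # extra word in OCR
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_accuracy(reference: str, ocr: str) -> float:
    """Assumed definition: 1 - (word edit distance / reference length)."""
    ref, hyp = reference.split(), ocr.split()
    if not ref:
        raise ValueError("reference text is empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

For instance, an 8-word reference line whose OCR version contains two misrecognized words scores 0.75 under this definition.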

Corpus data (WL)

had been sold, and we dependedgxhe last night of our stay on the
friendliness of this neighbour. III!! The landlord Mr.and Mrs.
Wolkewitz, who had always gone out of their way to be kind to us,
had a collection arranged to us, and_wn finally left - on the
night of July 4-5, 1939 - all the tenqnts or the house had
assembled, and we all cried.
All people mentioned so for have either been friends or
acqndintanoes. There were others e.g. the grocer and the laundry
who refused payment before our departure, end there are two
indidente with German officials which I would like to tell:

Corpus data (KCL)

損 I |- _
li; A 1 U g  _:__ L, 贈g!g;' 損
K D. F. NEws.,p
No. 24,~ "Monday, 18th September, 1959.
KELLY at Sea. _ ' P
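KINGSTQN at portsmouth, Remainder of "K" Flotilla building.
THE "K" D.E. NEwS IS NCT To EE TAKEN ASHCRE NCR ARE ANY or ITS
CONTENTS To EE CCNRUNICATED CUTSIEE THE SHIP UNTIL THE MAR IS OVER,
wHEN ARRANGEMENTS CAN EE MADE To SUPPLY BACE CCPIES PCR THE PRICE CR
THE PAPER oN WHICH THEY ARE PRINTED.
THE KELLY'S HUNT - SEPTENEER Ietn/Ivtn,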

Corpus data (KCL)
Although the events of Saturday night and Sunday
morning are Weil known to the KELLY shipis Company. they are
included here as being of interest to the rest of the Flotilla. `
Shortly after dark information was received which enabled
Course to be altered to close a German submarine on the surface.
Before the KELLY could arrive the submarine had dived, but a
Pemarkably good contact was obtained, and an att
C0ntact was maintained all night in order that the final attack
Sh0uld be carried out by daylight- Unfortunately no Oil, wreckage
'OP Survivors came to the surface, but air bUbb1S appeared after the
1&St attack, which makes it possible, although by no means certain,
that the submarine was destroyed. - _
THE KINGSTONS PROGRAIME. ~ -
Today the KINGSTON will be inspected by the Commander-
in~Chief, Portsmouth, and will then proceed to sea for acceptance

Construction of the corpus

   Generate two copies of each dataset
   Manual correction of one of the copies
      Used to evaluate the impact of the noise on NE extraction
   Tokenization and POS tagging using TreeTagger
   Conversion of the TreeTagger output into standard stand-off XML
   Import of the data into the MMAX2 annotation tool
   Manual annotation of the named entities
   Control of the reliability of the annotation using the kappa coefficient
      K = 0.93
      K > 0.8 is considered reliable
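
The slides do not specify which kappa variant was used. Assuming Cohen's kappa for two annotators who label the same sequence of tokens, the coefficient could be computed as in the sketch below; the function name and the label values are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most tokens carry no entity label at all.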

Corpus data (KCL)

            Wiener Library          KCL

            Raw      Corrected      Raw      Corrected

Files       17       17             33       33
Words       4415     4398           16982    15693
PER         75       83             82       80
LOC         60       63             170      178
ORG         13       13             52       60
Total       148      159            305      319

Results of the NE extraction

          Raw                        Corrected

          P      R      F1           P      R      F1

  AL      0.61   0.38   0.47         0.63   0.38   0.48
  OC      0.75   0.29   0.41         0.69   0.30   0.42
  ON      0.42   0.12   0.19         0.53   0.13   0.21
  ST      0.57   0.52   0.54         0.60   0.61   0.60

  AL = Alchemy, OC = OpenCalais, ON = OpenNLP, ST = Stanford NER
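
F1 in the table is the harmonic mean of precision (P) and recall (R). A minimal sketch of the calculation follows; because it starts from the rounded P and R values shown on the slide, a recomputed F1 can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Stanford NER on raw text, using the P and R values from the table.
stanford_raw_f1 = round(f1_score(0.57, 0.52), 2)
```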

Results of the NE extraction

    Low performance of the tools on both corrected and raw text
    Our data and the data used for training and evaluation of the tools are
     quite different.
    PER: non-standard forms such as
       [Last name, First name]
             Wa1ter, Klaus
       Parentheses together with initials of the name
             Captain (D)
       Some cases can be resolved using easy heuristics in preprocessing
    Names of persons and locations are used for other kinds of entities:
             Warships have been annotated as PER
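
As an illustration of the kind of easy preprocessing heuristic mentioned above, the sketch below rewrites "Last name, First name" pairs into "First name Last name" order before the text is passed to an NE extraction tool. The pattern and the function name are assumptions, not the heuristics actually implemented in the project.

```python
import re

# Two capitalized tokens separated by a comma and a space. \w also accepts
# digits, so OCR-damaged names such as "Wa1ter" still match.
LAST_FIRST = re.compile(r"\b([A-Z]\w+), ([A-Z]\w+)\b")


def reorder_names(text: str) -> str:
    """Rewrite 'Last, First' as 'First Last' wherever the pattern matches."""
    return LAST_FIRST.sub(r"\2 \1", text)
```

The pattern is deliberately naive: it would also swap any other pair of capitalized words separated by a comma, such as "Chief, Portsmouth", so in practice it would need to be restricted, for example to lines known to contain name lists.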

Results of the NE extraction

   Performance of extraction of entities of type ORG is very low
      F1 between 0.11 and 0.32
      Names of organizations appear in non-standard forms
      Some of the organizations no longer exist and are not part of the
        knowledge used to train the systems.
           The SS and other relevant Nazi organizations were not detected.
   Spelling errors and typos in the original files:
      OpenCalais used general knowledge to resolve this problem
      Use of general knowledge may be problematic.
           "Klan, Walter" was resolved to "Ku Klux Klan"


Conclusions

   Manual correction of OCR output does not significantly improve
    performance.
      Raw output is enough to obtain provisional index candidates
   Focus in the near term:
      Identify the most frequent error patterns
      Implement a preprocessing pipeline using simple heuristics and
        pattern matching tools
   Focus in the longer term:
      Use domain-specific knowledge in the form of authority files to
        validate and correct the output of NE extraction tools.
      Explore the possibility of combining different NE extraction tools
        and selecting output using a voting algorithm
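
The slides only propose exploring a voting algorithm and do not define one. A minimal sketch, assuming each tool returns a set of (start, end, label) spans already normalized to PER, LOC and ORG, and assuming a simple threshold on the number of agreeing tools:

```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Keep entity spans proposed by at least `min_votes` tools.

    `predictions` maps a tool name to a set of (start, end, label) tuples.
    Both this data layout and the threshold rule are assumptions.
    """
    counts = Counter()
    for spans in predictions.values():
        counts.update(set(spans))  # each tool votes once per span
    return {span for span, n in counts.items() if n >= min_votes}


example = {
    "ST": {(0, 2, "PER"), (5, 6, "LOC")},
    "OC": {(0, 2, "PER")},
    "AL": {(5, 6, "ORG")},
    "ON": set(),
}
agreed = vote(example)
```

Here only the span (0, 2, "PER") survives, because it is the only one proposed by two tools with the same boundaries and label. Requiring exact span agreement is strict; a looser variant could vote on overlapping spans instead.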

