
Scalable Text Mining
Jee-Hyub Kim
Text-Mining Pipeline Builder
Literature Services Team
2 Feb 2016
A Text-Mining Pipeline
 Text-Mining Pipeline Crisis
 Session 1: Build Your Own Pipeline
 Session 2: Build Your Own Dictionary
 Wrap Up
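A long list of requests
Use case | Semantic type | Dictionary type | Document type | Section | Metadata | Delivery method
OpenAIRE | accession numbers | pattern (e.g., [0-9][A-Za-z0-9]{3}) | patents | Title, Claim, Description, Abstract, Figure, Table | Pubyear, IPCR | summary table
ERC | grant identifiers | pattern | articles | Acknowledgements | - | search index
CTTV | gene, disease | term (e.g., IBD) | articles, abstracts | - | - | JSON
ELIXIR-EXCELERATE | resource names | term | articles | - | - | summary table
1000 Genomes | cell line names | pattern | articles | !Acknowledgements | - | REST API
Wikipedia | accession numbers | pattern | wikipages | - | - | summary table
KEW Garden | species names (multilingual) | term | articles | - | - | summary table
ChEMBL | resource name | term | articles | - | Author, Journal | summary table
Ensembl | genomic range | pattern | articles | - | - | summary table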
Scalable Text Mining
 For the last few years, we've been having a pipeline crisis!
 A long list of requests and our slow responses
 Makes you unhappy.
 Even worse, it's a long tail!
 Never the same pipeline used for each request.
 Every time, we have to build a new pipeline.
 We need a new approach to solve this crisis.
Objective
 We want to build a LEGO-like platform that helps you to
build your own text-mining pipeline and your own dictionary.
A Key Block: Dictionary-Based Tagger
 Role: To identify names (e.g., proteins, species,
accession numbers, etc.)
 Dictionary-based approach for mining names.
 Simple
 Readable
 Interactive
 Building a dictionary is a VERY iterative process
 20% for building an initial dictionary and the rest for
refining it.
 Good dictionaries are key to text-mining success stories.
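A minimal sketch of what a dictionary-based tagger could look like, assuming a dictionary that holds both literal terms and regular-expression patterns. The names `TERMS`, `PATTERNS` and `tag` are hypothetical and only illustrate the idea; the example entries (basal cell, HG[0-9]{5}) are borrowed from the per-entry rules slide.

```python
import re

# Hypothetical dictionary: literal terms map to entry IDs,
# patterns (regular expressions) map to a source database name.
TERMS = {"basal cell": "EFO_0001997"}
PATTERNS = {"1000 genomes": re.compile(r"\bHG[0-9]{5}\b")}


def tag(text):
    """Return (start, end, matched text, label) tuples, sorted by position."""
    hits = []
    for term, entry_id in TERMS.items():
        # Literal terms are matched on word boundaries, case-sensitively.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append((m.start(), m.end(), m.group(), entry_id))
    for db, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(), db))
    return sorted(hits)


example = "Samples HG00096 and HG00097 were compared with basal cell lines."
tagged = tag(example)
```

Because the dictionary is plain data, refining it (the remaining 80% of the effort) means editing entries rather than code.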
Agile Revision Process
Session 1
Build Your Own Pipeline
As a ..., I want a pipeline to do ...
Pipeline Stories
 CTTV: As a researcher, I want to find articles with
supporting evidence from drug discovery.
 ERC: As a funder, I want to make funded articles more
searchable.
 ELIXIR-EXCELERATE: As a resource manager, I want to know
the impact of resources.
Second, Find & Describe Blocks You Need
When you want | You can use
to extract a sentence | Sentence splitter
to limit your mining to an article section | Section tagger
to identify disease names | Dictionary-based tagger
to identify database identifiers | Dictionary-based tagger
to find relations between genes and diseases | Relation extractor
to get some analytics | Summary table generator
to get article metadata | Europe PMC REST API
to produce text-mined data in RDF | RDF generator
Then, Build a Pipeline using Blocks
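One way to picture the LEGO-like idea is to treat each block as a function that takes a document and returns it with more information attached, so a pipeline is just a chain of blocks. This is a sketch under that assumption; the block implementations and the `build_pipeline` helper are hypothetical stand-ins, not the actual services.

```python
import re
from functools import reduce


def sentence_splitter(doc):
    # Naive split after sentence-final punctuation; a real splitter is more careful.
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s]
    return doc


def dictionary_tagger(doc, terms=("IBD",)):
    # Record (sentence index, term) for every sentence that mentions a term.
    doc["annotations"] = [
        (i, term)
        for i, sentence in enumerate(doc["sentences"])
        for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", sentence)
    ]
    return doc


def summary_table(doc):
    # Count how many sentences mention each term.
    counts = {}
    for _, term in doc["annotations"]:
        counts[term] = counts.get(term, 0) + 1
    doc["summary"] = counts
    return doc


def build_pipeline(*blocks):
    """Chain blocks left to right into a single callable."""
    return lambda doc: reduce(lambda d, block: block(d), blocks, doc)


pipeline = build_pipeline(sentence_splitter, dictionary_tagger, summary_table)
result = pipeline({"text": "IBD is chronic. Genes are linked to IBD. Water is wet."})
```

Swapping, adding or removing a block changes only the argument list of `build_pipeline`, which is the property a build-your-own platform needs.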
Session 2
Build Your Own Dictionary
Designing filtering rules
How to Revise a Dictionary?
 We want to build an expressive language for filtering.
 Global filtering rule
 A length of term > 2
 Case sensitive
 Per-entry filtering rule
 A term should be tagged when it is mentioned in the
Methods section.
 A pattern should be tagged when it follows the term "omim".
 Blacklist: e.g., stop words
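The global rules could be expressed roughly as follows. This is a sketch, assuming the blacklist is a plain set of stop words and that "case sensitive" refers to how terms are matched in text; `STOP_WORDS`, `keep_term` and `find_mentions` are hypothetical names.

```python
import re

STOP_WORDS = {"the", "and", "was"}  # hypothetical blacklist


def keep_term(term, min_length=3, blacklist=STOP_WORDS):
    """Global rules: term length > 2 and term not blacklisted (exact case)."""
    return len(term) >= min_length and term not in blacklist


def find_mentions(term, text, case_sensitive=True):
    """Return start offsets of whole-word mentions of term in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.start() for m in re.finditer(pattern, text, flags)]


candidates = ["WAS", "was", "IL", "IBD"]
kept = [t for t in candidates if keep_term(t)]

text = "WAS was mutated in the IBD cohort."
strict = find_mentions("WAS", text)
loose = find_mentions("WAS", text, case_sensitive=False)
```

Case-sensitive matching is what lets a gene symbol such as WAS survive while the stop word "was" is filtered out.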
Per-Entry Rules
 A spreadsheet per entry
 Definitions
 Context: should (not) be after a term.
 Section: should (not) be mentioned in a section.
 URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists
Entry information: Term/Pattern, Entry, ID, DB
Filtering rules: Context, Section, URI
Term/Pattern | Entry | ID | DB | Context | Section | URI
Pattern | HG[0-9]{5} | - | 1000 genomes | !(grant|fund) | !ACK | -
Term | basal cell | EFO_0001997 | efo | - | Methods | Yes
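The per-entry Context and Section rules could be checked along these lines. This is only a sketch: the `Entry` fields, the `tag` function and the 30-character look-back window used to decide what counts as "after a term" are all assumptions, and the URI check is left out because it needs network access.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    pattern: str                    # regex for the name itself
    db: str
    context: Optional[str] = None   # regex expected (or forbidden) before the name
    context_negated: bool = False   # True means "should NOT be after"
    section: Optional[str] = None
    section_negated: bool = False   # True means "should NOT be in this section"


def tag(entry, text, section, window=30):
    """Return matches of entry.pattern in text that pass the per-entry rules."""
    if entry.section is not None:
        in_section = section == entry.section
        if in_section == entry.section_negated:
            return []
    hits = []
    for m in re.finditer(entry.pattern, text):
        if entry.context is not None:
            before = text[max(0, m.start() - window):m.start()]
            seen = re.search(entry.context, before, re.IGNORECASE) is not None
            if seen == entry.context_negated:
                continue
        hits.append(m.group())
    return hits


cell_line = Entry(r"\bHG[0-9]{5}\b", "1000 genomes",
                  context=r"grant|fund", context_negated=True,
                  section="ACK", section_negated=True)
basal = Entry(r"\bbasal cell\b", "efo", section="Methods")

kept = tag(cell_line, "Cell line HG00096 was sequenced.", "Methods")
grant_like = tag(cell_line, "Funded by grant HG00123.", "Methods")
in_ack = tag(cell_line, "We thank the HG00096 donors.", "ACK")
```

Each `Entry` corresponds to one spreadsheet row, so the rules stay editable by people who do not write code.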
Analytics
 Summary table
PMCID | Term | ID | Frequency
PMCID4698870 | Nutlin-3 | ChEBI:46742 | 16
PMCID4698870 | cell cycle arrests | GO:0007050 | 6
 Top 100 frequent terms
Top | Name | Document Freq. | Collection Freq.
1 | protein | 678,987 | 1,823,783
2 | water | 563,234 | 1,233,332
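Both tables can be derived from a flat list of annotations. The sketch below assumes one (document ID, term) pair per mention and ranks the top terms by document frequency; the ranking key, the sample data and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical annotations: one (document ID, term) pair per mention.
annotations = [
    ("PMC1", "protein"), ("PMC1", "protein"), ("PMC1", "water"),
    ("PMC2", "protein"),
    ("PMC3", "protein"), ("PMC3", "water"),
]


def summary_table(annotations):
    """Per-document frequency of each term: (document, term, frequency) rows."""
    counts = Counter(annotations)
    return sorted((doc, term, n) for (doc, term), n in counts.items())


def top_terms(annotations, k=100):
    """Rows of (rank, term, document frequency, collection frequency)."""
    collection_freq = Counter(term for _, term in annotations)
    # Deduplicate pairs so each document counts once per term.
    document_freq = Counter(term for _, term in set(annotations))
    ranked = sorted(collection_freq, key=lambda t: (-document_freq[t], t))
    return [
        (rank, term, document_freq[term], collection_freq[term])
        for rank, term in enumerate(ranked[:k], start=1)
    ]


per_document = summary_table(annotations)
top = top_terms(annotations)
```

Document frequency counts how many documents mention a term at least once, while collection frequency counts every mention, which is why the second number is never smaller than the first.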
Spreadsheet for Filtering Rules
http://tinyurl.com/zlwbx2y
Wrap Up
 What is your pipeline story?
 Have you managed to create your own dictionary?
 What service blocks are missing?
 What should be the interfaces?
 How should we deliver?


Scalable Text Mining

  • 1. Scalable Text Mining Jee-Hyub Kim Text-Mining Pipeline Builder Literature Services Team 2 Feb 2016
  • 3. Contents Text-Mining Pipeline Crisis Session 1: Build Your Own Pipeline Session 2: Build Your Own Dictionary Wrap Up
  • 4. Use case Semantic type Dictionary type Document type Section Metadata Delivery method OpenAIRE accession numbers pattern (e.g., [0-9][A-Za-z0-9]{3}) patents Title, Claim, Description, Abstract, Figure, Table Pubyear, IPCR summary table ERC grant identifiers pattern articles Acknowledgements search index CTTV gene, disease term (e.g., IBD) articles, abstracts JSON ELIXIR-EXCELERATE resource names term articles summary table 1000 Genomes cell line names pattern articles !Acknowledgements REST API Wikipedia accession numbers pattern wikipages summary table KEW Garden species names (multilingual) term articles summary table ChEMBL resource name term articles Author, Journal summary table Ensembl genomic range pattern articles summary table A long list of requests
  • 5. Scalable Text Mining For the last few years, we've been having a pipeline crisis! A long list of requests and our slow responses Makes you unhappy. Even worse, it's a long tail! Never the same pipeline used for each request. Every time, we have to build a new pipeline. We need a new approach to solve this crisis.
  • 6. Objective We want to build a LEGO-like platform that helps you to build your own text-mining pipeline and your own dictionary.
  • 7. A Key Block: Dictionary-Based Tagger Role: To identify names (e.g., proteins, species, accession numbers, etc.) Dictionary-based approach for mining names. Simple Readable Interactive Building a dictionary is a VERY iterative process 20% for building an initial dictionary and the rest for refining it. Good dictionaries are a key for text-mining success stories.
  • 9. Session 1 Build Your Own Pipeline As a ..., I want a pipeline to do ...
  • 10. Pipeline Stories CTTV As a researcher, I want to find articles with supporting evidence from drug discovery. ERC As a funder, I want to make funded articles more searchable. ELIXIR-EXCELERATE As a resource manager, I want to know the impact of resources.
  • 11. Second, Find & Describe Blocks You Need When you want You can use to extract a sentence Sentence splitter to limit your mining to an article section Section tagger to identify disease names to identify database identifiers Dictionary-based tagger to find relations between genes and diseases Relation extractor to get some analytics Summary table generator to get article metadata Europe PMC REST API to produce text-mined data in RDF RDF generator
  • 12. Then, Build a Pipeline using Blocks
  • 13. Session 2 Build Your Own Dictionary Designing filtering rules
  • 14. How to Revise a Dictionary? We want to build an expressive language for filtering. Global filtering rule A length of term > 2 Case sensitive Per-entry filtering rule A term should be tagged when it is mentioned in Methods section. A pattern should be tagged when it follows a term omim Blacklist: e.g., stop words
  • 15. Per-Entry Rules A spreadsheet per entry Definitions Context: should (not) be after a term. Section: should (not) be mentioned in a section. URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists Entry information Filtering rules Term/Pattern Entry ID DB Context Section URI Pattern HG[0-9]{5} 1000 genomes !(grant|fund) !ACK Term basal cell EFO_0001997 efo Methods Yes
  • 16. Analytics Summary table Top 100 frequent terms PMCID Term ID Frequency PMCID4698870 Nutlin-3 ChEBI:46742 16 PMCID4698870 cell cycle arrests GO:0007050 6 Top Name Document Freq. Collection Freq. 1 protein 678,987 1,823,783 2 water 563,234 1,233,332
  • 17. Spreadsheet for Filtering Rules http://tinyurl.com/zlwbx2y
  • 18. Wrap Up What is your pipeline story? Have you managed to create your own dictionary? What service blocks are missing? What should be the interfaces? How should we deliver?