Good dictionaries are key to text mining. We present an idea for a platform where users can create their own dictionaries and text-mining pipelines.
4. A long list of requests
Use case           Semantic type                 Dictionary type                      Document type        Section                                             Metadata         Delivery method
OpenAIRE           accession numbers             pattern (e.g., [0-9][A-Za-z0-9]{3})  patents              Title, Claim, Description, Abstract, Figure, Table  Pubyear, IPCR    summary table
ERC                grant identifiers             pattern                              articles             Acknowledgements                                    -                search index
CTTV               gene, disease                 term (e.g., IBD)                     articles, abstracts  -                                                   -                json
ELIXIR-EXCELERATE  resource names                term                                 articles             -                                                   -                summary table
1000 Genomes       cell line names               pattern                              articles             !Acknowledgements                                   -                REST API
Wikipedia          accession numbers             pattern                              wikipages            -                                                   -                summary table
Kew Gardens        species names (multilingual)  term                                 articles             -                                                   -                summary table
ChEMBL             resource name                 term                                 articles             -                                                   Author, Journal  summary table
Ensembl            genomic range                 pattern                              articles             -                                                   -                summary table
5. Scalable Text Mining
For the last few years, we have been having a pipeline crisis!
A long list of requests and our slow responses make you unhappy.
Even worse, it's a long tail!
The same pipeline is never used for two requests.
Every time, we have to build a new pipeline.
We need a new approach to solve this crisis.
6. Objective
We want to build a LEGO-like platform that helps you to
build your own text-mining pipeline and your own
dictionary.
7. A Key Block: Dictionary-Based Tagger
Role: to identify names (e.g., proteins, species, accession numbers).
Dictionary-based approach for mining names.
Simple
Readable
Interactive
Building a dictionary is a VERY iterative process:
20% of the effort goes into building an initial dictionary and the rest into refining it.
Good dictionaries are a key for text-mining success
stories.
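To make the tagger block concrete, here is a minimal sketch of dictionary-based tagging with terms and patterns. It is an illustration under assumptions, not the platform's implementation: the names `Annotation`, `TERMS`, `PATTERNS` and `tag` are hypothetical, the "IBD" identifier is a placeholder, and matching is case sensitive on word boundaries.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int
    end: int
    text: str
    entry_id: str


# Hypothetical dictionary: terms are literal strings, patterns are regular expressions.
TERMS = {"basal cell": "EFO_0001997", "IBD": "PLACEHOLDER_ID"}
PATTERNS = {r"HG[0-9]{5}": "1000 genomes"}


def tag(text: str) -> list[Annotation]:
    """Return every dictionary hit in text, sorted by start offset."""
    hits = []
    for term, entry_id in TERMS.items():
        # Escape the term so it is matched literally, on word boundaries.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append(Annotation(m.start(), m.end(), m.group(), entry_id))
    for pattern, entry_id in PATTERNS.items():
        # Wrap the pattern in a group so alternations stay inside the boundaries.
        for m in re.finditer(r"\b(?:" + pattern + r")\b", text):
            hits.append(Annotation(m.start(), m.end(), m.group(), entry_id))
    return sorted(hits)


hits = tag("Sample HG00096 was derived from a basal cell line.")
```

Each hit carries its offsets, so a later block (for example a section tagger or a filter) can decide whether to keep it.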
10. Pipeline Stories
CTTV
As a researcher, I want to find articles with supporting evidence for drug discovery.
ERC
As a funder, I want to make funded articles more searchable.
ELIXIR-EXCELERATE
As a resource manager, I want to know the impact of resources.
11. Second, Find & Describe Blocks You Need
When you want                                 You can use
to extract a sentence                         Sentence splitter
to limit your mining to an article section    Section tagger
to identify disease names                     Dictionary-based tagger
to identify database identifiers              Dictionary-based tagger
to find relations between genes and diseases  Relation extractor
to get some analytics                         Summary table generator
to get article metadata                       Europe PMC REST API
to produce text-mined data in RDF             RDF generator
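One way to picture the LEGO-like idea is to treat each block as a function from a document to an enriched document and to chain them. The sketch below assumes a document is a plain dict; `sentence_splitter`, `make_dictionary_tagger` and `build_pipeline` are hypothetical names, and the sentence splitting is deliberately naive.

```python
import re
from functools import reduce
from typing import Callable

# Assumption: a document is a dict that every block reads from and adds to.
Doc = dict
Block = Callable[[Doc], Doc]


def sentence_splitter(doc: Doc) -> Doc:
    # Naive split after ., ! or ? followed by whitespace; abbreviations are not handled.
    sentences = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    return {**doc, "sentences": sentences}


def make_dictionary_tagger(terms: dict[str, str]) -> Block:
    def tagger(doc: Doc) -> Doc:
        hits = [
            {"sentence": i, "term": term, "id": entry_id}
            for i, sentence in enumerate(doc["sentences"])
            for term, entry_id in terms.items()
            if re.search(r"\b" + re.escape(term) + r"\b", sentence)
        ]
        return {**doc, "annotations": hits}
    return tagger


def build_pipeline(*blocks: Block) -> Block:
    """Chain blocks left to right into a single callable."""
    return lambda doc: reduce(lambda d, block: block(d), blocks, doc)


pipeline = build_pipeline(
    sentence_splitter,
    make_dictionary_tagger({"basal cell": "EFO_0001997"}),
)
result = pipeline({"text": "We cultured cells. A basal cell marker was used."})
```

Because every block has the same shape, swapping the tagger for a relation extractor or appending a summary table generator would not change the rest of the pipeline.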
14. How to Revise a Dictionary?
We want to build an expressive language for filtering.
Global filtering rule
Term length > 2
Case sensitive
Per-entry filtering rule
A term should be tagged when it is mentioned in the Methods section.
A pattern should be tagged when it follows the term "omim".
Blacklist: e.g., stop words
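The global rules and the blacklist could be sketched as follows. This is only an illustration: `revise` and `matches` are hypothetical helpers, the stop words are examples, and "length > 2" is read as "at least three characters".

```python
import re


def revise(dictionary: dict[str, str], blacklist: set[str], min_length: int = 3) -> dict[str, str]:
    """Drop entries that are too short or blacklisted (blacklist compared in lower case)."""
    return {
        term: entry_id
        for term, entry_id in dictionary.items()
        if len(term) >= min_length and term.lower() not in blacklist
    }


def matches(term: str, text: str, case_sensitive: bool = True) -> bool:
    """Whole-word match; case sensitive by default, as in the global rule."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(r"\b" + re.escape(term) + r"\b", text, flags) is not None


# Hypothetical entries and identifiers, for illustration only.
revised = revise({"IBD": "id-1", "an": "id-2", "The": "id-3"}, blacklist={"the", "and"})
```

Here "an" is removed by the length rule and "The" by the blacklist, leaving only "IBD".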
15. Per-Entry Rules
A spreadsheet per entry
Definitions
Context: should (not) be after a term.
Section: should (not) be mentioned in a section.
URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists
Entry information                                    Filtering rules
Term/Pattern  Entry       ID           DB            Context        Section  URI
Pattern       HG[0-9]{5}  -            1000 genomes  !(grant|fund)  !ACK     -
Term          basal cell  EFO_0001997  efo           -              Methods  Yes
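The two example entries can be expressed as per-entry rules along these lines. The sketch makes several assumptions: `EntryRule` and `apply_rule` are hypothetical names, "after a term" is approximated by looking at the 30 characters before a hit, section labels are plain strings such as "Methods" and "ACK", and the URI check is left out because it needs network access.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryRule:
    pattern: str                       # regex for the term or pattern
    not_after: Optional[str] = None    # Context: reject a hit preceded by this regex
    section: Optional[str] = None      # Section: hit must be in this section
    not_section: Optional[str] = None  # Section: hit must not be in this section


def apply_rule(rule: EntryRule, text: str, section: str, window: int = 30) -> list[str]:
    """Return the hits in text that survive the entry's section and context rules."""
    if rule.section and section != rule.section:
        return []
    if rule.not_section and section == rule.not_section:
        return []
    kept = []
    for m in re.finditer(rule.pattern, text):
        # Assumed context window: the characters immediately before the hit.
        left = text[max(0, m.start() - window):m.start()]
        if rule.not_after and re.search(rule.not_after, left):
            continue
        kept.append(m.group())
    return kept


cell_line = EntryRule(r"\bHG[0-9]{5}\b", not_after=r"grant|fund", not_section="ACK")
basal_cell = EntryRule(r"\bbasal cell\b", section="Methods")
```

With these rules, `HG00096` is kept in "Sample HG00096 was sequenced." but rejected in "Supported by grant HG00096.", where it most likely denotes a grant number.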
16. Analytics
Summary table
PMCID         Term                ID           Frequency
PMCID4698870  Nutlin-3            ChEBI:46742  16
PMCID4698870  cell cycle arrests  GO:0007050   6
Top 100 frequent terms
Top  Name     Document Freq.  Collection Freq.
1    protein  678,987         1,823,783
2    water    563,234         1,233,332
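Both tables can be derived from a flat list of annotations. The sketch below assumes each mention is a (document ID, term) pair and that the top list is ranked by document frequency; `frequencies` and `top_terms` are hypothetical names and the sample data is made up.

```python
from collections import Counter


def frequencies(annotations: list[tuple[str, str]]) -> tuple[Counter, Counter]:
    """Return (document frequency, collection frequency) per term."""
    # Collection frequency counts every mention.
    collection_freq = Counter(term for _, term in annotations)
    # Document frequency counts each (document, term) pair once.
    document_freq = Counter(term for _, term in set(annotations))
    return document_freq, collection_freq


def top_terms(annotations: list[tuple[str, str]], n: int = 100) -> list[tuple[int, str, int, int]]:
    """Rank terms by document frequency, then collection frequency, then name."""
    document_freq, collection_freq = frequencies(annotations)
    ranked = sorted(
        collection_freq,
        key=lambda t: (-document_freq[t], -collection_freq[t], t),
    )
    return [
        (rank, term, document_freq[term], collection_freq[term])
        for rank, term in enumerate(ranked[:n], start=1)
    ]


# Made-up mentions: "protein" appears twice in doc1 and once in doc2.
sample = [("doc1", "protein"), ("doc1", "protein"), ("doc2", "protein"), ("doc1", "water")]
table = top_terms(sample)
```

The per-document summary table is the same idea with `Counter(annotations)`, which counts each (document, term) pair.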
18. Wrap Up
What is your pipeline story?
Have you managed to create your own dictionary?
What service blocks are missing?
What should the interfaces be?
How should we deliver?