The document discusses approximating subgraph matching for detecting variations in topics. It presents a method for identifying a common template from a collection of texts on a topic by aligning the texts to the template. Templates are represented as subgraphs with frequent specializations. The method constructs a semantic graph from news data and then performs subgraph matching and generalization/specialization to mine templates representing event schemas. Preliminary results on 5 test domains showed the approach could extract templates in 10-60 seconds.
1. Approximate subgraph matching
for detection of topic variations
Mitja Trampu邸
Dunja Mladeni
AI Lab, Jo転ef Stefan Institute, Slovenia
ailab.ijs.si
2. Mining Diversity
Web content varies in many aspects, e.g.
o Topical
o Social (author, target audience, people written about)
o Geographical (publisher, places written about)
o Sentiment (positive/negative)
o Writing style (structure, vocabulary)
o Coverage bias
This work: (micro-)topical diversity
o Macroscopic = largely solved
o Microscopic = challenge
AI Lab, Jozef Stefan Institute
ailab.ijs.si
3. Task:
Given a collection of texts on a topic,
identify a common template
align texts to the template
AI Lab, Jozef Stefan Institute
ailab.ijs.si
4. Template representation
Syntactic
o info1: X people were killed / killed X people / resulted in X
casualties
o info2: blew up Y / destroyed Y / attacked in a Y
Semantic
o kill(bomber, people); count(people, X)
o destroy(bomber, Y)
X count
kill bomber destroy Y
people
AI Lab, Jozef Stefan Institute
ailab.ijs.si
5. car
drive
12 count
kill suicide destroychurch
believers
bomber
exit run wear vest
Semantic Graph
treatment execute attack
Prerequisite:
receive withstand
100 count
kill terroristdemolish
hospital
patients
2 count
police slaughter blow uppolice
bomber
officer station
AI Lab, Jozef Stefan Institute
ailab.ijs.si
6. Mining Templates
Template := subgraph with frequent
specializations
o Specializations implied by background taxonomy
(WordNet)
o Threshold frequency manually defined
X count
kill bomberdestroy building
people
12 tear
count suicide down 2
believers kill church count
police slaughter police
blow up
bomber bomber
officer station
AI Lab, Jozef Stefan Institute
ailab.ijs.si
7. Semantic Graph Construction
1. Data: Google News crawl
2. HTML cleanup
3. Named entity tagging
4. Pronoun resolution (he/she/him/her)
5. Named entity consolidation (Barack Obama
vs President Obama)
6. Parsing, triple/fact/assertion extraction
(for now: subj-verb-obj only)
7. Ontology/taxonomy alignment
8. Merging triples into a graph
AI Lab, Jozef Stefan Institute
ailab.ijs.si
8. Approximate subgraph matching
number count
kill person destroy
location
people
FREQUENT
SUBTREE MINING
count
number
people kill person destroylocation number count destroy
kill person location
people
GENERALIZE
12 count tear
suicide down 2 count slaughter blow up
believers kill church police police
bomber bomber
officer station
AI Lab, Jozef Stefan Institute
ailab.ijs.si
9. Approximate subgraph matching
number count
kill person destroy
location
people
SPECIALIZE
number count
kill bomberdestroy
building
people
12 tear
count suicide down 2
believers kill church count
police slaughter police
blow up
bomber bomber
officer station
AI Lab, Jozef Stefan Institute
ailab.ijs.si
10. Preliminary results
5 test domains; for each:
o ~10 graphs, ~10000 nodes
o 10-60 seconds
At min. support 30%
o 20 maximal patterns, 9 manually judged as interesting
AI Lab, Jozef Stefan Institute
ailab.ijs.si
12. Conclusion
Future work:
o Mapping text -> semantics
Other ontologies?
o Interestingness measure for assertions and patterns
o Evaluation (precision, recall; multiple domains)
o Alternative approaches to generalizing subgraphs
Template extraction is achievable,
but not easy
Human filtering of results hard to avoid
Current approach reasonably fast
AI Lab, Jozef Stefan Institute
ailab.ijs.si
14. Can we extract all relations?
Kind of
Thousands of small quakes
resumed 18 months ago and
continue to rattle Mammoth
Lakes, June Lake and other Mono
County resort towns. The
temblors, most measuring 1 to 3
on the Richter scale, started
beneath Mammoth Mountain.
Subject Verb Object
Triplets
Semantic Graph
16. Templates - why
Interpret content
o news archives: structure/annotate old texts, enable
semantic search
o wikipedia: suggestions for infobox entries
Generate content
o wikipedia: a starting point for new articles / a checklist of
information to be included
No normative definition of good Jozef Stefanailab.ijs.si
AI Lab,
template
Institute
17. Evaluation
Qualitative
o Usage-specific
o Not useful for tuning algorithms
Quantitative
o Precision
o Recall
AI Lab, Jozef Stefan Institute
ailab.ijs.si