際際滷

際際滷Share a Scribd company logo
Approximate subgraph matching
 for detection of topic variations

                Mitja Trampu邸
                Dunja Mladeni
                AI Lab, Jo転ef Stefan Institute, Slovenia




                                        ailab.ijs.si
Mining Diversity
 Web content varies in many aspects, e.g.
  o   Topical
  o   Social (author, target audience, people written about)
  o   Geographical (publisher, places written about)
  o   Sentiment (positive/negative)
  o   Writing style (structure, vocabulary)
  o   Coverage bias


 This work: (micro-)topical diversity
  o Macroscopic = largely solved
  o Microscopic = challenge

                                            AI Lab, Jozef Stefan Institute
                                                               ailab.ijs.si
Task:
Given a collection of texts on a topic,
 identify a common template
 align texts to the template




                                 AI Lab, Jozef Stefan Institute
                                                    ailab.ijs.si
Template representation
 Syntactic
  o info1: X people were killed / killed X people / resulted in X
    casualties
  o info2: blew up Y / destroyed Y / attacked in a Y


 Semantic
  o kill(bomber, people); count(people, X)
  o destroy(bomber, Y)



             X   count
                              kill bomber destroy     Y
                     people
                                             AI Lab, Jozef Stefan Institute
                                                                ailab.ijs.si
car
                                             drive
                 12      count
                                      kill suicide destroychurch
                            believers
                                           bomber
                      exit run                wear vest
Semantic Graph

                                  treatment           execute             attack
Prerequisite:



                               receive                        withstand
                 100    count
                                         kill terroristdemolish
                                                              hospital
                            patients




                  2     count
                             police slaughter     blow uppolice
                                            bomber
                             officer                     station

                                                     AI Lab, Jozef Stefan Institute
                                                                        ailab.ijs.si
Mining Templates
  Template := subgraph with frequent
               specializations
     o Specializations implied by background taxonomy
       (WordNet)
     o Threshold frequency manually defined

                 X      count
                                        kill bomberdestroy building
                            people




12                             tear
     count             suicide down              2
        believers kill              church           count
                                                          police slaughter           police
                                                                               blow up
                       bomber                                            bomber
                                                         officer                     station
                                                         AI Lab, Jozef Stefan Institute
                                                                            ailab.ijs.si
Semantic Graph Construction
1. Data: Google News crawl
2. HTML cleanup
3. Named entity tagging
4. Pronoun resolution (he/she/him/her)
5. Named entity consolidation (Barack Obama
   vs President Obama)
6. Parsing, triple/fact/assertion extraction
   (for now: subj-verb-obj only)
7. Ontology/taxonomy alignment
8. Merging triples into a graph
                             AI Lab, Jozef Stefan Institute
                                                ailab.ijs.si
Approximate subgraph matching
                       number count
                                              kill person destroy
                                                                location
                                  people




                                                                     FREQUENT
                                                                      SUBTREE MINING

         count
number
             people   kill person destroylocation      number count                            destroy
                                                                                kill person           location
                                                                  people




                                                                                         GENERALIZE


  12     count                     tear
                           suicide down                   2    count      slaughter      blow up
            believers kill              church                      police                     police
                           bomber                                                  bomber
                                                                   officer                     station
                                                                    AI Lab, Jozef Stefan Institute
                                                                                       ailab.ijs.si
Approximate subgraph matching
                   number count
                                        kill person destroy
                                                          location
                              people




                                                  SPECIALIZE


                   number count
                                        kill bomberdestroy
                                                         building
                              people




12                             tear
     count             suicide down                 2
        believers kill              church                count
                                                               police slaughter           police
                                                                                    blow up
                       bomber                                                 bomber
                                                              officer                     station
                                                              AI Lab, Jozef Stefan Institute
                                                                                 ailab.ijs.si
Preliminary results
 5 test domains; for each:
  o ~10 graphs, ~10000 nodes
  o 10-60 seconds


 At min. support 30%
  o 20 maximal patterns, 9 manually judged as interesting




                                          AI Lab, Jozef Stefan Institute
                                                             ailab.ijs.si
AI Lab, Jozef Stefan Institute
                   ailab.ijs.si
Conclusion
 Future work:
  o Mapping text -> semantics
       Other ontologies?
  o Interestingness measure for assertions and patterns
  o Evaluation (precision, recall; multiple domains)
  o Alternative approaches to generalizing subgraphs


 Template extraction is achievable,
  but not easy
 Human filtering of results hard to avoid
 Current approach reasonably fast
                                           AI Lab, Jozef Stefan Institute
                                                              ailab.ijs.si
Thank you.
AI Lab, Jozef Stefan Institute
                   ailab.ijs.si
Can we extract all relations?
                           Kind of 
Thousands of small quakes
resumed 18 months ago and
continue to rattle Mammoth
Lakes, June Lake and other Mono
County resort towns. The
temblors, most measuring 1 to 3
on the Richter scale, started
beneath Mammoth Mountain.

Subject  Verb  Object
       Triplets

                            Semantic Graph
AI Lab, Jozef Stefan Institute
                   ailab.ijs.si
Templates - why
 Interpret content
   o news archives: structure/annotate old texts, enable
     semantic search
   o wikipedia: suggestions for infobox entries


 Generate content
   o wikipedia: a starting point for new articles / a checklist of
     information to be included




 No normative definition of good Jozef Stefanailab.ijs.si
                                AI Lab,
                                        template
                                                Institute
Evaluation
 Qualitative
  o Usage-specific
  o Not useful for tuning algorithms


 Quantitative
  o Precision
  o Recall




                                       AI Lab, Jozef Stefan Institute
                                                          ailab.ijs.si

More Related Content

Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

  • 1. Approximate subgraph matching for detection of topic variations Mitja Trampu邸 Dunja Mladeni AI Lab, Jo転ef Stefan Institute, Slovenia ailab.ijs.si
  • 2. Mining Diversity Web content varies in many aspects, e.g. o Topical o Social (author, target audience, people written about) o Geographical (publisher, places written about) o Sentiment (positive/negative) o Writing style (structure, vocabulary) o Coverage bias This work: (micro-)topical diversity o Macroscopic = largely solved o Microscopic = challenge AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 3. Task: Given a collection of texts on a topic, identify a common template align texts to the template AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 4. Template representation Syntactic o info1: X people were killed / killed X people / resulted in X casualties o info2: blew up Y / destroyed Y / attacked in a Y Semantic o kill(bomber, people); count(people, X) o destroy(bomber, Y) X count kill bomber destroy Y people AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 5. car drive 12 count kill suicide destroychurch believers bomber exit run wear vest Semantic Graph treatment execute attack Prerequisite: receive withstand 100 count kill terroristdemolish hospital patients 2 count police slaughter blow uppolice bomber officer station AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 6. Mining Templates Template := subgraph with frequent specializations o Specializations implied by background taxonomy (WordNet) o Threshold frequency manually defined X count kill bomberdestroy building people 12 tear count suicide down 2 believers kill church count police slaughter police blow up bomber bomber officer station AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 7. Semantic Graph Construction 1. Data: Google News crawl 2. HTML cleanup 3. Named entity tagging 4. Pronoun resolution (he/she/him/her) 5. Named entity consolidation (Barack Obama vs President Obama) 6. Parsing, triple/fact/assertion extraction (for now: subj-verb-obj only) 7. Ontology/taxonomy alignment 8. Merging triples into a graph AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 8. Approximate subgraph matching number count kill person destroy location people FREQUENT SUBTREE MINING count number people kill person destroylocation number count destroy kill person location people GENERALIZE 12 count tear suicide down 2 count slaughter blow up believers kill church police police bomber bomber officer station AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 9. Approximate subgraph matching number count kill person destroy location people SPECIALIZE number count kill bomberdestroy building people 12 tear count suicide down 2 believers kill church count police slaughter police blow up bomber bomber officer station AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 10. Preliminary results 5 test domains; for each: o ~10 graphs, ~10000 nodes o 10-60 seconds At min. support 30% o 20 maximal patterns, 9 manually judged as interesting AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 11. AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 12. Conclusion Future work: o Mapping text -> semantics Other ontologies? o Interestingness measure for assertions and patterns o Evaluation (precision, recall; multiple domains) o Alternative approaches to generalizing subgraphs Template extraction is achievable, but not easy Human filtering of results hard to avoid Current approach reasonably fast AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 13. Thank you. AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 14. Can we extract all relations? Kind of Thousands of small quakes resumed 18 months ago and continue to rattle Mammoth Lakes, June Lake and other Mono County resort towns. The temblors, most measuring 1 to 3 on the Richter scale, started beneath Mammoth Mountain. Subject Verb Object Triplets Semantic Graph
  • 15. AI Lab, Jozef Stefan Institute ailab.ijs.si
  • 16. Templates - why Interpret content o news archives: structure/annotate old texts, enable semantic search o wikipedia: suggestions for infobox entries Generate content o wikipedia: a starting point for new articles / a checklist of information to be included No normative definition of good Jozef Stefanailab.ijs.si AI Lab, template Institute
  • 17. Evaluation Qualitative o Usage-specific o Not useful for tuning algorithms Quantitative o Precision o Recall AI Lab, Jozef Stefan Institute ailab.ijs.si