600.465 Connecting the dots - II (NLP in Practice)
Delip Rao, delip@jhu.edu
Last class
Understood how to approach and do well on NLP tasks: general methodology and approaches
End-to-end development using an example task: Named Entity Recognition
Shared Tasks: NLP in practice
Shared task (aka evaluation):
Everybody works on a (mostly) common dataset
Evaluation measures are defined
Participants get ranked on the evaluation measures
Advances the state of the art and sets benchmarks
Tasks involve common hard problems or new interesting problems
Person Name Disambiguation
One name, many people: musician, psychologist, computational linguist, physicist, photographer, CEO, sculptor, theologist, pastor, biologist, tennis player, ...
Rao, Garera & Yarowsky, 2007
Clustering using web snippets
Goal: to cluster 100 given test documents for the name David Smith
Step 1: Extract the top 1000 snippets from Google
Step 2: Cluster all 1100 documents (test documents plus snippets) together
Step 3: Extract the clustering of the test documents
Rao, Garera & Yarowsky, 2007
Web Snippets for Disambiguation
Snippets contain high-quality, low-noise features
Easy to extract
Derived from sources other than the document (e.g., link text)
Rao, Garera & Yarowsky, 2007
Term bridging via snippets
Document 1 contains the term 780 492-9920; Document 2 contains the term T6G2H1.
A snippet that contains both terms (780 492-9920 and T6G2H1) can serve as a bridge for clustering Document 1 and Document 2 together.
Rao, Garera & Yarowsky, 2007
Evaluating Clustering Output
Dispersion: inter-cluster
Silhouette: intra-cluster
Other metrics: purity, entropy, V-measure
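Purity is the easiest of these to compute by hand. Below is a minimal sketch, assuming gold category labels for the test documents (toy data, not from the paper):

```python
# Minimal purity sketch: the fraction of documents that fall in their
# cluster's majority gold class. Toy data; labels are illustrative.
from collections import Counter

def purity(clusters, gold):
    """clusters: cluster_id -> list of doc ids; gold: doc id -> gold label."""
    majority_total = sum(
        Counter(gold[doc] for doc in docs).most_common(1)[0][1]
        for docs in clusters.values()
    )
    return majority_total / sum(len(docs) for docs in clusters.values())

clusters = {0: ["d1", "d2", "d3"], 1: ["d4", "d5"]}
gold = {"d1": "musician", "d2": "musician", "d3": "ceo", "d4": "ceo", "d5": "ceo"}
print(purity(clusters, gold))  # (2 + 2) / 5 = 0.8
```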
Entity Linking
John Williams: "Richard Kaufman goes a long way back with John Williams. Trained as a classical violinist, Californian Kaufman started doing session work in the Hollywood studios in the 1970s. One of his movies was Jaws, with Williams conducting his score in recording sessions in 1975..."
Michael Phelps: "Debbie Phelps, the mother of swimming star Michael Phelps, who won a record eight gold medals in Beijing, is the author of a new memoir, ..." vs. "Michael Phelps is the scientist most often identified as the inventor of PET, a technique that permits the imaging of biological processes in the organ systems of living individuals. Phelps has ..."
Task: identify the matching KB entry, or determine that the entity is missing from the KB.
Challenges in Entity Linking
Name variation
Abbreviations: BSO vs. Boston Symphony Orchestra
Shortened forms: Osama Bin Laden vs. Bin Laden
Alternate spellings: Osama vs. Ussamah vs. Oussama
Entity ambiguity: polysemous mentions, e.g., Springfield, Washington
Absence: open-domain linking
Not all observed mentions have a corresponding entry in the KB (NIL mentions)
The ability to predict NIL mentions determines KBP accuracy
Largely overlooked in current literature
Entity Linking: Features
Name matching: acronyms, aliases, string similarity, probabilistic FSTs
Document features: TF/IDF comparisons, occurrence of names or KB facts in the query text, Wikitology
KB node: type (e.g., is this a person), features of the Wikipedia page, Google rank of the corresponding Wikipedia page
Absence (NIL indications): does any candidate look like a good string match?
Combinations: e.g., Low-string-match AND Acronym AND Type-is-ORG
Entity Linking: Name Matching
Acronyms
Alias lists: Wikipedia redirects, stock symbols, misc. aliases
Exact match: with and without normalized punctuation, case, accents, appositive removal
Fuzzier matching: Dice score over character uni/bi/tri-grams (sketched below), Hamming distance, recursive longest common substring, subsequences
Word removal (e.g., Inc., US) and abbreviation expansion
Weighted FST for name equivalence: trained models score name-1 as a rewriting of name-2
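As a concrete example of the fuzzy matching above, here is a minimal Dice-coefficient sketch over character bigrams (set-based n-grams are a common simplification; the example names and threshold behavior are illustrative):

```python
# Dice coefficient over character bigrams for fuzzy name matching.
def char_ngrams(s, n=2):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice(a, b, n=2):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(dice("Osama Bin Laden", "Ussamah Bin Ladin"))  # high despite spelling variants
print(dice("Boston Symphony Orchestra", "BSO"))      # low: acronyms need a separate handler
```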
Entity Linking: Document Features
BoW comparisons: TF/IDF and Dice scores between the news article and KB text (a TF/IDF sketch follows below); examined entire articles and passages around query mentions
Named entities: ran BBN's SERIF analyzer on articles; checked for coverage of (1) query co-references and (2) all names/nominals in the KB text; noted type and subtype of the query entity (e.g., ORG/Media)
KB facts: looked to see if a candidate node's attributes are present in the article text (e.g., spouse, employer, nationality)
Wikitology: UMBC system predicts relevant Wikipedia pages (or KB nodes) for text
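To illustrate the BoW comparison, a minimal sketch with scikit-learn (the strings and variable names are illustrative; the slides do not specify the original system's implementation):

```python
# TF/IDF cosine similarity between a query article and candidate KB texts,
# one signal an entity linker can use. Toy strings; scikit-learn required.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "Debbie Phelps, the mother of swimming star Michael Phelps ..."
kb_texts = [
    "Michael Phelps, American swimmer and Olympic gold medalist ...",
    "Michael E. Phelps, biophysicist known for inventing PET imaging ...",
]

vectors = TfidfVectorizer().fit_transform([article] + kb_texts)
scores = cosine_similarity(vectors[0], vectors[1:])  # article vs. each candidate
print(scores)  # the higher-scoring KB node is the better match
```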
Question Answering
Question Answering: Ambiguity
More complication: Opinion Question Answering
Q: What is the international reaction to the reelection of Robert Mugabe as President of Zimbabwe?
A: African observers generally approved of his victory while Western governments strongly denounced it.
Stoyanov, Cardie & Wiebe, 2005; Somasundaran, Wilson, Wiebe & Stoyanov, 2007
Subjectivity and Sentiment Analysis
The linguistic expression of somebody's opinions, sentiments, emotions, evaluations, beliefs, speculations (private states)
Private state: a state that is not open to objective observation or verification (Quirk, Greenbaum, Leech & Svartvik (1985), A Comprehensive Grammar of the English Language)
Subjectivity analysis classifies content as objective or subjective
Thanks: Jan Wiebe
Rao & Ravichandran, 2009
Subjectivity & Sentiment: Applications
Sentiment classification
Document level
Sentence level
Product-feature level: "For a heavy pot, the handle is not well designed."
Find opinion holders and their opinions
Subjectivity & Sentiment: More Applications
Product review mining: Best Android phone in the market?
Sentiment tracking
Tracking sentiments toward topics over time: Is anger ratcheting up or cooling down?
Source: Research.ly
Sentiment Analysis Resources: Lexicons
Rao & Ravichandran, 2009
Sentiment Analysis Resources: Lexicons
Example polarity lexicon entries across languages:
English: amazing +, banal -, bewilder -, divine +, doldrums -, ...
Spanish: aburrido -, inocente +, mejor +, sabroso +, odiar -, ...
French: magnifique +, céleste +, irrégulier -, haine -, ...
(The slide also shows Hindi and Arabic lexicon entries.)
Rao & Ravichandran, 2009
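The simplest use of such a lexicon is additive scoring. A minimal sketch using the English entries above (the tokenizer and scoring rule are deliberately naive):

```python
# Naive lexicon-based polarity: sum entry scores over whitespace tokens.
lexicon = {"amazing": 1, "banal": -1, "bewilder": -1, "divine": 1, "doldrums": -1}

def polarity(text):
    return sum(lexicon.get(token, 0) for token in text.lower().split())

print(polarity("an amazing if banal divine comedy"))  # 1 - 1 + 1 = 1
```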
Sentiment Analysis Resources: Corpora
Pang and Lee movie review corpus
Blitzer multi-domain (Amazon) review corpus
Dependency Parsing
Consider product-feature opinion extraction: "For a heavy pot, the handle is not well designed."
Parse arcs over the clause "the handle is not well designed": det(handle, the), nsubjpass(designed, handle), neg(designed, not), advmod(designed, well)
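A minimal sketch of extracting such arcs with spaCy, a modern dependency parser (not the tool behind these slides; the small English model is assumed to be installed):

```python
# Print dependency arcs as relation(head, dependent) for the slide's example.
# Setup assumption: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("For a heavy pot, the handle is not well designed.")
for token in doc:
    print(f"{token.dep_}({token.head.text}, {token.text})")
# Arcs such as nsubjpass(designed, handle) let an opinion extractor attach
# "not well designed" to "handle" rather than to "pot".
```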
Dependency Representations
Directed graphs:
V is a set of nodes (tokens)
E is a set of arcs (dependency relations)
L is a labeling function on E (dependency types)
Example (Swedish): "På 60-talet målade han djärva tavlor" ("In the-60s painted he bold pictures"), with POS tags PP, NN, VB, PN, JJ, NN and dependency types ADV, PR, SUB, OBJ, ATT
thanks: Nivre
Dependency Parsing: Constraints
Commonly imposed constraints:
Single-head (at most one head per node)
Connectedness (no dangling nodes)
Acyclicity (no cycles in the graph)
Projectivity:
An arc i → j is projective iff, for every k occurring between i and j in the input string, i dominates k (i →* k).
A graph is projective iff every arc in A is projective.
thanks: Nivre
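A minimal sketch of checking projectivity via the equivalent no-crossing-arcs condition for single-rooted trees (the head-index encoding and examples are illustrative):

```python
# Projectivity check: a dependency tree (heads[d-1] = head of token d,
# 0 = artificial root) is projective iff no two arcs cross.
def is_projective(heads):
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, j in arcs:
        for k, l in arcs:
            if i < k < j < l:  # arcs (i, j) and (k, l) cross
                return False
    return True

print(is_projective([2, 0, 2]))     # True: a simple projective chain
print(is_projective([3, 4, 0, 3]))  # False: arc 2<-4 crosses arc 1<-3
```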
Dependency Parsing: Approaches
Link grammar (Sleator and Temperley)
Bilexical grammar (Eisner): lexicalized parsing in O(n³) time
Maximum spanning tree (McDonald)
CoNLL 2006/2007 shared tasks
Syntactic Variations versus Semantic Roles
Roles: Agent (hitter), Patient (thing hit), Instrument, Temporal adjunct
Yesterday, Kristina hit Scott with a baseball
Scott was hit by Kristina yesterday with a baseball
Yesterday, Scott was hit with a baseball by Kristina
With a baseball, Kristina hit Scott yesterday
Yesterday Scott was hit by Kristina with a baseball
The baseball with which Kristina hit Scott yesterday was hard
Kristina hit Scott with a baseball yesterday
thanks: Jurafsky
Semantic Role Labeling
For each clause, determine the semantic role played by each noun phrase that is an argument to the verb.
Roles: agent, patient, source, destination, instrument
John drove Mary from Austin to Dallas in his Toyota Prius.
The hammer broke the window.
Also referred to as case role analysis, thematic analysis, and shallow semantic parsing
thanks: Mooney
SRL Datasets
FrameNet: developed at UCB; based on the notion of frames
PropBank: developed at UPenn; based on elaborating the Treebank
Salsa: developed at Universität des Saarlandes; German version of FrameNet
SRL as Sequence Labeling
SRL can be treated as a sequence labeling problem: for each verb, try to extract a value for each of the possible semantic roles for that verb.
Employ any of the standard sequence labeling methods: token classification, HMMs, CRFs (see the BIO sketch below)
thanks: Mooney
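To make the sequence-labeling framing concrete, here is the example sentence encoded as BIO role tags, the kind of training pairs a CRF or HMM would consume (the tag scheme and span boundaries are illustrative, not from an annotated corpus):

```python
# BIO encoding of semantic roles for one predicate ("drove").
tokens = ["John", "drove", "Mary", "from", "Austin", "to", "Dallas",
          "in", "his", "Toyota", "Prius"]
tags = ["B-agent", "O", "B-patient",                # predicate itself is O
        "B-source", "I-source",                     # "from Austin"
        "B-destination", "I-destination",           # "to Dallas"
        "B-instrument", "I-instrument", "I-instrument", "I-instrument"]
assert len(tokens) == len(tags)
# Extract per-token features (word, POS, position relative to the predicate),
# then feed any standard sequence labeler.
```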
SRL with Parse Trees
Parse trees help identify semantic roles by exploiting syntactic clues, such as "the agent is usually the subject of the verb."
Example: "The man by the store near the dog ate an apple." The man is the agent of ate, not the dog; the parse tree (a subject NP with attached PPs, then the VP) is needed to identify the true subject.
thanks: Mooney
SRL with Parse Trees (continued)
Assume that a syntactic parse is available. For each predicate (verb), label each node in the parse tree as either not-a-role or one of the possible semantic roles.
Color code in the slide: not-a-role, agent, patient, source, destination, instrument, beneficiary (illustrated on the parse tree of a sentence about a dog biting a girl)
thanks: Mooney
Selectional Restrictions
Selectional restrictions are constraints that certain verbs place on the fillers of certain semantic roles:
Agents should be animate
Beneficiaries should be animate
Instruments should be tools
Patients of eat should be edible
Sources and destinations of go should be places
Sources and destinations of give should be animate
Taxonomic abstraction hierarchies or ontologies (e.g., hypernym links in WordNet) can be used to determine whether such constraints are met: John is a Human, which is a Mammal, which is a Vertebrate, which is Animate (see the WordNet sketch below).
thanks: Mooney
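A minimal sketch of testing such constraints with WordNet hypernym chains via NLTK (assumes the WordNet data is downloaded; the synset names used are real WordNet identifiers):

```python
# Does any noun sense of `word` have `required` among its hypernyms?
# Setup assumption: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def satisfies(word, required_synset_name):
    required = wn.synset(required_synset_name)
    for sense in wn.synsets(word, pos=wn.NOUN):
        # closure() walks the hypernym links transitively up the taxonomy.
        if required in sense.closure(lambda s: s.hypernyms()):
            return True
    return False

print(satisfies("dog", "animal.n.01"))   # True: dogs are animate
print(satisfies("hammer", "tool.n.01"))  # True: hammers are tools
```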
Word Senses
"Beware of the burning coal underneath the ash."
Ash:
Sense 1: trees of the olive family with pinnate leaves, thin furrowed bark and gray branches
Sense 2: the solid residue left when combustible material is thoroughly burned or oxidized
Sense 3: to convert into ash
Coal:
Sense 1: a piece of glowing carbon or burnt wood
Sense 2: charcoal
Sense 3: a black solid combustible substance formed by the partial decomposition of vegetable matter without free access to air and under the influence of moisture and often increased pressure and temperature, widely used as a fuel for burning
Disambiguation: self-training via Yarowsky's algorithm (sketched below)
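A minimal sketch of the self-training loop in Yarowsky's algorithm (the classifier interface, seed function, and threshold are placeholders, not the original implementation):

```python
# Yarowsky-style self-training: grow a labeled set from seeds, keeping
# only high-confidence predictions each round. All callables are assumed:
# seed_label(ctx) -> sense or None; train(labeled) -> model;
# predict(model, ctx) -> (sense, prob).
def yarowsky_self_training(seed_label, contexts, train, predict, threshold=0.95):
    labeled = {c: seed_label(c) for c in contexts if seed_label(c) is not None}
    unlabeled = [c for c in contexts if c not in labeled]
    while unlabeled:
        model = train(labeled)
        confident = {}
        for c in unlabeled:
            sense, prob = predict(model, c)
            if prob >= threshold:        # keep only confident new labels
                confident[c] = sense
        if not confident:                # no growth: stop
            break
        labeled.update(confident)
        unlabeled = [c for c in unlabeled if c not in labeled]
    return train(labeled)                # final model on the grown set
```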
Recognizing Textual Entailment
Question: Who bought Overture? >> Expected answer form: X bought Overture
Text: "Overture's acquisition by Yahoo" entails the hypothesized answer "Yahoo bought Overture"
Similar for IE: X acquire Y
Similar for semantic IR, summarization (multi-document), MT evaluation
thanks: Dagan
(Statistical) Machine Translation
Where will we get P(F|E)?
Books in English plus the same books in French give us a P(F|E) model
We call collections stored in two languages parallel corpora or parallel texts
Want to update your system? Just add more text!
thanks: Nigam
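For reference, the noisy-channel decomposition that motivates estimating P(F|E) from parallel text (with P(E) estimated from monolingual English text):

```latex
\hat{E} \;=\; \arg\max_{E} P(E \mid F) \;=\; \arg\max_{E} P(F \mid E)\, P(E)
```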
Machine Translation
Systems:
Early rule-based systems
Word-based models (IBM models)
Phrase-based models (log-linear!)
Tree-based models (syntax-driven)
Adding semantics (WSD, SRL)
Ensemble models
Evaluation:
Metrics (BLEU, BLACK, ROUGE, ...); a BLEU sketch follows below
Corpora (statmt.org)
Toolkits: EGYPT, GIZA++, MOSES, JOSHUA
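As a small illustration of metric-based evaluation, a sentence-level BLEU sketch with NLTK (toy sentences; real evaluation uses corpus-level BLEU over a full test set):

```python
# Sentence-level BLEU with smoothing (short sentences otherwise score 0).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "sat", "on", "the", "mat"]]  # one or more references
hypothesis = ["the", "cat", "is", "on", "the", "mat"]
smooth = SmoothingFunction().method1
print(sentence_bleu(references, hypothesis, smoothing_function=smooth))
```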

Allied Areas and Tasks
Information Retrieval:
TREC (large-scale experiments)
CLEF (Cross-Language Evaluation Forum)
NTCIR
FIRE (South Asian languages)
Allied Areas and Tasks
(Computational) Musicology: MIREX

Editor's Notes

  1. Go very quickly on this slide; do not read it all. DO mention combinations.
  2. Syntactic structure consists of binary, asymmetrical relations between the words of a sentence.