STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information
Nitish Aggarwal, joint work with Tobias Wunner and Mihael Arcan
DERI, NUI Galway
firstname.lastname@deri.org
Friday, 19th Aug, 2011
DERI, Friday Meeting

Overview
  • Motivation & Applications
  • Why STL?
  • Semantic
  • Terminology
  • Linguistic
  • Evaluation
  • Conclusion and future work

Motivation & Applications
  • Semantic annotation: similarity between corpus data and ontology concepts
  • Example text: "SAP AG held 1615 million in short-term liquid assets (2009)"
  • Annotation: dbpedia:SAP_AG xEBR:LiquidAssets at dbpedia:year:2009

Motivation & Applications
  • Semantic search: similarity between query and index object
  • Example queries: "SAP liquid asset in 2010", "Current asset of SAP last year", "Net cash of SAP in 2010", "SAP total amount received in 2010"
  • Index object: dbpedia:SAP_AG xEBR:liquid asset at dbpedia:year:2010

Motivation & Applications
  • Ontology matching & alignment: similarity between ontology concepts
  • IFRS concepts: ifrs:StatementOfFinancialPosition, Ifrs:Assets, ifrs:BiologicalAssets, Ifrs:CurrentAssets, Ifrs:NonCurrentAssets, ifrs:PropertyPlantAndEquipment, ifrs:CashAndCashEquivalents, Ifrs:TradeAndOtherCurrentReceivables, Ifrs:Inventories
  • xEBR concepts: xebr:KeyBalanceSheetAssets, xebr:SubscribedCapitalUnpaid, xebr:FixedAssets, xebr:CurrentAssets, xebr:TangibleFixedAssets, xebr:IntangibleFixedAssets, xebr:Amount Receivable, xebr:LiquidAssets
  • Open question for concept pairs across the two ontologies: Similarity = ?

Classical Approaches
  • String similarity: Levenshtein distance, Dice coefficient
  • Corpus-based: LSA, ESA, Google distance, vector-space model
  • Ontology-based: path distance, information content
  • Syntax similarity: word order, part of speech

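As a rough illustration of the string-similarity family named above, the sketch below implements Levenshtein distance and a Dice coefficient. Computing Dice over lowercased word tokens is an assumption made here; the slides do not say which unit (characters, bigrams or words) was used, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def dice_coefficient(a, b):
    """Dice coefficient over sets of lowercased word tokens (assumed unit)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))
```

Both measures look only at surface strings, which is the gap the semantic, terminological and linguistic components are meant to address.
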
Why STL?
  • Semantic: semantic structure and relations
  • Terminology: complex terms expressing the same concept
  • Linguistic: phrase and dependency structure

STL
  • Definition: linear combination of semantic, terminological and linguistic similarity, with weights obtained by linear regression
  • Formula used: STL = w1*S + w2*T + w3*L + Constant
  • w1, w2, w3 represent the contribution of each component

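A minimal sketch of how such weights could be fitted by ordinary least squares, using only the standard library. The (S, T, L) triples and gold scores are synthetic values invented for this example, not data from the talk, and fit_stl_weights and solve_linear_system are hypothetical names.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_stl_weights(features, gold):
    """Least-squares fit of gold ~ w1*S + w2*T + w3*L + constant.

    features: list of (S, T, L) triples; gold: list of reference scores.
    Returns [w1, w2, w3, constant] via the normal equations.
    """
    rows = [[s, t, l, 1.0] for s, t, l in features]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, gold)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic example data (made up): gold scores generated from known weights.
EXAMPLE_FEATURES = [(0.1, 0.9, 0.3), (0.8, 0.2, 0.5), (0.4, 0.4, 0.9),
                    (0.7, 0.6, 0.1), (0.2, 0.3, 0.2), (0.9, 0.8, 0.7)]
EXAMPLE_GOLD = [0.2 * s + 0.5 * t + 0.1 * l + 0.05 for s, t, l in EXAMPLE_FEATURES]
EXAMPLE_WEIGHTS = fit_stl_weights(EXAMPLE_FEATURES, EXAMPLE_GOLD)
```

Because the synthetic gold scores are an exact linear function of the features, the fit recovers the generating weights; with real human judgements the fit would only be approximate.
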
Semantic
  • Wu-Palmer: 2*depth(MSCA) / (depth(c1) + depth(c2))
  • Resnik's information content: IC(c) = -log p(c)
  • Intrinsic information content (Pirro09): avoids the need to analyse large corpora

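The Wu-Palmer formula above can be sketched on a toy hierarchy. The parent links below are an assumption based on the xEBR concept names that appear later in the talk, depth is counted with the root at depth 1 (also an assumption), and all function names are hypothetical.

```python
# Toy hierarchy: child -> parent (assumed edges, for illustration only).
PARENT = {
    "Assets": None,
    "Fixed Assets": "Assets",
    "Current Assets": "Assets",
    "Tangible Fixed Assets": "Fixed Assets",
    "Amount Receivable": "Current Assets",
}


def ancestors(concept):
    """Path from the concept up to the root, including the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Number of nodes on the path to the root (root has depth 1)."""
    return len(ancestors(concept))


def msca(c1, c2):
    """Most specific common abstraction: the deepest shared ancestor."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("concepts share no ancestor")


def wu_palmer(c1, c2):
    """Wu-Palmer similarity: 2*depth(MSCA) / (depth(c1) + depth(c2))."""
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))
```

With this toy tree, two concepts that only meet at the root score low, while a concept and its direct parent score high.
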
Cont.
  • Intrinsic information content (iIC): [formula], where sub(c) is the number of sub-concepts of a given concept c
  • Pirro_Similarity: [formula]

Cont.
  • [Figure: fragment of the xEBR hierarchy under Assets, annotated with sub-concept counts and IC values]
  • MSCA: subconcepts = 48, IC (TFA) = 0.32
  • Concepts below Assets: Subscribed Capital Unpaid, Fixed Assets, Current Assets, Stocks, Tangible Fixed Assets, Amount Receivable
  • Tangible Fixed Assets: subconcepts = 9, IC (TFA) = 0.60
      Other Tangible Fixed Assets; Property, Plant and Equipment; Payments on account and asset in construction; Furniture Fixture and Equipment; Other Fixture; Land and Building; Plant and Machinery; Other Property, Plant and Equipment; Property, Plant and Equipment [Total]
  • Amount Receivable: subconcepts = 6, IC (AR) = 0.69
      Amount Receivable [total]; Amount Receivable within one year; Amount Receivable after more than one year; Trade Debtors; Other Debtors
  • Pirro_Sim = 0.33; Pirro_Sim = ?

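The formulas for intrinsic IC and Pirro_Similarity did not survive in this transcript. The sketch below uses the forms commonly given in the literature, which is an assumption about what the slides showed: intrinsic IC as 1 - log(sub(c) + 1) / log(max), with max the total number of concepts, and Pirró's similarity as 3*IC(MSCA) - IC(c1) - IC(c2) for distinct concepts. Function and parameter names are hypothetical.

```python
import math


def intrinsic_ic(num_subconcepts, total_concepts):
    """Intrinsic information content from the hierarchy alone (assumed form).

    A leaf (0 sub-concepts) gets IC 1.0; a concept subsuming nearly the whole
    hierarchy gets an IC close to 0. No corpus statistics are needed.
    """
    return 1.0 - math.log(num_subconcepts + 1) / math.log(total_concepts)


def pirro_similarity(ic_c1, ic_c2, ic_msca, same_concept=False):
    """Feature-based similarity built on IC values (assumed form).

    Identical concepts score 1.0; otherwise shared information (the MSCA's IC)
    is rewarded and each concept's own information is subtracted.
    """
    if same_concept:
        return 1.0
    return 3.0 * ic_msca - ic_c1 - ic_c2
```

Plugging the IC values printed on the slide (0.32 for the MSCA, 0.69 and 0.60 for the two concepts) into this form gives -0.33, which agrees with the slide's Pirro_Sim = 0.33 only in magnitude, so the talk may use a different sign convention or normalisation.
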
Limitation
  • Does semantic structure reflect a good similarity? Not necessarily.
  • E.g. in xEBR, the parent-child relation is used for describing the layout of concepts
  • "Work in progress" is not a type of asset, although both are linked via the parent-child relationship

Terminology
  • Definition: common naming convention
  • N-gram vs. subterms: in the financial domain, the bigram "Intangible Fixed" is a substring of "Other Intangible Fixed Assets" but not a subterm
  • Terminological similarity: maximal subterm overlap

Cont.
  • Term: Trade Debts Payable After More Than One Year
  • Subterm analysis: [[Trade][Debts]][Payable][After More Than One Year]
  • Subterm sources: [SAP:Payable] [Ifrs:After More Than One Year] [Investoword:Debt] [FinanceDict:Trade Debts] [Investopedia:Trade]
  • Compared term: Financial Debts Payable After More Than One Year
  • Subterm analysis: Financial[Debts][Payable][After More Than One Year]

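The subterm analysis above can be sketched as a lookup of word sequences in a term lexicon followed by an overlap score. The lexicon contents, the Dice-style normalisation and the function names are assumptions for illustration; the slides only say "maximal subterm overlap" and do not give the exact scoring. The sketch also skips lemmatisation, so "debts" is stored in its plural form.

```python
# Hypothetical lexicon of known subterms, lowercased (in the talk these come
# from several terminological sources; the entries here are illustrative).
LEXICON = {
    "trade debts",
    "trade",
    "debts",
    "payable",
    "after more than one year",
}


def find_subterms(term, lexicon):
    """Return every contiguous word sequence of the term that is a known
    subterm. Arbitrary n-grams that are not in the lexicon are ignored."""
    words = term.lower().split()
    found = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            if candidate in lexicon:
                found.add(candidate)
    return found


def terminological_similarity(term_a, term_b, lexicon):
    """Dice-style overlap of the two terms' subterm sets (assumed scoring)."""
    sub_a = find_subterms(term_a, lexicon)
    sub_b = find_subterms(term_b, lexicon)
    if not sub_a or not sub_b:
        return 0.0
    return 2 * len(sub_a & sub_b) / (len(sub_a) + len(sub_b))
```

For the two terms on the slide, the shared subterms under this lexicon are "debts", "payable" and "after more than one year", while "trade debts" and "trade" occur only in the first term.
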
Multilingual Subterms
  • Translated subterms available in other languages
  • Advantage: reflect terminological similarities that may be available in one language but not in others
  • Example: "Property Plant and Equipment"@en, "Sachanlagen"@de, "Tangible Fixed Asset"@en

Linguistic
  • Syntactic information beyond simple word order: phrase structure, dependency structure
  • Phrase structure
      Intangible fixed : adj adj > ??
      Intangible fixed assets : adj adj n > NP
  • Dependency structure
      Amounts receivable : N Adv : receive:mod, amounts:head
      Received amounts : V N : receive:mod, amounts:head

Evaluation
  • Data set: xEBR finance vocabulary
      269 terms (concept labels)
      72,361 (269*269) term pairs
  • Benchmarks
      SimSem59: sample of 59 term pairs
      SimSem200: sample of 200 term pairs (under construction)

Experiment
  • An overview of similarity measures

Experiment Results (SimSem59)
  • STL formula used: STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791
  • Correlation between similarity scores & SimSem59
  • [Figure: semantic contribution, terminology contribution, linguistic contribution]

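To make the reported formula concrete, the sketch below applies it to component scores. The coefficients are the ones stated on the slide; the function name is hypothetical, and treating S, T and L as values in [0, 1] is an assumption.

```python
# Coefficients as reported for the SimSem59 experiment.
W_SEMANTIC = 0.1531
W_TERMINOLOGICAL = 0.5218
W_LINGUISTIC = 0.1041
CONSTANT = 0.1791


def stl_score(s, t, l):
    """Combine semantic (s), terminological (t) and linguistic (l) scores
    using the linear formula reported for SimSem59."""
    return W_SEMANTIC * s + W_TERMINOLOGICAL * t + W_LINGUISTIC * l + CONSTANT
```

If the three component scores do lie in [0, 1], the combined score ranges from 0.1791 (all components 0) to 0.9581 (all components 1), and a change in T moves the result more than three times as much as the same change in S.
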
Conclusion
  • STL outperforms more traditional similarity measures
  • Largest contribution by T (terminological analysis)
  • Multilingual subterms perform better than monolingual subterms

Future work
  • Evaluation on larger data sets and vocabularies (IFRS): 3000+ terms, 9M term pairs
  • Richer set of linguistic operations: recognise => recognition by derivation rule verb_lemma+"ion"
  • Similarity between subterms: "Staff Costs" and "Wages And Salaries"
