STL is a similarity measure that combines semantic, terminological, and linguistic information using linear regression. It outperforms traditional similarity measures in evaluations. Terminological analysis makes the largest contribution to STL scores. Future work will evaluate STL on larger datasets and incorporate richer linguistic operations and similarity between subterms.
STL: A similarity measure based on semantic and linguistic information
1. STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information. Nitish Aggarwal, joint work with Tobias Wunner and Mihael Arcan. DERI, NUI Galway. firstname.lastname@deri.org. Friday, 19th Aug, 2011. DERI, Friday Meeting.
4. Motivation & Applications. Semantic search: similarity between query and index object. Examples: "SAP liquid asset in 2010"; "Current asset of SAP last year"; dbpedia:SAP_AG xEBR:liquid asset at dbpedia:year:2010; "Net cash of SAP in 2010"; "SAP total amount received in 2010".
8. STL. Definition: a linear combination of semantic, terminological and linguistic similarity, with weights obtained by linear regression. Formula used: STL = w1*S + w2*T + w3*L + Constant, where w1, w2, w3 represent the contribution of each component.
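The regression step can be sketched as an ordinary least-squares fit. This is a minimal illustration, not the authors' implementation: the function name `fit_stl_weights` is hypothetical, and it assumes that each training pair already has S, T and L scores plus a gold similarity score.

```python
def fit_stl_weights(features, targets):
    """Ordinary least squares for y ~ w1*S + w2*T + w3*L + c.

    features: list of (S, T, L) tuples, one per term pair.
    targets:  list of gold similarity scores, same length.
    Returns [w1, w2, w3, c]. Hypothetical helper, for illustration only.
    """
    # Design matrix with a trailing 1.0 column for the constant term.
    rows = [list(f) + [1.0] for f in features]
    n = len(rows[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; weights are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]
```

Called with a list of (S, T, L) triples and the matching human ratings, it returns the three weights followed by the constant. In practice a statistics package would normally do this fit; the hand-rolled solver only keeps the sketch dependency-free.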
9. Semantic. Wu-Palmer: 2*depth(MSCA) / (depth(c1) + depth(c2)). Resnik's Information Content: IC(c) = -log p(c). Intrinsic Information Content (Pirro09): avoids the analysis of large corpora.
11. Cont. Figure (concept hierarchy). Annotations: MSCA, subconcepts = 48, IC (TFA) = 0.32; Amount Receivable, subconcepts = 6, IC (AR) = 0.69; Tangible Fixed Assets, subconcepts = 9, IC (TFA) = 0.60; Pirro_Sim = 0.33; Pirro_Sim = ?. Nodes: Assets; Subscribed Capital Unpaid; Fixed Assets; Current Assets; Stocks; Tangible Fixed Assets; Amount Receivable; Amount Receivable [total]; Amount Receivable within one year; Amount Receivable after more than one year; Other Tangible Fixed Assets; Property, Plant and Equipment; Payments on account and asset in construction; Furniture Fixture and Equipment; Trade Debtors; Other Fixture; Land and Building; Other Debtors; Plant and Machinery; Other Property, Plant and Equipment; Property, Plant and Equipment [Total].
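The semantic measures named on the slides can be sketched on a toy taxonomy. Everything below is an illustration under stated assumptions: the `PARENT` mapping is a hypothetical fragment that only borrows concept names from the figure, depth is counted in nodes (root = 1), and `intrinsic_ic` uses a common intrinsic formulation, 1 - log(subconcepts + 1) / log(total concepts), which may differ from the exact variant used in the talk.

```python
import math

# Hypothetical child -> parent fragment; not the real hierarchy from the figure.
PARENT = {
    "Fixed Assets": "Assets",
    "Current Assets": "Assets",
    "Tangible Fixed Assets": "Fixed Assets",
    "Amount Receivable": "Current Assets",
    "Stocks": "Current Assets",
}


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(ancestors(concept))


def msca(c1, c2):
    """Most specific common ancestor: the first ancestor of c1 that subsumes c2."""
    seen = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in seen)


def wu_palmer(c1, c2):
    """Wu-Palmer similarity: 2*depth(MSCA) / (depth(c1) + depth(c2))."""
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))


def resnik_ic(probability):
    """Corpus-based information content: IC(c) = -log p(c)."""
    return -math.log(probability)


def intrinsic_ic(num_subconcepts, total_concepts):
    """Intrinsic IC from taxonomy structure alone (assumed formulation)."""
    return 1 - math.log(num_subconcepts + 1) / math.log(total_concepts)
```

In this toy fragment, "Stocks" and "Amount Receivable" share the parent "Current Assets" and therefore score higher under Wu-Palmer than "Tangible Fixed Assets" and "Amount Receivable", whose only common ancestor is the root.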
12. Limitation. Does semantic structure reflect a good similarity? Not necessarily: e.g. in xEBR, the parent-child relation is used for describing the layout of concepts. "Work in progress" is not a type of asset, although both are linked via the parent-child relationship.
13. Terminology. Definition: common naming convention. N-gram vs subterms: in the financial domain, the bigram "Intangible Fixed" is a substring of "Other Intangible Fixed Assets" but not a subterm. Terminological similarity: maximal subterm overlap.
14. Cont. Trade Debts Payable After More Than One Year: [[Trade][Debts]][Payable][After More Than One Year]. [SAP:Payable] [Ifrs:After More Than One Year] [Investoword:Debt] [FinanceDict:Trade Debts] [Investopedia:Trade]. Financial Debts Payable After More Than One Year: Financial [Debts][Payable][After More Than One Year].
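The subterm idea can be sketched as follows. The slides do not give the exact scoring function, so this is an assumption-laden illustration: `segment` uses greedy longest-match against a hypothetical subterm lexicon, and `subterm_overlap` scores the shared subterms with a Dice coefficient, which stands in for "maximal subterm overlap".

```python
def segment(term, lexicon):
    """Greedy longest-match split of a term into known subterms.

    Words not covered by any lexicon entry become single-word subterms.
    The lexicon is a set of lower-cased subterms (hypothetical resource).
    """
    words = term.lower().split()
    subterms, i = [], 0
    while i < len(words):
        # Try the longest span first; fall back to a single word.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in lexicon or j == i + 1:
                subterms.append(candidate)
                i = j
                break
    return subterms


def subterm_overlap(term_a, term_b, lexicon):
    """Dice coefficient over the two subterm sets (an assumed scoring choice)."""
    a = set(segment(term_a, lexicon))
    b = set(segment(term_b, lexicon))
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Illustrative lexicon built from the subterms shown on the slide.
LEXICON = {"trade debts", "debts", "payable", "after more than one year", "financial"}
```

With this lexicon, the two example terms share "payable" and "after more than one year" but differ in "trade debts" versus "financial" and "debts". Because matching happens on whole subterms, a bare word n-gram such as "intangible fixed" would not count as a match inside "Other Intangible Fixed Assets" when the lexicon only lists "intangible fixed assets".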
15. Multilingual Subterms. Translated subterms: available in other languages. Advantage: reflect terminological similarities that may be available in one language but not in others. Example: Property Plant and Equipment@en, Sachanlagen@de, Tangible Fixed Asset@en.
16. Linguistic. Syntactic information: beyond simple word order (phrase structure, dependency structure). Phrase structure: Intangible fixed : adj adj > ??; Intangible fixed assets : adj adj n > NP. Dependency structure: Amounts receivable : N Adv : receive:mod, amounts:head; Received amounts : V N : receive:mod, amounts:head.
17. Evaluation. Data set: xEBR finance vocabulary, 269 terms (concept labels), 72,361 (269*269) term pairs. Benchmarks: SimSem59, a sample of 59 term pairs; SimSem200, a sample of 200 term pairs (under construction).
19. Experiment Results (SimSem59). STL formula used: STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791. Correlation between similarity scores and SimSem59 (chart): semantic contribution, terminology contribution, linguistic contribution.
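The fitted formula can be applied directly. The sketch below uses the weights reported on this slide; the function names are hypothetical, the component scores are assumed to lie in [0, 1], and Pearson correlation is an assumption, since the slide does not say which correlation coefficient was used.

```python
import math

# Weights reported on the slide for the SimSem59 benchmark.
W_S, W_T, W_L, CONSTANT = 0.1531, 0.5218, 0.1041, 0.1791


def stl_score(s, t, l):
    """Combine semantic (s), terminological (t) and linguistic (l) scores."""
    return W_S * s + W_T * t + W_L * l + CONSTANT


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Under the [0, 1] assumption the score ranges from 0.1791 (all components zero) to 0.9581 (all components one), and a unit change in T moves the score roughly three to five times as much as the same change in S or L.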
20. Conclusion. STL outperforms more traditional similarity measures. Largest contribution by T (terminological analysis). Multilingual subterms perform better than monolingual ones.
21. Future work. Evaluation on larger data sets and vocabularies (IFRS): 3000+ terms, 9M term pairs. Richer set of linguistic operations: recognise => recognition by derivation rule verb_lemma+"ion". Similarity between subterms: "Staff Costs" and "Wages And Salaries".