Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data,"
WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009.
http://wiki.knoesis.org/index.php/Seminar_on_Data_Mining#Semantics_empowered_Understanding.2C_Analysis_and_Mining_of_Nontraditional_and_Unstructured_Data
The document discusses the Neuroscience Information Framework (NIF), which provides a portal for finding and utilizing web-based neuroscience resources. NIF allows simultaneous searching of multiple data sources through a concept-based interface organized by categories. It indexes over 35 million records from 65+ databases. NIF aims to address the challenges of dispersed and inconsistent neuroscience data by providing a common framework and tools to integrate data from various sources. Ontologies are discussed as a way to represent neuroscience concepts and relationships in a machine-readable way to facilitate data integration and querying across multiple scales and domains.
An expert knowledge base on human performance and cognition was created by extracting information from scientific literature using natural language processing and pattern-based techniques. Over 3 million facts were extracted from abstracts and mapped to a hierarchical structure derived from Wikipedia. The knowledge base was deployed through a browsing tool called Scooner that allows users to navigate relationships between concepts. Further work is focused on improving knowledge base quality by normalizing entities, filtering assertions, and integrating related ontologies and vocabularies.
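The pattern-based extraction described above can be sketched, very roughly, as matching lexico-syntactic patterns against sentences and emitting (subject, relation, object) facts. The patterns and example sentence below are invented for illustration, not the project's actual pattern set.

```python
import re

# Toy lexico-syntactic patterns in the spirit of pattern-based fact
# extraction; both patterns and the test sentence are illustrative only.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) improves (\w[\w ]*)"), "improves"),
    (re.compile(r"(\w[\w ]*?) is associated with (\w[\w ]*)"), "associated_with"),
]

def extract_facts(sentence):
    """Return (subject, relation, object) triples matched in a sentence."""
    facts = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(sentence):
            facts.append((m.group(1).strip(), relation, m.group(2).strip()))
    return facts

print(extract_facts("caffeine improves vigilance"))
# -> [('caffeine', 'improves', 'vigilance')]
```

Real systems would of course apply such patterns to parsed sentences and normalize the extracted entities before adding facts to the knowledge base.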
A knowledge capture framework for domain specific search systems (ramakanz)
This is the product roll-out presentation at the AFRL on creating a focused knowledge base, search, and retrieval system for the domain of human performance and cognition.
The document discusses several topics related to storing, indexing, and querying ontologies efficiently, including:
1) How to represent ontologies as graphs to allow for efficient querying over multiple interconnected ontologies and data sources.
2) The need for an associative query language and enhanced keyword model to query ontologies and integrated data through intention-based query reformulation.
3) Techniques for constructing ontologies by bootstrapping from seed ontologies or feature-derived ontologies.
The document discusses semantic search and how it can improve on traditional keyword-based search. It describes how semantic search can extend and refine search queries using ontologies and semantic metadata. This allows for more precise and complete search results. Semantic search also enables cross-referencing related information, exploratory search through semantic navigation, and reasoning over semantic data to infer implicit facts.
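The query-extension idea above can be illustrated with a toy sketch: expand a query term with synonyms and narrower terms drawn from an ontology. The tiny "ontology" below is invented for illustration, not a real vocabulary.

```python
# Minimal sketch of ontology-driven query expansion; the entries below
# (synonyms and subclasses) are invented stand-ins for a real ontology.
ONTOLOGY = {
    "heart attack": {"synonyms": ["myocardial infarction"],
                     "subclasses": ["STEMI", "NSTEMI"]},
}

def expand_query(term):
    """Return the original term plus ontology-derived equivalent and narrower terms."""
    entry = ONTOLOGY.get(term, {})
    return [term] + entry.get("synonyms", []) + entry.get("subclasses", [])

print(expand_query("heart attack"))
# -> ['heart attack', 'myocardial infarction', 'STEMI', 'NSTEMI']
```

Matching documents against the expanded term list is one simple way semantic search improves recall over plain keyword matching.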
The document summarizes Cartic Ramakrishnan's dissertation on extracting semantic metadata from text to facilitate knowledge discovery in biomedicine. It defines knowledge discovery as opportunistic search over an ill-defined space leading to surprising but useful knowledge. It discusses using ontologies and text mining to extract semantic relationships from unstructured text and represent them as structured semantic metadata to enable knowledge exploration and discovery. It presents preliminary work on automating some of Swanson's biomedical discoveries by extracting relationships between concepts from parsed sentences in publications.
Visualization Approaches for Biomedical Omics Data: Putting It All Together (Nils Gehlenborg)
Keynote Talk presented at the 1st Annual BiVi Community Annual Meeting (17 December 2014)
http://bivi.co/page/bivi-annual-meeting-16-17th-december-2014
Visualization Approaches for Biomedical Omics Data: Putting It All Together
The rapid proliferation of high quality, low cost genome-wide measurement technologies such as whole-genome and transcriptome sequencing, as well as advances in epigenomics and proteomics, are enabling researchers to perform studies that generate heterogeneous datasets for cohorts of thousands of individuals. A common feature of these studies is that a collection of genome-wide, molecular data types and phenotypic or clinical characterizations are available for each individual. These data can be used to identify the molecular basis of diseases and to characterize and describe the variations that are relevant for improved diagnosis, prognosis and targeted treatment of patients. An example for a study in which this approach has been successfully applied is The Cancer Genome Atlas project (http://cancergenome.nih.gov).
In my talk I will discuss how visualization approaches can be applied to enable exploration and support analysis of data generated by such studies. Specifically, I will review techniques and tools for visual exploration of individual omics data types, their ability to scale to large numbers of individuals or samples, and emerging techniques that integrate multiple omics data types for interactive visual analysis. I will also examine technical and legal challenges that developers of such visualization tools are facing. To conclude my talk, I will outline research opportunities for the biological data visualization community that address major challenges in this domain.
Why Watson Won: A cognitive perspective (James Hendler)
In this talk, we present how the Watson program, IBM's famous Jeopardy-playing computer, works (based on papers published by IBM), look at some aspects of potential scoring approaches, examine how Watson compares to several well-known systems, and offer some preliminary thoughts on using it in future artificial intelligence and cognitive science approaches.
The document discusses navigating the neuroscience data landscape. It notes that a grand challenge in neuroscience is to understand brain function across multiple scales of organization. Central to this effort is understanding "neural choreography" - the integrated functioning of neurons into brain circuits. The Neuroscience Information Framework (NIF) aims to facilitate discovery and utilization of web-based neuroscience resources. However, the neuroscience community has not fully exploited currently available data or prepared for forthcoming data.
This document discusses using natural language processing techniques to analyze scientific papers and extract structured knowledge. It describes analyzing papers to recognize named entities, parse syntactic dependencies and semantic arguments, resolve coreferences, and extract relations. This extracted information can be used to generate structured abstracts, find related papers, perform content-based search, and discover new facts. As an example, it outlines a project that aims to read research papers to assemble and reason over causal models in cancer biology.
Knowledge graph construction for research & medicine (Paul Groth)
1) Elsevier aims to build knowledge graphs to help address challenges in research and medicine like high drug development costs and medical errors.
2) Knowledge graphs link entities like people, concepts, and events to provide answers by going beyond traditional bibliographic descriptions.
3) Elsevier constructs knowledge graphs using techniques like information extraction from text, integrating data sources, and predictive modeling of large patient datasets to identify statistical correlations.
Knowledge Discovery And Data Mining Of Free Text Final (kdjamies)
This document discusses knowledge discovery and data mining of free text radiology reports. It outlines challenges with semantic indexing of medical text due to variations in terminology. An expert system called MEDAT is demonstrated that uses semantic parsing to represent sentences in a radiology report as predicate-argument structures mapped to medical concepts. While current systems can index about 60% of reports, fully automated semantic indexing remains a challenge due to implicit knowledge, phrasal synonyms, and representation of concepts not covered in existing ontologies. Further research is needed in rule-based semantic indexing and integrating statistical and rule-based approaches.
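The predicate-argument representation mentioned above can be sketched as a small frame structure. The frame name, argument labels, and example finding below are invented for illustration; they are not MEDAT's actual representation.

```python
# Hypothetical sketch of a predicate-argument frame for a radiology
# sentence like "nodular opacity in the right upper lobe".
def to_predicate_argument(finding, location):
    """Represent 'finding in location' as a predicate-argument frame."""
    return {
        "predicate": "located_in",
        "arguments": {"ARG0": finding, "ARG1": location},
    }

frame = to_predicate_argument("nodular opacity", "right upper lobe")
print(frame)
# -> {'predicate': 'located_in', 'arguments': {'ARG0': 'nodular opacity', 'ARG1': 'right upper lobe'}}
```

A real system would populate such frames from a semantic parse and then map each argument to a concept in a medical ontology for indexing.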
This document provides an introduction to bioinformatics. It defines bioinformatics as the analysis of large amounts of biological data, such as DNA sequences, using computer programs. It discusses how next-generation sequencing technologies are generating terabytes of nucleotide sequence data that are analyzed by automated computer programs. The document then provides examples of the types of biological data analyzed in bioinformatics, including DNA, RNA, and protein sequences and their interactions. It also discusses some common programming languages and analysis techniques used in bioinformatics.
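As a minimal example of the kind of automated sequence analysis such introductions describe, the sketch below computes GC content and transcribes a DNA coding strand to RNA (a toy illustration, not a production pipeline).

```python
def gc_content(dna):
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)

def transcribe(dna):
    """DNA coding strand -> RNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

seq = "ATGCGC"
print(round(gc_content(seq), 3))  # -> 0.667
print(transcribe(seq))            # -> AUGCGC
```

Real analyses run the same kinds of per-base computations over gigabases of reads, which is why they are fully automated.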
The document discusses different types of descriptive and exploratory research methods. Descriptive research aims to describe phenomena or characteristics of individuals or groups, while exploratory research focuses on relationships between factors. Both can be combined depending on the research question. Case studies provide in-depth descriptions of individuals or groups and can generate hypotheses. Developmental research describes changes over time using longitudinal or cross-sectional methods. Normative studies establish typical values for populations, while qualitative research seeks to understand experiences from individuals' perspectives. Exploratory research investigates relationships between variables using correlation and regression analysis.
Presented for TTI Vanguard "Shift Happens" conference (http://bit.ly/TTIVshifthappens) visit to PARC, this is an overview of technologies for making sense of diverse information -- and making decisions on it.
Semantic Data Normalization For Efficient Clinical Trial Research (Ontotext)
This document discusses semantic data normalization of clinical trial data to make it more structured and amenable to analysis. It describes converting unstructured clinical data like conditions, interventions, adverse events and eligibility criteria into RDF triples. The goal is to extract key phrases and concepts, identify qualifiers and relationships to formally represent the data. Examples show how condition texts, drug annotations and criteria can be modeled. Current work has normalized over 215,000 clinical studies from ClinicalTrials.gov into over 80 million RDF triples. The normalized data is pre-loaded in GraphDB and Ontotext S4 Cloud and can be explored and analyzed more easily.
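The conversion of a condition string into RDF-style triples can be sketched in a few lines. The base URI, property names, and study identifier below are invented for illustration; they are not Ontotext's actual data model.

```python
# Toy sketch: normalize one clinical-trial condition entry into
# (subject, predicate, object) triples. URIs are illustrative only.
BASE = "http://example.org/clinical/"

def normalize_condition(study_id, condition_text):
    """Emit RDF-style triples linking a study to a normalized condition concept."""
    study = BASE + "study/" + study_id
    concept = BASE + "condition/" + condition_text.lower().replace(" ", "_")
    return [
        (study, BASE + "hasCondition", concept),
        (concept, "http://www.w3.org/2000/01/rdf-schema#label", condition_text),
    ]

for triple in normalize_condition("NCT00000000", "Congenital Adrenal Hyperplasia"):
    print(triple)
```

Serializing such tuples as N-Triples and loading them into a triple store like GraphDB is what makes the normalized studies queryable with SPARQL.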
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around "human-data interaction": understanding and optimizing how people use and share quantitative information.
I'll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
Elsevier aims to construct knowledge graphs to help address challenges in research and medicine. Knowledge graphs link entities like people, concepts, and events to provide answers. Elsevier analyzes text and data to build knowledge graphs using techniques like information extraction, machine learning, and predictive modeling. Their knowledge graph integrates data from publications, clinical records, and other sources to power applications that help researchers, medical professionals, and patients. Knowledge graphs are a critical component for delivering value, especially as data volumes and needs accelerate.
The document discusses using word sense disambiguation (WSD) in concept identification for ontology construction. It describes implementing an approach that forms concepts from terms by meeting certain criteria, such as having an intensional definition and instances. WSD is needed to identify the sense of terms related to the domain when forming concepts. The Lesk algorithm is discussed as one method for WSD and concept disambiguation, involving calculating similarity between terms and WordNet senses. Evaluation shows the approach identified domain-specific concepts with reasonable precision and recall compared to other methods. Choosing the best WSD algorithm depends on factors like the problem nature and performance metrics.
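The gloss-overlap idea behind the Lesk algorithm can be sketched very simply: choose the sense whose dictionary gloss shares the most words with the term's context. The glosses below are toy stand-ins for WordNet entries.

```python
# Simplified Lesk sketch; toy glosses invented for illustration.
GLOSSES = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}

def simplified_lesk(context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())

    def overlap(sense):
        return len(context_words & set(GLOSSES[sense].split()))

    return max(GLOSSES, key=overlap)

print(simplified_lesk("he sat on the bank of the river watching the water"))
# -> bank/river
```

The full algorithm refines this with stop-word removal, stemming, and extended glosses from related senses, but the overlap score is the core idea.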
Applying machine learning techniques to big data in the scholarly domain (Angelo Salatino)
Slides of the Lecture at the 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
Guided visual exploration of patient stratifications in cancer genomics (Nils Gehlenborg)
Talk presented at the "Beyond the Genome 2014: Cancer Genomics" conference (10 October 2014)
http://www.beyond-the-genome.com/2014/
Cancer is a heterogeneous disease, and molecular profiling of tumors from large cohorts has enabled characterization of new tumor subtypes. This is a prerequisite for improving personalized treatment and ultimately better patient outcomes. Potential tumor subtypes can be identified with methods such as unsupervised clustering or network-based stratification, which assign patients to sets based on high-dimensional molecular profiles. Detailed characterization of identified sets and their interpretation, however, remain a time-consuming exploratory process.
To address these challenges, we have developed StratomeX (http://stratomex.caleydo.org), an interactive visualization tool that complements algorithmic approaches. StratomeX also integrates a computational framework for query-based guided exploration directly into the visualization, enabling discovery of novel relationships between patient sets and efficient generation and refinement of hypotheses about tumor subtypes. StratomeX enables analysts to efficiently compare multiple patient stratifications, to correlate patient sets with clinical information or genomic alterations, and to view the differences between molecular profiles across patient sets.
IBM Cognitive Seminar March 2015 WatsonSim Final (diannepatricia)
1. The document discusses using IBM Watson as a teaching tool for computer science concepts like information retrieval and natural language processing.
2. A group of students built a simplified Watson simulator called WatsonSim to learn these concepts, achieving an accuracy of 26.6% on Jeopardy questions.
3. The document proposes that further studying how and why Watson works could provide insights into developing more effective theories of natural language understanding and semantic processing.
This is session #4 of the 5-session online study series with Google Cloud, where we take you on a journey of learning generative AI. You'll explore the dynamic landscape of Generative AI, gaining both theoretical insights and practical know-how of Google Cloud GenAI tools such as Gemini, Vertex AI, AI agents, and Imagen 3.
UiPath Automation Developer Associate Training Series 2025 - Session 1 (DianaGray10)
Welcome to UiPath Automation Developer Associate Training Series 2025 - Session 1.
In this session, we will cover the following topics:
Introduction to RPA & UiPath Studio
Overview of RPA and its applications
Introduction to UiPath Studio
Variables & Data Types
Control Flows
You are requested to finish the following self-paced training for this session:
Variables, Constants and Arguments in Studio 2 modules - 1h 30m - https://academy.uipath.com/courses/variables-constants-and-arguments-in-studio
Control Flow in Studio 2 modules - 2h 15m - https://academy.uipath.com/courses/control-flow-in-studio
For any questions you may have, please use the dedicated Forum thread. You can tag the hosts and mentors directly and they will reply as soon as possible.
The Future of Repair: Transparent and Incremental by Botond De?nesScyllaDB
?
Regularly run repairs are essential to keep clusters healthy, yet having a good repair schedule is more challenging than it should be. Repairs often take a long time, preventing running them often. This has an impact on data consistency and also limits the usefulness of the new repair based tombstone garbage collection. We want to address these challenges by making repairs incremental and allowing for automatic repair scheduling, without relying on external tools.
[Webinar] Scaling Made Simple: Getting Started with No-Code Web AppsSafe Software
?
Ready to simplify workflow sharing across your organization without diving into complex coding? With FME Flow Apps, you can build no-code web apps that make your data work harder for you ¡ª fast.
In this webinar, we¡¯ll show you how to:
Build and deploy Workspace Apps to create an intuitive user interface for self-serve data processing and validation.
Automate processes using Automation Apps. Learn to create a no-code web app to kick off workflows tailored to your needs, trigger multiple workspaces and external actions, and use conditional filtering within automations to control your workflows.
Create a centralized portal with Gallery Apps to share a collection of no-code web apps across your organization.
Through real-world examples and practical demos, you¡¯ll learn how to transform your workflows into intuitive, self-serve solutions that empower your team and save you time. We can¡¯t wait to show you what¡¯s possible!
A Framework for Model-Driven Digital Twin EngineeringDaniel Lehner
?
ºÝºÝߣs from my PhD Defense at Johannes Kepler University, held on Janurary 10, 2025.
The full thesis is available here: https://epub.jku.at/urn/urn:nbn:at:at-ubl:1-83896
Just like life, our code must evolve to meet the demands of an ever-changing world. Adaptability is key in developing for the web, tablets, APIs, or serverless applications. Multi-runtime development is the future, and that future is dynamic. Enter BoxLang: Dynamic. Modular. Productive. (www.boxlang.io)
BoxLang transforms development with its dynamic design, enabling developers to write expressive, functional code effortlessly. Its modular architecture ensures flexibility, allowing easy integration into your existing ecosystems.
Interoperability at Its Core
BoxLang boasts 100% interoperability with Java, seamlessly blending traditional and modern development practices. This opens up new possibilities for innovation and collaboration.
Multi-Runtime Versatility
From a compact 6MB OS binary to running on our pure Java web server, CommandBox, Jakarta EE, AWS Lambda, Microsoft Functions, WebAssembly, Android, and more, BoxLang is designed to adapt to any runtime environment. BoxLang combines modern features from CFML, Node, Ruby, Kotlin, Java, and Clojure with the familiarity of Java bytecode compilation. This makes it the go-to language for developers looking to the future while building a solid foundation.
Empowering Creativity with IDE Tools
Unlock your creative potential with powerful IDE tools designed for BoxLang, offering an intuitive development experience that streamlines your workflow. Join us as we redefine JVM development and step into the era of BoxLang. Welcome to the future.
DevNexus - Building 10x Development Organizations.pdfJustin Reock
?
Developer Experience is Dead! Long Live Developer Experience!
In this keynote-style session, we¡¯ll take a detailed, granular look at the barriers to productivity developers face today and modern approaches for removing them. 10x developers may be a myth, but 10x organizations are very real, as proven by the influential study performed in the 1980s, ¡®The Coding War Games.¡¯
Right now, here in early 2025, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience. It seems that with every new method, we invent to deliver products, whether physical or virtual, we reinvent productivity philosophies to go alongside them.
But which of these approaches works? DORA? SPACE? DevEx? What should we invest in and create urgency behind today so we don¡¯t have the same discussion again in a decade?
UiPath Document Understanding - Generative AI and Active learning capabilitiesDianaGray10
?
This session focus on Generative AI features and Active learning modern experience with Document understanding.
Topics Covered:
Overview of Document Understanding
How Generative Annotation works?
What is Generative Classification?
How to use Generative Extraction activities?
What is Generative Validation?
How Active learning modern experience accelerate model training?
Q/A
? If you have any questions or feedback, please refer to the "Women in Automation 2025" dedicated Forum thread. You can find there extra details and updates.
Formal Methods: Whence and Whither? [Martin Fr?nzle Festkolloquium, 2025]Jonathan Bowen
?
Alan Turing arguably wrote the first paper on formal methods 75 years ago. Since then, there have been claims and counterclaims about formal methods. Tool development has been slow but aided by Moore¡¯s Law with the increasing power of computers. Although formal methods are not widespread in practical usage at a heavyweight level, their influence as crept into software engineering practice to the extent that they are no longer necessarily called formal methods in their use. In addition, in areas where safety and security are important, with the increasing use of computers in such applications, formal methods are a viable way to improve the reliability of such software-based systems. Their use in hardware where a mistake can be very costly is also important. This talk explores the journey of formal methods to the present day and speculates on future directions.
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog GavraScyllaDB
?
Learn how Responsive replaced embedded RocksDB with ScyllaDB in Kafka Streams, simplifying the architecture and unlocking massive availability and scale. The talk covers unbundling stream processors, key ScyllaDB features tested, and lessons learned from the transition.
Unlock AI Creativity: Image Generation with DALL¡¤EExpeed Software
?
Discover the power of AI image generation with DALL¡¤E, an advanced AI model that transforms text prompts into stunning, high-quality visuals. This presentation explores how artificial intelligence is revolutionizing digital creativity, from graphic design to content creation and marketing. Learn about the technology behind DALL¡¤E, its real-world applications, and how businesses can leverage AI-generated art for innovation. Whether you're a designer, developer, or marketer, this guide will help you unlock new creative possibilities with AI-driven image synthesis.
What Makes "Deep Research"? A Dive into AI AgentsZilliz
?
About this webinar:
Unless you live under a rock, you will have heard about OpenAI¡¯s release of Deep Research on Feb 2, 2025. This new product promises to revolutionize how we answer questions requiring the synthesis of large amounts of diverse information. But how does this technology work, and why is Deep Research a noticeable improvement over previous attempts? In this webinar, we will examine the concepts underpinning modern agents using our basic clone, Deep Searcher, as an example.
Topics covered:
Tool use
Structured output
Reflection
Reasoning models
Planning
Types of agentic memory
Future-Proof Your Career with AI OptionsDianaGray10
?
Learn about the difference between automation, AI and agentic and ways you can harness these to further your career. In this session you will learn:
Introduction to automation, AI, agentic
Trends in the marketplace
Take advantage of UiPath training and certification
In demand skills needed to strategically position yourself to stay ahead
? If you have any questions or feedback, please refer to the "Women in Automation 2025" dedicated Forum thread. You can find there extra details and updates.
2. Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data
WSU & AFRL Window-on-Science Seminar on Data Mining
Amit P. Sheth, LexisNexis Ohio Eminent Scholar; Director, Kno.e.sis Center, Wright State University; knoesis.org
Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis researchers
3. Data & Knowledge Ecosystem
From data to insight: search and browsing, integration, data mining, understanding and perception, analysis (e.g., patterns), knowledge discovery, decision support, and situational awareness.
Data: multimedia data; structured, semi-structured, and unstructured data; textual data (scientific literature, web pages, news, blogs, reports, wikis, forums, comments, tweets); experimental, observational, and transactional data.
4. Some examples of R&D we have done
Semantic search and ranking of stories and reports - "connecting the dots" applications (insider threat, financial risk analysis)
Mining of biomedical (scientific) literature (extraction of entities and relationships) - discovering hidden public knowledge
Semantic integration, analysis, and decision support over sensor data
Extracting taxonomies/domain models from Wikipedia
Discovering hidden relationships (insights) in community-created content (Wikipedia)
5. Understanding User-Generated Content (on Social Networking Sites)
What are people talking about? How do people write? Why do people write?
With application to artist popularity ranking.
8. [Figure: layered architecture]
Applications: search, integration, analysis, discovery, question answering, situational awareness.
Middle layers: domain models; patterns / inference / reasoning; RDB; Relationship Web; metadata / semantic annotations; metadata extraction.
Data: multimedia content and web data, text, sensor data, structured and semi-structured data.
11. What knowledge discovery is NOT
Search: keyword in, document out; the keywords are fully specified features of the expected outcome - like searching for prospective mining sites.
Mining: you know where to look, but only underspecified characteristics (patterns) of what is sought are available.
[Cartic Ramakrishnan]
12. What is knowledge discovery?
"Knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts." - James Caruthers
"Discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise." - James Buchanan
Opportunistic search over an ill-defined space, leading to surprising but useful emergent knowledge.
13. Element of surprise - Swanson's discoveries
Swanson found implicit associations between migraine and magnesium through intermediate concepts such as stress, calcium channel blockers, and spreading cortical depression; 11 possible associations were found.
The associations were discovered via keyword searches against PubMed, followed by manual analysis of the text to establish possibly relevant relationships.
14. Knowledge discovery over text
Assigning interpretation to text: extraction of semantics from text yields semantic metadata in the form of semi-structured data.
Semantic-metadata-guided knowledge exploration and discovery: triple-based semantic search, semantic browser, subgraph discovery.
15. Information extraction via ontology-assisted text mining - relationship extraction
[Figure: the UMLS Semantic Network supplies schema-level relationships (a Biologically Active Substance causes/affects/complicates a Disease or Syndrome); MeSH supplies instances (e.g., Fish Oils as a Lipid, Raynaud's Disease as a Disease or Syndrome); PubMed supplies the evidence, with retrieved document sets of 9284, 4733, and 5 documents. The question marks mark the instance-level relationship to be discovered.]
16. Background knowledge and data used
UMLS - a high-level schema of the biomedical domain: 136 classes and 49 relationships; synonyms of all relationships were obtained using variant lookup (tools from NLM); the 49 relationships plus their synonyms yield roughly 350 verbs.
MeSH - 22,000+ topics organized as a forest of 16 trees; used to query PubMed.
PubMed - over 16 million abstracts, each annotated with one or more MeSH terms.
17. Method - parse sentences in PubMed
SS-Tagger and SS-Parser (University of Tokyo).
Entities (MeSH terms) in sentences occur in modified forms.
20. Entities can also occur as composites of 2 or more other entities
21. "adenomatous hyperplasia" and "endometrium" occur as "adenomatous hyperplasia of the endometrium":
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous)) (NN stimulation)) (PP (IN by) (NP (NN estrogen)))) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia)) (PP (IN of) (NP (DT the) (NN endometrium)))))))
22. Method - identify entities and relationships in the parse tree
[Figure: the parse tree of the example sentence, annotated with modifiers, modified entities, and composite entities; the relationship "induces" is mapped to UMLS ID T147, and the entities are mapped to MeSH IDs D004967, D006965, and D004717.]
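As an illustrative sketch (not the authors' implementation), the composite-entity pattern from the example above (a noun phrase made of a head NP plus an "of" prepositional phrase) can be matched over a parse tree represented as nested lists:

```python
def leaves(tree):
    """Collect the terminal words of a nested-list parse tree."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def find_of_composites(tree, found=None):
    """Match NP nodes shaped (NP (NP ...) (PP (IN of) (NP ...)))."""
    if found is None:
        found = []
    if isinstance(tree, str):
        return found
    children = tree[1:]
    if tree[0] == "NP" and len(children) == 2:
        head, pp = children
        if (not isinstance(head, str) and head[0] == "NP"
                and not isinstance(pp, str) and pp[0] == "PP"
                and len(pp) == 3 and pp[1] == ["IN", "of"]):
            found.append((" ".join(leaves(head)), " ".join(leaves(pp[2]))))
    for child in children:
        find_of_composites(child, found)
    return found

# the composite NP from the example sentence
tree = ["NP",
        ["NP", ["JJ", "adenomatous"], ["NN", "hyperplasia"]],
        ["PP", ["IN", "of"], ["NP", ["DT", "the"], ["NN", "endometrium"]]]]
composites = find_of_composites(tree)
# [("adenomatous hyperplasia", "the endometrium")]
```

A real system would additionally map each recovered constituent to its MeSH/UMLS identifier; this sketch only shows the structural rule.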
31. "magnesium can suppress platelet aggregability"
Data sets were generated using these entities (marked in red above) as Boolean keyword queries against PubMed; bidirectional breadth-first search was used to find paths in the resulting RDF.
32. Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifier.
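A hedged sketch of this step: the snippet below enumerates bounded-length paths in a small edge-labelled graph and applies the interestingness filter just described. The toy graph, its relationship names, and the plain (rather than bidirectional) breadth-first search are illustrative assumptions, not the deployed system:

```python
from collections import deque

STRUCTURAL = {"hasPart", "hasModifier"}   # relationships that alone are uninteresting

def find_paths(graph, start, goal, max_hops=4):
    """graph: {node: [(label, neighbour), ...]}; yields simple label/node paths."""
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            yield path
            continue
        if len(path) >= max_hops:
            continue
        for label, nxt in graph.get(node, []):
            # avoid revisiting nodes already on this path
            if nxt != start and all(n != nxt for _, n in path):
                queue.append((nxt, path + [(label, nxt)]))

def interesting(path):
    """A path is interesting if any edge carries a named relationship."""
    return any(label not in STRUCTURAL for label, _ in path)

# toy graph inspired by the Migraine-Magnesium example
g = {
    "Migraine": [("isTreatedBy", "Calcium Channel Blockers"),
                 ("hasPart", "Aura")],
    "Calcium Channel Blockers": [("interactsWith", "Magnesium")],
    "Aura": [("hasModifier", "Magnesium")],
}
hits = [p for p in find_paths(g, "Migraine", "Magnesium") if interesting(p)]
# keeps only the path through "Calcium Channel Blockers"
```

The filter discards the path whose edges are all hasPart/hasModifier, matching the interestingness criterion on the slide.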
33. An example of such a path
Conclusion: rules over parse trees are able to extract structure from sentences.
34. Our definitions of compound and modified entities are critical for identifying both implicit and explicit relationships.
36. Unsupervised joint extraction of compound entities and relationships
Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth, "Unsupervised Discovery of Compound Entities for Relationship Extraction," EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management (Knowledge Patterns).
56. A powerful new era in information dissemination has taken firm ground.
57. Making it possible for us to create a global network of citizens: Citizen Sensors - citizens observing, processing, transmitting, and reporting.
58. Spatio-temporal analysis pipeline
Geocoder (reverse geocoding): address-to-location database (e.g., "18 Hormusji Street, Colaba"; "Vasant Vihar").
Image metadata: latitude 18° 54′ 59.46″ N, longitude 72° 49′ 39.65″ E.
Structured metadata extraction: identify and extract information from tweets (e.g., "Nariman House", "Income Tax Office").
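The coordinates in image metadata arrive in degrees/minutes/seconds and must be converted to decimal degrees before spatial querying. A minimal helper (the function name is ours, but the arithmetic is standard):

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to decimal degrees (S and W negative)."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere in ("S", "W") else value

# the coordinates from the image metadata above
lat = dms_to_decimal(18, 54, 59.46, "N")   # 18.916517 to six decimals
lon = dms_to_decimal(72, 49, 39.65, "E")   # 72.827681 to six decimals
```

These agree (to rounding) with the decimal coordinates 18.916517° N, 72.827682° E used in the query on the next slide.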
59. Research Challenge #1: spatio-temporal and thematic analysis
What else happened "near" this event location? What events occurred "before" and "after" this event? Any messages about "causes" of this event?
62. Giving us: tweets that originated from an address near 18.916517° N, 72.827682° E during the interval 27 Nov 2008, 11 PM to midnight.
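A minimal sketch of such a spatio-temporal selection, with hypothetical tweet records and an equirectangular distance approximation (adequate at city scale; the real system's storage and indexing are not described here):

```python
from datetime import datetime
from math import cos, radians, sqrt

def km_between(lat1, lon1, lat2, lon2):
    # equirectangular approximation: fine for city-scale distances
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return 6371.0 * sqrt(x * x + y * y)

def near_event(tweets, lat, lon, radius_km, start, end):
    """Keep tweets inside the time window and within radius_km of (lat, lon)."""
    return [t for t in tweets
            if start <= t["time"] <= end
            and km_between(lat, lon, t["lat"], t["lon"]) <= radius_km]

window_start = datetime(2008, 11, 27, 23, 0)
window_end = datetime(2008, 11, 28, 0, 0)
tweets = [
    {"lat": 18.9170, "lon": 72.8280, "time": datetime(2008, 11, 27, 23, 30)},
    {"lat": 28.6139, "lon": 77.2090, "time": datetime(2008, 11, 27, 23, 30)},
]
nearby = near_event(tweets, 18.916517, 72.827682, 1.0, window_start, window_end)
# only the first (Mumbai) tweet survives; the second is in Delhi
```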
63. Research Challenge #2: understanding and analyzing casual text
Microblogs are often written in SMS-style language, with slang and abbreviations.
64. Understanding casual text
Not the same as news articles or scientific literature: grammatical errors (with implications for NL parser results) and inconsistent writing style (with implications for learning algorithms that generalize from a corpus).
65. Nature of microblogs
Additional constraint of limited context: a maximum of x characters per microblog, with context often provided by the discourse. Entity identification and disambiguation are prerequisites for more sophisticated information analytics.
66. NL understanding is hard to begin with
Not so hard: "commando raid appears to be nigh at Oberoi now" (Oberoi = Oberoi Hotel; nigh = near).
Challenging: "new wing, live fire @ taj 2nd floor on iDesi TV stream" (fire on the second floor of the Taj hotel, not on iDesi TV).
67. Research opportunities
NER and disambiguation in casual, informal text is a budding area of research. Another important area of focus: combining information of varied quality from a corpus (statistical NLP), domain knowledge (tags, folksonomies, taxonomies, ontologies), and social context (explicit and implicit communities).
68. Social context surrounding content
The social context in which a message appears is also a valuable added resource.
Post 1: "Hareemane House hostages said by eyewitnesses to be Jews. 7 gunshots heard by reporters at Taj."
Follow-up post: "that is Nariman House, not Hareemane."
69. Understanding content in informal text
I say: "Your music is wicked." What I really mean: "Your music is good."
70. Urban Dictionary
Informal text (social network chatter): "Your smile rocks Lil".
Urban Dictionary: the sentiment expression "rocks" transliterates to "cool", "good".
MusicBrainz taxonomy: "Smile" is a track; "Lil" transliterates to Lilly Allen; Lilly Allen is an artist.
Resulting semantic metadata: artist = Lilly Allen, track = "Smile" - analogous to semantic metadata over structured text (biomedical literature), multimedia content and web data, and web services.
71. Example: pulse of a community
Imagine millions of such informal opinions - individual expressions become mass opinions. "Popular artists" lists derived from MySpace comments: Lilly Allen, Lady Sovereign, Amy Winehouse, Gorillaz, Coldplay, Placebo, Sting, Keane, Joss Stone.
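A toy sketch of how such individual expressions could be aggregated into a popularity signal. The four-word slang lexicon stands in for a resource like Urban Dictionary, and the comments are invented; a real system would also need entity resolution ("Lil" to Lilly Allen) against a taxonomy like MusicBrainz:

```python
from collections import Counter

SLANG_POSITIVE = {"rocks", "wicked", "sick", "dope"}   # stand-in slang lexicon

def positive_mentions(comments):
    """comments: (artist, text) pairs -> Counter of positive slang mentions."""
    counts = Counter()
    for artist, text in comments:
        words = {w.strip(".,!?").lower() for w in text.split()}
        if words & SLANG_POSITIVE:   # any positive slang term present
            counts[artist] += 1
    return counts

comments = [
    ("Lilly Allen", "Your smile rocks Lil"),
    ("Lilly Allen", "wicked track!"),
    ("Coldplay", "saw them live yesterday"),
]
popularity = positive_mentions(comments).most_common()
# [("Lilly Allen", 2)] -- Coldplay's comment carries no sentiment term
```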
72. What drives the spatio-temporal-thematic analysis and casual text understanding?
Semantics, with the help of domain models (ontologies, folksonomies).
73. Domain knowledge: a key driver
Places that are near "Nariman House" - spatial query. Messages that originated around this place - temporal analysis. Messages about related events/places - thematic analysis.
74. Research Challenge #3: where does the domain knowledge come from?
Expert- and committee-based ontology creation works in some domains (e.g., biomedicine, health care, ...). Community-driven knowledge extraction: how to create models that are "socially scalable"? How to organically grow and maintain such a model?
77. Games with a purpose
Get humans to give their solitaire time to solving really hard computational problems: image tagging, identifying parts of an image. Examples include Tag a Tune, Squigl, Verbosity, and Matchin; pioneered by Luis von Ahn.
83. Semantic SensorML - adding ontological metadata
Domain ontology (person, company); spatial ontology (coordinates, coordinate system); temporal ontology (time units, timezone).
[Based on: Mike Botts, "SensorML and Sensor Web Enablement," Earth System Science Center, UAB Huntsville]
84. Semantic temporal query
Model references from SensorML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries. Supported semantic query operators include:
contains: the user-specified interval falls wholly within a sensor reading interval (also called inside)
within: the sensor reading interval falls wholly within the user-specified interval (inverse of contains/inside)
overlaps: the user-specified interval overlaps the sensor reading interval
[Example: SPARQL query defining the temporal operator "within".]
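The three operators above have precise interval semantics. Restated in plain Python (the deployed system expressed them in SPARQL over OWL-Time annotations, which is not reproduced here):

```python
# Intervals are (start, end) pairs with start <= end; any ordered type works.

def contains(query, reading):
    """User-specified interval falls wholly within the sensor reading interval."""
    return reading[0] <= query[0] and query[1] <= reading[1]

def within(query, reading):
    """Sensor reading interval falls wholly within the user-specified interval."""
    return query[0] <= reading[0] and reading[1] <= query[1]

def overlaps(query, reading):
    """The user-specified interval overlaps the sensor reading interval."""
    return query[0] <= reading[1] and reading[0] <= query[1]

checks = (contains((2, 3), (1, 5)),   # query inside reading
          within((0, 10), (1, 5)),    # reading inside query
          overlaps((4, 8), (1, 5)))   # partial overlap
# all three checks are True
```

Note that within is the inverse of contains (arguments swapped), matching the definition on the slide.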
92. Extracting social signals
What are the important topics of discussion and concern in different parts of the world on a particular day? How are different cultures or countries reacting to the same event or situation (e.g., the Mumbai attack)? How is a situation such as the financial crisis evolving over time in terms of key topics of discussion and issues of concern (e.g., subprime mortgages and foreclosures, followed by troubled banks and a credit freeze, followed by massive government intervention and borrowing, and so on)?
Twitris demo.
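One plausible (hypothetical, not the Twitris implementation, which is not detailed here) way to surface such per-region, per-day topics is to rank each slice's terms by how over-represented they are relative to the whole corpus:

```python
from collections import Counter, defaultdict

def top_terms_by_slice(posts, k=3):
    """posts: (region, day, text) triples -> {(region, day): top-k terms}."""
    slices = defaultdict(Counter)
    background = Counter()
    for region, day, text in posts:
        # crude tokenizer: lowercase, strip punctuation, drop short words
        words = [w.strip(".,!?").lower() for w in text.split() if len(w) > 3]
        slices[(region, day)].update(words)
        background.update(words)
    # score each term by its share of corpus-wide occurrences in this slice
    return {key: sorted(counts, key=lambda w: counts[w] / background[w],
                        reverse=True)[:k]
            for key, counts in slices.items()}

posts = [
    ("Mumbai", "2008-11-27", "terror attack near Taj hotel"),
    ("Mumbai", "2008-11-27", "fire reported at Taj hotel"),
    ("New York", "2008-11-27", "bailout vote dominates the news"),
]
hot_topics = top_terms_by_slice(posts)
```

Real systems would replace the ratio score with TF-IDF or a proper statistical test, and add stopword and entity handling; the sketch only shows the slice-versus-background idea.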
93. A few more things
Use of background knowledge: event extraction from text; time and location extraction (such information may not be present - someone in Washington, DC can tweet about Mumbai).
Scalable semantic analytics: subgraph and pattern discovery; meaningful subgraphs such as relevant and interesting paths; ranking paths.
94. The sum of the parts
Spatio-temporal analysis: find out where and when
+ Thematic: what and how
+ Semantic extraction from text, multimedia, and sensor data: tags, time, location, concepts, events
+ Semantic models and background knowledge: making better sense of STT
+ Integration + Semantic Sensor Web
The platform = situational awareness.
95. Kno.e.sis as a case study of a world-class, research-based higher-education environment
http://knoesis.org
96. Kno.e.sis Center labs (3rd floor, Joshi)
Amit Sheth: Semantic Science Lab, Semantic Web Lab, Service Research Lab
T. K. Prasad: Metadata and Languages Lab
Shaojun Wang: Statistical Machine Learning
Pascal Hitzler: Formal Semantics & Reasoning Lab
Michael Raymer: Bioinformatics Lab
Guozhu Dong: Data Mining Lab
Keke Chen: Data-Intensive Analysis and Computing Lab
Kno.e.sis members - a subset.
97. Exceptional students
Six of the senior PhD students: 84 papers, 43 program committees, and contributions to winning NIH and NSF grants. One graduate successfully competed with two Stanford PhDs and earned 1000+ citations within 2 years of graduation.
"BTW, Meena is an absolute find. If all of your other students are as talented, you are very lucky. ... I'd definitely like to work with more interns of her caliber, ..." [Dr. Kevin Haas, Director of Search at Yahoo!]
"It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Kno.e.sis Center." [Dr. Alpers Caglayan - looking to hire Kno.e.sis grads]
98. Funding, collaboration, etc.
Collaborators: UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo!
Funders: NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google (70% federal, 19% state, 11% industry).
Students intern at the best industry and national labs; graduates are very successful.
99. Interested in more background?
Semantics-empowered social computing; Semantic Sensor Web; traveling the Semantic Web through space, theme, and time; Relationship Web: blazing semantic trails between web resources; text mining, workflow management, Semantic Web services, and cloud computing, with applications to healthcare, biomedicine, defense/intelligence, and energy.
Contact/more details: amit @ knoesis.org
Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas.
Partial funding: NSF (Semantic Discovery: IIS-071441; Spatio-Temporal-Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (analysis of social media content), and HP Research (knowledge extraction from community-generated content).
Editor's Notes
#51: Microblogs are one of the most powerful ways of communicating citizen sensor data (CSD).
#54: Implicit social context is created by people responding to other messages. In this example we are showing how the system can identify that it is Nariman and not Hareemane.
#59: In the scenario, what techniques and technologies are being brought together? Semantic + social computing + mobile web.
#64: Users are shown two images along with labels. Labels are obtained from GI or a similar data source. Users add relationships; when two users agree, the labels are tagged with that relationship. From multiple relationships, using ML techniques, the system will learn.