This document discusses semantic embedding techniques for summarizing and analyzing text documents. It describes applying word embeddings to context exploration, topic delineation through document clustering, information retrieval, and concept drift analysis. Word embedding approaches like Word2Vec, GloVe, and Ariadne project words into continuous vector spaces where semantic similarity is represented by vector proximity. These techniques were shown to help retrieve related documents and detect shifts in subject matter over time in the Medline database, demonstrating their utility for semantic analysis of texts.
Our journey with semantic embedding
1. Our journey with semantic embedding
Rob Koopman, Shenghui Wang
OCLC
Fourth Annual KnowEscape Conference, 22-24 Feb 2017
2. Agenda
What is semantic embedding
Applications:
Context explorer
Topic delineation
Information retrieval
Concept drift
3. An example by Stefan Evert: what's the meaning of bardiwac?
He handed her her glass of bardiwac.
Beef dishes are made to complement the bardiwacs.
Nigel staggered to his feet, face flushed from too much bardiwac.
Malbec, one of the lesser-known bardiwac grapes, responds well to
Australia's sunshine.
I dined on bread and cheese and this excellent bardiwac.
The drinks were delicious: blood-red bardiwac as well as light, sweet
Rhenish.
bardiwac is a heavy red alcoholic beverage made from grapes
4. How can we calculate the similarity/relatedness?
Discrete encoding does not help to automatically process
the underlying semantics
Statistical Semantics [furnas1983, weaver1955], based on
the assumption that a word is characterized by the
company it keeps [firth1957]
Distributional Hypothesis [harris1954, sahlgren2008]:
words that occur in similar contexts tend to have similar
meanings.
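As a rough illustration of the distributional hypothesis, the sketch below counts the words that occur near a target word. The function name `context_counts`, the window size and the simple tokenizer are assumptions made for this example; they do not come from the slides.

```python
import re
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # Collect the neighbours, skipping the target position itself.
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts

# Two of the example sentences from the bardiwac slide.
sentences = [
    "He handed her her glass of bardiwac.",
    "I dined on bread and cheese and this excellent bardiwac.",
]
counts = context_counts(sentences, "bardiwac")
```

Words with similar context counts (for instance "bardiwac" and "wine") would then be candidates for having similar meanings.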
5. Let's embed words in a vector space
Words are represented in a continuous vector space
where semantically similar words are mapped to nearby
points ('are embedded nearby each other').
A desirable property: cosine similarity
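A minimal sketch of cosine similarity follows. The three toy vectors are invented for illustration and are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction, 0.0 orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings.
wine = [0.9, 0.1, 0.3]
bardiwac = [0.8, 0.2, 0.4]
bicycle = [0.1, 0.9, 0.0]

sim_close = cosine_similarity(wine, bardiwac)  # nearby points, high similarity
sim_far = cosine_similarity(wine, bicycle)     # distant points, low similarity
```

Because the measure depends only on direction, it is insensitive to vector length, which is convenient when frequent words produce longer vectors.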
6. What can we do with the similarity?
Context explorer
Topic delineation
Information retrieval
Concept drift
10. Topic delineation based on clustering
Generate vectors for entities
Generate vectors for articles based on weighted average
of entity vectors
Use standard clustering methods to cluster articles
In the end, this approach proved remarkably
compatible with methods based on citation networks.
Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. In J. Gläser, A.
Scharnhorst, & W. Glänzel (Eds.), Same data – different results? (pp. 234–556). Towards a comparative
approach to the identification of thematic structures in science. Special Issue of Scientometrics
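The three steps above can be sketched as follows. This is a toy version under stated assumptions: the entity vectors, the weights and the plain k-means with a fixed seeding rule are all hypothetical stand-ins, not the method or data used in the cited paper.

```python
import math

def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dim)]

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-d entity vectors and per-article entity weights.
entity_vectors = {
    "diabetes": [1.0, 0.0], "glycation": [0.9, 0.1],
    "galaxy": [0.0, 1.0], "telescope": [0.1, 0.9],
}
articles = [
    {"diabetes": 2.0, "glycation": 1.0},
    {"galaxy": 1.0, "telescope": 3.0},
    {"glycation": 2.0, "diabetes": 1.0},
    {"telescope": 1.0, "galaxy": 1.0},
]
article_vectors = [
    weighted_average([entity_vectors[e] for e in art], list(art.values()))
    for art in articles
]
labels = kmeans(article_vectors, k=2)
```

Any standard clustering method could replace `kmeans` here; the point is only that articles become comparable once they live in the same vector space as the entities.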
15. 1. 2014 glycated nail proteins a new approach for detecting diabetes in
developing countries
2. 2015 glycation of nail proteins from basic biochemical findings to a
representative marker for diabetic glycation associated target organ damage
3. 2005 glycation products as markers and predictors of the progression of
diabetic complications
4. 2015 glycated nail proteins as a new biomarker in management of the
south kivu congolese diabetics
5. 2005 advanced glycosylation end products in skin serum saliva and urine and
its association with complications of patients with type 2 diabetes mellitus
6. 1993 review of diabetes identification of markers for early detection glycemic
control and monitoring clinical complications
7. 2012 glycation and biomarkers of vascular complications of diabetes
8. 2005 the nail under fungal siege in patients with type ii diabetes mellitus
9. 2003 improvement in quality of diabetes control and concentrations of age
products in patients with type 1 and insulin treated type 2 diabetes mellitus
studied over a period of 10 years jevin
10. 2005 a novel advanced glycation index and its association with diabetes and
microangiopathy
18. Word embedding techniques
Two main categories of approaches:
global co-occurrence count-based methods, such as
Latent Semantic Analysis and Random Projection, which
suffer in word analogy tasks
local context predictive methods, such as neural
probabilistic language models, which do not leverage
global statistics
19. Word embedding techniques
Ariadne (OCLC): based on Random Projection of the
global co-occurrence matrix
Word2Vec (Google): shallow, two-layer neural networks
that are trained to reconstruct linguistic contexts of words
GloVe (Stanford): a global log-bilinear regression model to
learn word vectors based on the ratio of the co-occurrence
probabilities of two words
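To make the count-based family more concrete, here is a minimal sketch of random projection applied to sparse co-occurrence counts. It is not Ariadne's actual implementation: the function name, the use of random +1/-1 vectors, the dimensionality and the counts are assumptions chosen for illustration.

```python
import random

def random_projection(cooccurrence, dim=8, seed=0):
    """Project sparse co-occurrence rows into `dim` dimensions with random +1/-1 vectors."""
    rng = random.Random(seed)
    contexts = sorted({c for row in cooccurrence.values() for c in row})
    # One fixed random sign vector per context word.
    index = {c: [rng.choice((-1.0, 1.0)) for _ in range(dim)] for c in contexts}
    embedded = {}
    for word, row in cooccurrence.items():
        vec = [0.0] * dim
        for context, count in row.items():
            for d in range(dim):
                vec[d] += count * index[context][d]
        embedded[word] = vec
    return embedded

# Hypothetical co-occurrence counts (word -> {context word: count}).
cooccurrence = {
    "bardiwac": {"glass": 3, "red": 2, "grapes": 1},
    "wine": {"glass": 3, "red": 2, "grapes": 1},
    "bicycle": {"wheel": 4, "ride": 2},
}
vectors = random_projection(cooccurrence)
```

Words with identical co-occurrence rows receive identical vectors, and words with similar rows land close together, without ever materializing the full co-occurrence matrix in dense form.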
22. Word analogy evaluation
Which word is the most similar to Italy in the same sense as
Paris is similar to France?
X = vector("Paris") - vector("France") + vector("Italy")
Method     Accuracy (%)   Runtime (seconds)   #Threads
Word2Vec   61.4           32,432              16
GloVe      53.6           22,680              16
Ariadne    1.6            15,020              1
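The analogy query can be sketched as a nearest-neighbour search around X. The 2-dimensional vectors below are invented so that the arithmetic works out cleanly; real embeddings have hundreds of dimensions, and the helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    x = [va - vb + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], x))

# Toy vectors: axis 0 loosely encodes the country, axis 1 "capital-ness".
vectors = {
    "France": [1.0, 0.0], "Paris": [1.0, 1.0],
    "Italy": [2.0, 0.0], "Rome": [2.0, 1.0],
    "Germany": [3.0, 0.0], "Berlin": [3.0, 1.0],
}
answer = analogy(vectors, "Paris", "France", "Italy")
```

Accuracy on an analogy benchmark is then the fraction of such queries for which the returned word matches the expected one.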
23. Information retrieval evaluation
Use case: evidence-based medical guideline
Statement: There are no indications to suggest that
a skin-sparing mastectomy followed by
immediate reconstruction leads to a
higher risk of local or systemic
recurrence of breast cancer.
Old references (PMID): 9142378, 1985335
New references (PMID): 9142378, 9694613, 18210199
24. From word embedding to document distance
Doc2Vec: an extension of Word2Vec, that learns to
correlate documents and words, rather than words with
other words
Ariadne: weighted average of word vectors
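The weighted-average route to document similarity can be sketched as a small ranking function. The word vectors, the weight table and the document identifiers are hypothetical; the slides do not specify which weighting scheme is used.

```python
import math

def doc_vector(tokens, word_vectors, weights):
    """Weighted average of the vectors of known tokens (default weight 1.0)."""
    dim = len(next(iter(word_vectors.values())))
    vec, total = [0.0] * dim, 0.0
    for tok in tokens:
        if tok in word_vectors:
            w = weights.get(tok, 1.0)
            total += w
            for d in range(dim):
                vec[d] += w * word_vectors[tok][d]
    return [x / total for x in vec] if total else vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def rank(query_tokens, docs, word_vectors, weights):
    """Order document ids by decreasing similarity to the query."""
    q = doc_vector(query_tokens, word_vectors, weights)
    scored = [
        (cosine(q, doc_vector(toks, word_vectors, weights)), doc_id)
        for doc_id, toks in docs.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical 2-d word vectors and weights.
word_vectors = {
    "mastectomy": [1.0, 0.0], "reconstruction": [0.9, 0.2],
    "recurrence": [0.8, 0.3], "hepatitis": [0.0, 1.0], "virus": [0.1, 0.9],
}
weights = {"mastectomy": 2.0}
docs = {
    "doc_a": ["mastectomy", "recurrence"],
    "doc_b": ["hepatitis", "virus"],
}
ranking = rank(["mastectomy", "reconstruction", "recurrence"], docs, word_vectors, weights)
```

In the guideline use case, the query would be the statement text and the candidates would be article titles or abstracts.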
25. A tiny gold set
29 statements (16 breast cancer, 4 hepatitis C, 4 lung
cancer, 5 ovarian cancer)
103 (96 unique) source articles, 156 (145 unique) target
articles, in total 180 unique articles
66 articles are in both source and target lists, so the
baseline total recall is 42.3% (the average baseline recall
is 45.8%)
These articles were published between 1984 and 2012.
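The baseline figures can be checked with a short calculation. The sketch assumes that recall means the share of target (new) references already present among the source (old) references; the per-statement example reuses the PMIDs from the guideline statement above, and the function name is an assumption.

```python
def recall(retrieved, relevant):
    """Fraction of relevant items that appear among the retrieved items."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant)

# Old references act as "retrieved", new references as "relevant".
old_refs = [9142378, 1985335]
new_refs = [9142378, 9694613, 18210199]
statement_recall = recall(old_refs, new_refs)  # 1 of the 3 new references was already cited

# Total baseline recall: 66 shared articles out of 156 target articles.
total_recall = 66 / 156
```

The total recall pools all statements together, whereas the average recall is presumably the mean of the per-statement values, which explains why the two figures differ.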
29. Now let's talk about concept drift
20 million Medline articles published since 1977
1.5 million entities (subjects, authors, journals, words)
8 five-year periods
Each subject is embedded in 8 chronological vector
spaces
Is there concept drift and can we detect it?
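Because each period has its own vector space, coordinates cannot be compared directly across periods. One possible proxy, shown here purely as an assumption since the slides do not state how stability was measured, is the overlap between a subject's nearest neighbours in consecutive periods. The neighbour sets are short excerpts from the lists given for "trauma nervous system" in this deck.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(neighbours_by_period):
    """Mean Jaccard overlap between neighbour sets of consecutive periods."""
    pairs = list(zip(neighbours_by_period, neighbours_by_period[1:]))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Excerpts of the neighbours of "trauma nervous system" in three periods.
neighbours = [
    {"defensive medicine", "insurance liability", "diagnostic errors"},    # 1997-2002
    {"peripheral nerve injuries", "peripheral nerves", "elbow"},           # 2002-2007
    {"peripheral nerve injuries", "peripheral nerves", "sciatic nerve"},   # 2007-2012
]
score = stability(neighbours)
```

A score near 1 would indicate a subject whose context barely changes, and a score near 0 a subject whose context is replaced from one period to the next.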
31. Most and least stable subjects
Most stable subjects      Least stable subjects
history 15th century      diagnostic techniques surgical
history 18th century      chromium isotopes
history 17th century      shock surgical
history 16th century      iodine isotopes
history 19th century      diagnostic techniques and procedures
thymoma                   blood circulation time
history ancient           trauma nervous system
history medieval          cesium isotopes
rabies                    liver extracts
history                   macroglobulins
32. Subjects most related to trauma nervous system
1977-1982: anatomy regional, fracture fixation internal, bulgaria, piedra, surgery plastic, germany west,
wound infection, carbuncle, burns
1982-1987: legionellosis, povidone, tropocollagen, attention deficit disorder with hyperactivity,
legionnaires disease, transfer psychology
1987-1992: leg injuries, neurosurgical procedures, arm injuries, wound infection, orthopedic equipment,
dermatomycoses, multiple trauma, candidiasis cutaneous, fractures closed
1992-1997: piperacillin, tazobactam, microbiology, diagnostic errors, sorption detoxification,
arthroplasty, hsp40 heat shock proteins, emaciation, professional patient relations
1997-2002: defensive medicine, insurance liability, diagnostic errors, expert testimony, birth injuries,
maleic anhydrides, dimethyl sulfate, medical errors, p protein hepatitis b virus
2002-2007: peripheral nervous system diseases, peripheral nerve injuries, neurologic examination,
male, recovery of function, peripheral nerves, elbow, comorbidity, mother child relations
2007-2012: peripheral nerve injuries, sciatic neuropathy, papilledema, sciatic nerve, peripheral nerves,
nerve crush, neuroma, nerve regeneration, acute disease
2012-2017: mitochondrial dynamics, dental records, park7 protein human, persistent vegetative state,
dnm1l protein human, platelet derived growth factor bb, dual specificity phosphatases,
lingual nerve injuries, dental care
33. Global drift based on Self Organising Maps
- Create document vectors
- Put the documents in a self organizing map
- For each point in the map count the documents in a year range
- Make sub maps for each year range
- Now color code lower than expected as blue and higher than
expected in red
- The result shows global drift
Note that this shows how the content of the Medline
database drifts over time, not necessarily how science drifts over time.
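The counting and colour-coding steps above can be sketched as follows, assuming the self organizing map has already been trained and each document has been assigned to a map cell. The definition of "expected" used here (cell total times period total divided by the number of documents) is an assumption, as are the function name and the toy assignments.

```python
from collections import Counter

def drift_map(assignments):
    """Map (cell, period) to observed minus expected document count.

    `assignments` is a list of (cell, period) pairs, one per document.
    Positive values would be coloured red, negative values blue.
    """
    n = len(assignments)
    cell_totals = Counter(cell for cell, _ in assignments)
    period_totals = Counter(period for _, period in assignments)
    observed = Counter(assignments)
    return {
        (cell, period): observed[(cell, period)]
        - cell_totals[cell] * period_totals[period] / n
        for cell in cell_totals
        for period in period_totals
    }

# Hypothetical assignments: cell (0, 0) is popular early, cell (0, 1) late.
early, late = "1977-1982", "2012-2017"
assignments = (
    [((0, 0), early)] * 3 + [((0, 0), late)] * 1
    + [((0, 1), early)] * 1 + [((0, 1), late)] * 3
)
diff = drift_map(assignments)
```

Plotting one sub map per year range with these differences as colours would give the red and blue pattern described on the slide.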
34. Summary
Semantic indexing enables operations directly on the
underlying semantics
It helps to explore the context of a subject, cluster and
retrieve related documents, and study concept drift
Each method has its own limitations
The choice of method depends on the application