This document proposes a new "Topic Consensus" measure to evaluate the interpretability of statistical topics discovered via topic modeling. It asks Amazon Mechanical Turk workers to assign scientific abstracts to topics and compares their assignments with the topic assignments produced by LDA. The study finds that Topic Consensus correlates well with existing automated measures of topic quality, and that a regression over those measures can predict the consensus value. The measure provides a new perspective on evaluating discovered topics compared to existing methods.
The document is a thesis proposal by Justin Sybrandt at Clemson University that outlines his past and proposed work on exploiting latent features in text and graphs. It summarizes Sybrandt's peer-reviewed work using embeddings to generate biomedical hypotheses from text and validate hypotheses through ranking. It also discusses pending work on heterogeneous bipartite graph embeddings and partitioned hypergraphs. The proposal provides background on Sybrandt's hypothesis generation work and outlines his proposed future research directions involving graph embeddings.
Goal: Provide an overview of data mining
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
This document provides an overview and introduction to data mining. It defines data mining and distinguishes it from databases. The document outlines basic data mining tasks like classification, clustering, regression, and summarization. It also discusses the relationship between data mining and knowledge discovery in databases (KDD) and covers data mining models, development, issues, and metrics.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G...Johann Petrak
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
I argue why I think that Computer Science (or better: Informatics) is a "natural science", in the same sense that physics, astronomy, biology, psychology and sociology are natural sciences: they study a part of the world around us. In that same sense, Informatics studies a part of the world around us.
For a similar talk (including script), but more aimed at a Semantic Web audience in particular, see http://www.cs.vu.nl/~frankh/spool/ISWC2011Keynote/
(or http://videolectures.net/iswc2011_van_harmelen_universal/ for a video registration)
Using a keyword extraction pipeline to understand concepts in future work sec...Kai Li
This document describes a study that uses natural language processing and text mining techniques to identify future work statements in scientific papers and extract keywords from those statements. The researchers developed a multi-step pipeline to first identify the future work section, then select future work sentences within that section. They used rules and algorithms to identify sentences discussing future work. Keywords were then extracted from the selected sentences using the RAKE algorithm. An analysis found that 31.4% of papers contained future work statements, with medical science papers having the highest overlap between future work and title-abstract keywords. The researchers hope this work is a first step toward predicting future research topics.
Wikipedia as an Ontology for Describing DocumentsZareen Syed
The document presents an approach called "Wikitology" that uses Wikipedia as an ontology for describing and summarizing documents. It outlines methods for mapping documents to Wikipedia articles and categories using similarity metrics and spreading activation across the Wikipedia link graph. The authors conducted experiments predicting concepts for single and multiple documents and evaluated accuracy by comparing to human judgments. They discuss applications like document expansion and future work improving the methods and exploiting the Wikipedia structure.
Apply chinese radicals into neural machine translation: deeper than character...Lifeng (Aaron) Han
The document proposes incorporating Chinese radicals into neural machine translation models. It discusses related work incorporating word and character level information into neural MT. The proposed model combines radical-level MT with an attention-based neural model, representing input text with word, character, and radical combinations. Experiments show the character+radical and word+radical models outperform baselines on standard MT evaluation metrics using a Chinese-English dataset. Future work includes improving model optimization and testing on additional data.
Blendle: Diverse recommendations from a vast archiveJasper Oosterman
Industry talk at the RecSysNL meetup November 26th 2020.
Experiment on Viewpoint Diversity is based on the MSc thesis work of Mats Mulder. His thesis can be found here: http://resolver.tudelft.nl/uuid:7def1215-5b30-4536-8b8f-15588e2703e6
There are many examples of text-based documents (all in electronic format)
e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more
Not enough time or patience to read
Can we extract the most vital kernels of information?
So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining them fully first!
Some others (e.g. DNA seq.) are hard to comprehend!
In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.
Intra- and interdisciplinary cross-concordances for information retrieval GESIS
Intra- and interdisciplinary cross-concordances were created to improve information retrieval across heterogeneous collections. The KoMoHe project created 25 vocabularies in 64 cross-concordances, mapping 380,000 terms through 465,000 relations. Information retrieval tests found that the cross-concordances significantly improved recall and precision, with larger gains for interdisciplinary searches than for intradisciplinary searches, where the vocabularies already shared more identical terms. Mapping projects should perform more information retrieval tests to measure the effect of their mappings on search effectiveness.
HyperMembrane Structures for Open Source Cognitive ComputingJack Park
Open source "cognitive computing" systems, specifically OpenSherlock; describes a HyperMembrane structure, a kind of information fabric, for machine reading, literature-based discovery, deep question answering. Platform is open source, uses ElasticSearch, topic maps, JSON, link-grammar parsing, and qualitative process models.
This document provides an overview of text mining and natural language processing. It discusses the basic steps which include preprocessing text like tokenization and named entity recognition. Use cases demonstrated include identifying named entities, detecting economic indicators, sentiment analysis, and topic modeling. Finally, it discusses how machine learning can be applied to solve text mining tasks.
This document provides an overview of natural language processing (NLP). It discusses several commercial applications of NLP including information retrieval, information extraction, machine translation, question answering, and processing user-generated content. It notes that major tech companies have strong NLP research labs. The document then discusses why NLP is important due to the huge amount of online data and need to process large texts. It also notes challenges for computers in understanding language due to their lack of common sense knowledge. The rest of the document outlines various issues and subfields within NLP including syntax, semantics, information extraction, information retrieval, machine translation and more. It concludes by overviewing what will be covered in the NLP course.
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
A large amount of digital text is generated every day, and effectively searching, managing and exploring this text data has become a central task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-based solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing analysis of Twitter users' interests. The experiment process, including data collection, pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
4 text mining and open ended questions in sample surveys ludovic lebart cnrsEvelyn Femat
This document discusses text mining and open-ended questions in sample surveys. It provides an overview of text mining principles and techniques for analyzing open-ended survey responses. Examples are given of open-ended life satisfaction questions from an international survey and responses to marketing copy tests. Methods are described for extracting statistical units from text data and converting texts into numerical data for analysis.
This document describes the process of automatically generating topic pages from scientific documents at Elsevier. It involves tagging documents with concepts from a taxonomy, selecting relevant candidate sentences, training a machine learning model on human-labeled data using active learning, and classifying sentences as definitions or snippets. The resulting topic pages provide freely available information to readers and drive traffic and conversions. An evaluation on a public dataset showed promising results for the definition classification model. The system aims to continuously improve topic page quality through machine learning.
Beyond document retrieval using semantic annotations Roi Blanco
Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and have been adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information about particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
Ron Daniel and Corey Harper of Elsevier Labs present at the Columbia University Data Science Institute: https://www.elsevier.com/connect/join-us-as-elsevier-data-scientists-present-at-columbia-university
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Data Works MD
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Fortune 500 Company Performance Analysis Using Social Networks
Speaker: Yi-Shan Shir
This presentation focuses on the correlation between the financial performance of Fortune 500 companies and their social media relationships and behavior. The findings from this research can assist in the prediction of Fortune 500 stock performance based on a number of social network analysis metrics.
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
This document provides an introduction and overview of an Applied Natural Language Processing course. It introduces the instructors and discusses administrative details like assignments, resources, and communication. Key topics covered in the course are also introduced, including what natural language processing is, why it is difficult, and corpus-based statistical approaches. The goals of the course are to understand natural language analysis problems and solutions, and learn to apply algorithms and use NLP software and resources. Students will complete coding assignments using Python and NLTK and a final group project.
With the goal of building a high-quality academic library collection in mind, the presenters evaluated the value of journal content accessed through journal aggregator databases. Data from aggregator providers and from UlrichsWeb were used to evaluate content with respect to quality, format, coverage and cost. In addition, the presenters shared the analysis with library liaisons to inform them of true holdings and assist them with collection development.
Beverly Geckle
Serials & Government Documents Librarian, Middle Tennessee State University
Suzanne Mangrum
Collection Assessment & Development Librarian, Middle Tennessee State University
NLP
Machine learning
is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
- The document discusses open science and various techniques used in the Data4Impact project such as text analysis, social media data collection from Twitter, and linked open data.
- It provides an overview of science norms and compares traditional CUDOS norms to more open PLACE norms.
- Data4Impact aims to build a knowledge graph linking different data sources to analyze the impact of research and innovation funding through new metrics and indicators. Machine learning and linked open data techniques are applied.
Elsevier aims to construct knowledge graphs to help address challenges in research and medicine. Knowledge graphs link entities like people, concepts, and events to provide answers. Elsevier analyzes text and data to build knowledge graphs using techniques like information extraction, machine learning, and predictive modeling. Their knowledge graph integrates data from publications, clinical records, and other sources to power applications that help researchers, medical professionals, and patients. Knowledge graphs are a critical component for delivering value, especially as data volumes and needs accelerate.
Arc 323 human studies in architecture fall 2018 lecture 3-literature reviewGalala University
This document provides an overview of conducting a literature review for research in architectural engineering. It explains that a literature review surveys various sources to produce more lasting and widely useful knowledge, and should be done throughout the research process, not just at the beginning. The document outlines that a literature review summarizes existing information on a specific topic and places it in the broader context of relevant literature. It also compares a literature review to an annotated bibliography, and discusses the uses of a literature review in identifying a research question, focusing the topic, and understanding ideas and the current conceptual landscape. Finally, it provides guidance on finding resources, developing an organizing and retrieval system, and taking effective notes.
Female Short Creators 120 - Zsolt NemethZsolt Nemeth
A retrospective gallery of 120 (110 plus 10) popular women creators on YouTube and other social media, presented for public browsing.
Amplifying Black Voices: The Power of Social Media Listening & Inclusive Mark...Jasper Colin
As Black History Month 2025 wraps up, social media has only scratched the surface of how different generations engage with Black culture, history, and representation.
Buy Facebook Reactions Boost Your Posts Instantly Sociocosmos.pdfSocioCosmos
Looking to increase engagement on your Facebook posts? Sociocosmos offers a reliable and efficient way to buy Facebook reactions. Get the reactions you need to make your content stand out and reach a wider audience. Enhance your social proof and create a buzz around your posts with our easy-to-use service. We provide genuine reactions, ensuring a natural and organic look for your profile. Discover how Sociocosmos can help you amplify your Facebook presence today.
E-Commerce Platforms
E-commerce platforms are digital systems that enable businesses to create online stores and sell products or services. They provide tools for managing inventory, processing payments, and handling customer interactions.
Types of E-Commerce Platforms:
Business-to-Consumer (B2C):
Platforms like Amazon and Shopify, where businesses sell directly to consumers.
Business-to-Business (B2B):
Platforms like Alibaba, where businesses sell to other businesses.
Consumer-to-Consumer (C2C):
Platforms like eBay, where consumers sell to other consumers.
Consumer-to-Business (C2B):
Platforms where individuals sell products or services to businesses, such as freelance platforms like Fiverr.
Top Social Media Marketing Services in Delhi & Mumbai.pdfrajputkamal8929
At Technians Softech, we provide top-notch social media marketing services in Delhi and Mumbai. With years of experience in digital marketing, our team ensures that brands reach their target audience effectively, boost engagement, and achieve higher ROI.
You Got Your WordPress In My Fediverse / You Got Your Fediverse in My WordPre...John Eckman
WordPress can be a first-class participant in the Fediverse, an open web social network built around the ActivityPub protocol.
Using the ActivityPub plugin gets your content in the Fediverse; using the Friends plugin gets the Fediverse in your WordPress
Unlock your creative potential with BLYX Studio: An all-in-one AI platform fo...SOFTTECHHUB
Are you struggling to create professional-looking visual content for your business or personal projects? In today's digital landscape, compelling visuals can make or break your online presence, but not everyone has the design skills or budget to hire professional creators. That's where BLYX Studio comes in - an innovative AI-powered platform that's changing the game for content creators everywhere.
Text, Topics, and Turkers: A Consensus Measure for Statistical Topics
1. Text, Topics, and Turkers. Hypertext 2015 1
Text, Topics, and Turkers:
A Consensus Measure for Statistical Topics
Fred Morstatter, Jürgen Pfeffer,
Katja Mayer*, Huan Liu
Arizona State University
Tempe, Arizona, USA
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
*University of Vienna
Vienna, Austria
2. Text, Topics, and Turkers. Hypertext 2015 2
Text
Text is everywhere in research.
Text is huge:
Too much data to read.
How can we understand what is going on in
big text data?
Source / Size:
Wikipedia: 36 million pages
World Wide Web: 100+ billion static web pages
Social Media: 500 million new tweets/day
3. Text, Topics, and Turkers. Hypertext 2015 3
Topics
Topic Modeling
Latent Dirichlet Allocation (LDA)
Most commonly-used topic modeling algorithm
Discovers topics within a corpus
Corpus
LDA
K
Topic ID Words
Topic 1 cat, dog, horse, ...
Topic 2 ball, field, player, ...
... ...
Topic K red, green, blue, ...
Topic 1 Topic 2 ... Topic K
Document1 0.2 0.1 0.01
Document2 0.7 0.02 0.1
...
Documentn 0.1 0.3 0.01
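To make the pipeline on this slide concrete, here is a minimal sketch: a corpus goes in, a topic-word table and a document-topic matrix come out. The slides do not name an implementation; scikit-learn and the toy documents below are assumptions.

```python
# Minimal LDA sketch (assumed scikit-learn; toy placeholder documents).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat chased the dog across the field",
    "the player kicked the ball down the field",
    "red and green and blue paint on the wall",
]
K = 3  # number of topics

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(X)                 # rows form the Document x Topic table
vocab = vec.get_feature_names_out()

for k, weights in enumerate(lda.components_):    # one row of word weights per topic
    top = weights.argsort()[::-1][:5]
    print(f"Topic {k + 1}:", ", ".join(vocab[i] for i in top))
print(doc_topic.round(2))                        # e.g. Document1: 0.2, 0.1, ...
```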
4. Text, Topics, and Turkers. Hypertext 2015 4
Topics
LDA
K = 10
Topic ID Words
Topic 1 river, lake, island, mountain, area, park, antarctic, south, mountains, dam
Topic 2 relay, athletics, metres, freestyle, hurdles, ret, divisão, athletes, bundesliga,
medals
... ...
Topic 10 courcelles, centimeters, mattythewhite, wine, stamps, oko, perennial, stubs,
ovate, greyish
Topic 1 Topic 2
...
Topic 10
Document1 0.2 0.1 0.01
Document2 0.7 0.02 0.1
...
Documentn 0.1 0.3 0.01
5. Text, Topics, and Turkers. Hypertext 2015 5
Topics
How can we measure the quality of statistical
topics?
We don't know how well humans can
interpret topics.
Problem: Does their understanding match
what is going on in the corpus?
6. Text, Topics, and Turkers. Hypertext 2015 6
Turkers
One Solution: Crowdsourcing
Example: Amazon's Mechanical Turk
Show LDA results to Turkers
Gauge their understanding
How to effectively measure understanding?
7. Text, Topics, and Turkers. Hypertext 2015 7
Turkers
Previous Work: Chang et al. 2009
Word Intrusion
Topic Intrusion
Corpus
LDA
K
Topic ID Words
Topic 1 cat, dog, horse, ...
Topic 2 ball, field, player, ...
... ...
Topic K red, green, blue, ...
Topic 1 Topic 2 ... Topic K
Document1 0.2 0.1 0.01
Document2 0.7 0.02 0.1
...
Documentn 0.1 0.3 0.01
Word Intrusion
Topic Intrusion
8. Text, Topics, and Turkers. Hypertext 2015 8
Word Intrusion
Show the Turker 6 words in random order
Top 5 words from topic
1 Intruded word
Ask Turker to choose Intruded word
cat dog bird truck horse snake
Topic i:
[Chang et al. 2009]
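A sketch of how one such task could be assembled from a fitted model: take the topic's top five words, pick an intruder that is improbable in this topic but ranks highly in another (the heuristic Chang et al. describe), and shuffle. All function and variable names here are ours.

```python
# Sketch of constructing one word-intrusion item (after Chang et al. 2009).
# topic_word: K x V array of per-topic word probabilities (assumed input).
import numpy as np
import random

def word_intrusion_task(topic_word, vocab, k, n_top=5, seed=0):
    random.seed(seed)
    order = np.argsort(topic_word[k])[::-1]           # words of topic k, best first
    top_words = [vocab[i] for i in order[:n_top]]
    # Intruder: improbable in topic k but a top word of some other topic.
    other = random.choice([j for j in range(len(topic_word)) if j != k])
    low_in_k = set(order[len(order) // 2:].tolist())  # bottom half of topic k
    pool = [i for i in np.argsort(topic_word[other])[::-1][:50] if int(i) in low_in_k]
    intruder = vocab[pool[0]] if pool else vocab[order[-1]]
    words = top_words + [intruder]
    random.shuffle(words)                             # show the six words in random order
    return words, intruder
```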
9. Text, Topics, and Turkers. Hypertext 2015 9
Topic Intrusion
Show the Turker a document
Show the Turker 4 topics
3 most probable topics
1 Intruded topic
Ask Turker to choose Intruded Topic
Documenti
Topic A Topic B Topic C Topic D
[Chang et al. 2009]
10. Text, Topics, and Turkers. Hypertext 2015 10
New Measure: Topic Consensus
Corpus
LDA
K
Topic ID Words
Topic 1 cat, dog, horse, ...
Topic 2 ball, field, player, ...
... ...
Topic K red, green, blue, ...
Topic 1 Topic 2 ... Topic K
Document1 0.2 0.1 0.01
Document2 0.7 0.02 0.1
...
Documentn 0.1 0.3 0.01
Word Intrusion
Topic Intrusion
Complements existing framework
Measures topic quality with the corpus.
Topic Consensus
11. Text, Topics, and Turkers. Hypertext 2015 11
Topic Consensus: Intuition
Measures the agreement between topics and
the sections they come from.
LDA Distribution Turker Distribution
12. Text, Topics, and Turkers. Hypertext 2015 12
Topic Consensus: Calculation
We are comparing probability distributions.
Jensen-Shannon Divergence.
Turker Distribution LDA Distribution
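Following the talk notes (Jensen-Shannon divergence is the average KL divergence of each distribution to their midpoint M, and lower values indicate better consensus), here is a minimal sketch of the comparison; the example distributions are hypothetical.

```python
# Topic Consensus sketch: JSD between the LDA-derived section distribution of a
# topic and the Turkers' assignment distribution. Lower means better consensus.
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    mid = 0.5 * (np.asarray(p, float) + np.asarray(q, float))  # midpoint M
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Hypothetical distributions over the three sections (SH, LS, PE) for one topic:
lda_dist    = [0.70, 0.20, 0.10]   # where the topic's documents actually come from
turker_dist = [0.50, 0.30, 0.20]   # how Turkers assigned the topic's abstracts
print(jsd(lda_dist, turker_dist))  # ~0.03; 0 would be perfect consensus
```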
13. Text, Topics, and Turkers. Hypertext 2015 13
Dataset
Scientific Abstracts
All available abstracts
since 2007.
Classified into three areas:
Social Sciences & Humanities (SH)
Life Sciences (LS)
Physical Sciences (PE)
Ran LDA on this dataset:
K = [10, 25, 50, 100]
185 topics; 4 topic sets.
14. Text, Topics, and Turkers. Hypertext 2015 14
Turkers
One task:
Turkers have 3 + 1 options.
Each task solved 8 times.
16. Text, Topics, and Turkers. Hypertext 2015 16
Other Topic Sets
LDA Topics
Uses a New York Times dataset from one day.
25 topics, 1 topic set
Hand-Picked Topics
Pure Social Science & Humanities
Sampled words that occur only in these documents.
11 topics, 1 topic set
Random Topics
Randomly choose topics according to word distribution
of corpus.
25 topics, 1 topic set
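The random baseline, for instance, could be sampled as in this sketch: each "topic" is a draw of words from the corpus-wide word distribution. Function and variable names are ours.

```python
# Sketch of the random-topic baseline: sample words according to the
# empirical word distribution of the corpus (assumed inputs).
import numpy as np

def random_topics(word_counts, vocab, n_topics=25, n_words=25, seed=0):
    rng = np.random.default_rng(seed)
    p = np.asarray(word_counts, float)
    p /= p.sum()                                  # empirical word distribution
    return [list(rng.choice(vocab, size=n_words, replace=False, p=p))
            for _ in range(n_topics)]
```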
17. Text, Topics, and Turkers. Hypertext 2015 17
Results
[Chart: Topic Consensus per topic set: ERC-10, ERC-25, ERC-50, ERC-100, NYT-25, RAND-25, SH-25]
18. Text, Topics, and Turkers. Hypertext 2015 18
Overview of the Process
Topic Consensus can reveal new information
about the topics being studied.
Can measure topics from a new perspective.
Can help reveal topic confusion.
Drawbacks:
Expensive
Time Consuming
Scalability
19. Text, Topics, and Turkers. Hypertext 2015 19
Automated Measures
1. Topic Size: Number of tokens assigned to the
topic.
2. Topic Coherence: Probability that the top
words co-occur in documents in the corpus.
3. Topic Coherence Significance: Significance of
Topic Coherence compared to other topics.
4. Normalized Pointwise Mutual Information:
Measures the association between the top
words in the topics.
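As an illustration of measure 4, here is a minimal NPMI sketch over document co-occurrence counts; the exact estimator used in the paper may differ, and all names here are ours.

```python
# NPMI sketch: association between a topic's top words, estimated from
# document-level co-occurrence. doc_words: list of word sets, one per document.
import math

def npmi(w1, w2, doc_words):
    n = len(doc_words)
    p1 = sum(w1 in d for d in doc_words) / n
    p2 = sum(w2 in d for d in doc_words) / n
    p12 = sum(w1 in d and w2 in d for d in doc_words) / n
    if p12 == 0:
        return -1.0                      # never co-occur: minimum NPMI
    if p12 == 1:
        return 1.0                       # always co-occur: maximum NPMI
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(top_words, doc_words):
    pairs = [(a, b) for i, a in enumerate(top_words) for b in top_words[i + 1:]]
    return sum(npmi(a, b, doc_words) for a, b in pairs) / len(pairs)
```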
20. Text, Topics, and Turkers. Hypertext 2015 20
Measures
Herfindahl-Hirschman Index (HHI)
Measures concentration of a market.
Used to find monopolies.
Viewed from two perspectives:
5. Word Probability HHI
6. ERC Section HHI
[Chart: ERC Section HHI per topic, across Social Sciences, Physical Sciences, Life Sciences]
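HHI is the sum of squared shares of a distribution: values near 1 mean the topic is concentrated (in one section, or on a few words), values near 1/n mean it is spread out. A quick sketch of both views, with hypothetical numbers:

```python
def hhi(shares):
    """Herfindahl-Hirschman Index: sum of squared shares after normalizing."""
    total = sum(shares)
    return sum((s / total) ** 2 for s in shares)

# ERC Section HHI: how concentrated a topic is in SH / PE / LS (hypothetical):
print(hhi([0.70, 0.20, 0.10]))               # 0.54 - fairly concentrated
# Word Probability HHI: concentration of a topic's top-word probabilities:
print(hhi([0.30, 0.25, 0.20, 0.15, 0.10]))   # 0.225 - more evenly spread
```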
22. Text, Topics, and Turkers. Hypertext 2015 22
Results - Prediction
Build classifier to predict actual Topic
Consensus value.
Build linear regression model:
Takes automated measures.
Predicts Topic Consensus.
RMSE: 0.12 ± 0.02.
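A plausible sketch of that model with scikit-learn; the talk specifies only linear regression over the automated measures, so the feature matrix X and targets y below are placeholders (185 topics, six measures: the four automated measures plus the two HHI views).

```python
# Sketch: predict Topic Consensus from automated measures via linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((185, 6))   # 185 topics x 6 automated measures (placeholder values)
y = rng.random(185)        # measured Topic Consensus per topic (placeholder values)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
print(f"RMSE: {-scores.mean():.2f} +/- {scores.std():.2f}")
```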
23. Text, Topics, and Turkers. Hypertext 2015 23
Acknowledgements
Members of the DMML lab
Office of Naval Research through grant
N000141410095
LexisNexis and HPCC Systems
24. Text, Topics, and Turkers. Hypertext 2015 24
Conclusion
Introduced a new method for evaluating the
interpretability of statistical topics.
Demonstrated this measure on a real-world
dataset.
Automated this measure for scalability.
25. Text, Topics, and Turkers. Hypertext 2015 25
Future Work
How sensitive are measures to top words?
Word Intrusion uses 5
Topic Intrusion uses 5
Topic Consensus uses 25
How do measures fare on different datasets?
Other measures that can reveal quality topics?
#4: Topic modeling --- text summarization
These algorithms are widely used for
#6: Why do I need to measure these topics?
Finding quality topics
Setting value of K in LDA
Choosing the best topic model (LDA, ...)
#7: We need objective measures to evaluate the quality of topics.
#10: Each document gets a score. Can aggregate to get a sense of the model.
This is a measure of the model, by looking at the document.
#11: The Previous measures are good.
Specifically, we are looking at properties of the corpus.
#12: Sections can be like newspaper
Blue is SPORTS Red is BUSINESS
In reality, no topic is going to be purely sports or business. Topics are mixtures over these sections.
We want to know how humans can interpret these mixtures.
Sections can be like Twitter
Blue is protest
Red is
This slide just illustrates the process; I'll get into more details later.
This is a TC calculation for ONE TOPIC
#13: Topic Consensus is calculated as...
KL is the Kullback-Leibler divergence; M is the midpoint distribution, the average of the two distributions being compared.
One side effect of using this measure is that lower scores indicate a better consensus.
#16: If you want good topics you might choose 100...; if you want a good model you might choose 25....
The worst topics by Topic Consensus are often stop-word topics.
Connection to Word Intrusion
Are they really good topics?
#18: Each bar is a group of topics
Bar in the middle is the median
SH does the best ... This is good!
Random does worse ... This is also good!
NYT does the worst ... Why?
#19: Is it possible to find a way to address all of these drawbacks?
Explain the remainder of this paper here.
#20: These are methods used throughout the literature to measure topic quality, we repeat them here.