BEGINNING TEXT ANALYSIS
Barry DeCicco
Ann Arbor Chapter of the American Statistical Association
April 22, 2020
CONTENTS
• Sentiment Scoring with TextBlob.
• Predicting Categories with Machine Learning, using NLTK and scikit-learn.
CREDITS
(UP FRONT!)
• Almost everything I’ve learned about text analytics I learned from posters at Medium.com, particularly their section ‘Towards Data Science’.
• Medium.com has a $5/year subscription, which for the knowledge I’ve gained is a better value than most free resources.
SENTIMENT SCORING
Using TextBlob
WHAT IS SENTIMENT SCORING?
• Sentiment scoring means assigning a positive/negative score to each piece of text (e.g., a comment in a survey or a customer review for a purchase).
• These scores can then be tracked over time, or associated with various cuts in the data (department, division, product, customer demographic).
• The tool used here will be the Python module TextBlob.
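As a minimal sketch of associating scores with a cut of the data, the snippet below averages polarity scores by department. The records, the department names, and the function name `mean_score_by_group` are hypothetical; the scores are illustrative values in the -1 to 1 range that a sentiment tool would return.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (department, polarity) pairs; polarity runs from -1 to 1.
scored_comments = [
    ("Sales", 0.5),
    ("Sales", -0.25),
    ("Support", 0.75),
    ("Support", 0.25),
]

def mean_score_by_group(pairs):
    """Return {group: mean polarity} for an iterable of (group, score) pairs."""
    grouped = defaultdict(list)
    for group, score in pairs:
        grouped[group].append(score)
    return {group: mean(scores) for group, scores in grouped.items()}

summary = mean_score_by_group(scored_comments)
for department, score in sorted(summary.items()):
    print(f"{department}: {score:+.2f}")
```

The same grouping works for any cut (division, product, month) by changing what goes in the first element of each pair.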
TEXTBLOB
• TextBlob is a Python package which does a lot of things with text:
  ◦ Spelling correction
  ◦ Noun phrase extraction
  ◦ Part-of-speech tagging
  ◦ Tokenization (splitting text into words and sentences)
  ◦ Sentiment analysis
CREATING A TEXTBLOB
• Install the package (for example, with pip install textblob).
• In a Python program, load it:
    from textblob import TextBlob
• Run it on some text:
    # Note the misspelling "comortable".
    text = "Absolutely wonderful - silky and sexy and comortable"
    text_lower = text.lower()
    blob_pre = TextBlob(text_lower)
    blob = blob_pre.correct()  # spelling correction returns a new TextBlob
    sentiment = blob.sentiment
    polarity = sentiment.polarity
    subjectivity = sentiment.subjectivity
CREATING A TEXTBLOB - RESULTS
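• text: Absolutely wonderful - silky and sexy and comortable
• text_lower: absolutely wonderful - silky and sexy and comortable
• blob_pre: absolutely wonderful - silky and sexy and comortable
• blob: absolutely wonderful - silk and sex and comfortable [the correction fixed ‘comortable’ but also changed ‘silky’ and ‘sexy’]
• sentiment: Sentiment(polarity=0.7, subjectivity=0.9)
• polarity: 0.7 [on a scale of -1 to 1]
• subjectivity: 0.9 [on a scale of 0 to 1]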
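For reporting, a polarity score can be collapsed into a coarse label. The sketch below assumes a cutoff of 0.1 either side of zero; both the cutoff and the function name `polarity_label` are illustrative choices and not part of TextBlob.

```python
def polarity_label(polarity, cutoff=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    The cutoff is an assumed, tunable threshold, not a TextBlob setting.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and 1")
    if polarity > cutoff:
        return "positive"
    if polarity < -cutoff:
        return "negative"
    return "neutral"

# 0.7 is the polarity reported for the example comment above.
print(polarity_label(0.7))
```

A wider cutoff sends more comments to "neutral"; the right value depends on the data and should be checked against a sample of hand-read comments.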
RESULTS
RESULTS (CON.)
BASIC STEPS IN TEXT ANALYTICS
• If you have a data set with 10,000 comments, the comment variable has close to 10,000 unique values. That makes direct analysis futile in almost all cases.
• Therefore the text values are tokenized:
  ◦ Break text into sentences,
  ◦ Break sentences into words,
  ◦ ‘Standardize’ the words (e.g., set to root form, singularizing plurals and setting verbs to present tense, possibly removing stop words).
TOKENIZATION
• Most comments are unique, resulting in a variable with mostly unique values. That generally makes analysis futile.
• Therefore the text values are tokenized:
  ◦ Break text into sentences,
  ◦ Break sentences into words,
  ◦ ‘Standardize’ the words (e.g., set to root form, singularizing plurals and setting verbs to present tense).
• This converts 10,000 unique values into a smaller set of values. Each text field is now a list of standardized tokens.
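The first two steps, plus stop-word removal, can be sketched with the standard library alone. This is a rough illustration, not what NLTK does internally: the sentence splitter is a simple regular-expression heuristic, and `STOP_WORDS` is a tiny hypothetical list.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"a", "an", "and", "is", "it", "the", "this", "to"}

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    """Lower-case the text, split it into words, and drop stop words."""
    tokens = []
    for sentence in split_sentences(text):
        for word in re.findall(r"[a-z0-9']+", sentence.lower()):
            if word not in STOP_WORDS:
                tokens.append(word)
    return tokens

print(tokenize("Love this dress! It is so pretty."))
```

Reducing each word to a root form is left out here; that step is where stemming and lemmatization come in.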
COMMENTS ON TOKENIZATION
• There are a variety of tools and methods/settings in Python to tokenize. This presentation will use NLTK (Natural Language Toolkit).
• There are trade-offs:
  ◦ Stemming trims words to a root, which is not necessarily a grammatically correct word (‘riding’ => ‘rid’).
  ◦ Lemmatization attempts to find a good root (‘riding’ => ‘ride’).
  ◦ Spelling correction is far from perfect, and can really slow down a program, depending on the misspellings.
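The stemming versus lemmatization trade-off can be seen in a toy comparison. Neither function below is NLTK's algorithm: `crude_stem` just chops a few suffixes, and `lookup_lemma` reads from a small hypothetical dictionary, which stands in for the lexical database a real lemmatizer consults.

```python
# Toy lookup table standing in for a real lemma dictionary (hypothetical).
LEMMAS = {"riding": "ride", "dresses": "dress", "hits": "hit"}

def crude_stem(word):
    """Chop a common suffix off the word; fast, but may not yield a real word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form when known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for word in ("riding", "dresses", "hits"):
    print(word, crude_stem(word), lookup_lemma(word))
```

The stemmer needs no dictionary and handles unseen words, but it can produce non-words; the lookup gives clean roots only for words it knows.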
NLTK PROCESSING
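    import re
    import nltk
    from nltk.corpus import stopwords as nltk_stopwords
    from nltk.stem import WordNetLemmatizer

    # Requires the NLTK data for the tokenizer, stop words, and WordNet (see nltk.download).
    stopwords = set(nltk_stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    text = ("Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i "
            "did bc i never would have ordered it online bc it's petite. i bought a petite and am "
            "5'8\". i love the length on me- hits just a little below the knee. would definitely be a true "
            "midi on someone who is truly petite.")
    # Fix an oddity in import. As extracted, the pattern and replacement are identical,
    # so this line is a no-op; the original pattern was presumably an escaped apostrophe.
    text_fixed = re.sub(r"'", r"'", text)
    text_lower = text_fixed.lower()
    word_tokens = nltk.word_tokenize(text_lower)
    removing_stopwords = [word for word in word_tokens if word not in stopwords]
    lemmatized_word = [lemmatizer.lemmatize(word) for word in removing_stopwords]
    line = ' '.join(map(str, lemmatized_word))
    print(line)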
NLTK PROCESSING - RESULTS
• love dress 's sooo pretty happened find store 'm glad bc never would ordered online bc 's petite bought petite 5 ' 8 '' love length me- hit little knee would definitely true midi someone truly petite
COUNT VECTORIZATION
• One way to turn comments into predictors is to build a set of predictors from the tokens in each comment.
• A dictionary is compiled for the ‘words’ (tokens) in the set of comments, and a number is assigned to each token. The set of numbers and counts can be used as predictors for each comment.
• Two common ways are:
  ◦ Count vectorization.
  ◦ Tf-idf vectorization.
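Count vectorization can be sketched in a few lines of plain Python. The function name `count_vectorize` and the sample comments are illustrative; in practice scikit-learn's `CountVectorizer` does this job and returns a sparse matrix.

```python
from collections import Counter

def count_vectorize(token_lists):
    """Return (vocabulary, matrix) with one row of token counts per comment."""
    # Assign a column number to each distinct token, in sorted order.
    distinct = sorted({token for tokens in token_lists for token in tokens})
    vocabulary = {token: column for column, token in enumerate(distinct)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[vocabulary[token]] = count
        matrix.append(row)
    return vocabulary, matrix

comments = [["love", "dress", "love"], ["dress", "petite"]]
vocabulary, matrix = count_vectorize(comments)
print(vocabulary)
print(matrix)
```

Each row is one comment and each column is one token, so the matrix can be passed to a model as ordinary numeric predictors.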
TF-IDF VECTORIZATION
• An importance weight can be assigned to each token, using the Term Frequency-Inverse Document Frequency (TF-IDF) method.
• In this method, higher term counts within a comment (‘document’) make the token more significant, but higher counts for that token in the entire set of comments (documents) make it less important.
TF-IDF VECTORIZATION (CONTINUED)
• The concept is that a token which appears a lot in a given comment (‘document’) gets upweighted: Term Frequency.
• However, the more commonly that token appears in the overall set of comments, the more it gets downweighted: Inverse Document Frequency.
• For example, ‘the’, ‘and’, ‘or’ would generally get a very low weight. This could be used to automatically disregard stop words.
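The weighting can be worked through by hand with the textbook formula: term count times log(N / document frequency). This is a sketch under that assumption; scikit-learn's `TfidfVectorizer` smooths the idf term and normalizes each row by default, so its numbers differ. The function name `tf_idf` and the sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(token_lists):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(token_lists)
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    weights = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        weights.append({token: count * math.log(n_docs / doc_freq[token])
                        for token, count in term_freq.items()})
    return weights

docs = [["the", "dress", "dress"], ["the", "store"], ["the", "length"]]
weights = tf_idf(docs)
for row in weights:
    print(row)
```

Because ‘the’ occurs in every document, log(3 / 3) is zero and its weight vanishes, while ‘dress’ is boosted by appearing twice in one document and nowhere else.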
EXAMPLE OF TF-IDF VECTORIZATION
• When the data set is divided into 2/3 training data and 1/3 test data, there are 15,160 rows and 1 column.
• After vectorization, there are 15,160 rows by 10,846 columns.
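To illustrate how the shape changes, the toy sketch below starts from a single column of tokenized comments, keeps two thirds for training, and builds a count matrix whose column count equals the number of distinct training tokens. The data, the split by position, and the variable names are all illustrative assumptions; the figures on the slide come from the author's data set and are not reproduced here.

```python
comments = [
    ["love", "dress"],
    ["pretty", "dress"],
    ["true", "midi"],
    ["love", "length"],
    ["glad", "bought", "petite"],
    ["never", "ordered", "online"],
]

# Keep the first two thirds for training; real code would shuffle first.
n_train = (2 * len(comments)) // 3
train, held_out = comments[:n_train], comments[n_train:]

# One column per distinct token seen in the training comments.
vocabulary = sorted({token for tokens in train for token in tokens})
train_matrix = [[tokens.count(token) for token in vocabulary] for tokens in train]

print(f"before: {len(train)} rows x 1 column")
print(f"after:  {len(train_matrix)} rows x {len(vocabulary)} columns")
```

The row count is unchanged by vectorization; only the single text column is replaced by one numeric column per token.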
MACHINE LEARNING
• At this point, the vectorized data can be fed into any machine learning method.
• You can also explore the resulting models to find out which tokens are important.
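As one possible illustration, the sketch below fits a small multinomial naive Bayes classifier in plain Python and then ranks tokens by how strongly they point to one class. Naive Bayes is chosen here only because it is short to write out; the slides do not say which method the author used. The training data, labels, and function names are hypothetical, and scikit-learn's `MultinomialNB` is the usual off-the-shelf version.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(token_lists, labels):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        token_counts[label].update(tokens)
    vocabulary = {t for counts in token_counts.values() for t in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {t: math.log((token_counts[c][t] + 1) / total)
                             for t in vocabulary}
    return log_prior, log_likelihood

def predict(model, tokens):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    log_prior, log_likelihood = model
    scores = {c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens
                                    if t in log_likelihood[c])
              for c in log_prior}
    return max(scores, key=scores.get)

train_tokens = [["love", "dress"], ["pretty", "love"], ["poor", "fit"], ["poor", "fabric"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(train_tokens, train_labels)
print(predict(model, ["love", "fabric"]))

# Rank tokens by how strongly they point to "pos" rather than "neg".
log_prior, log_likelihood = model
ranked = sorted(log_likelihood["pos"],
                key=lambda t: log_likelihood["pos"][t] - log_likelihood["neg"][t],
                reverse=True)
print(ranked[0])
```

The ranking at the end is one simple way to explore a fitted model: tokens with the largest log-likelihood gap between classes are the ones driving its predictions.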
TOPIC MODELING
• There are a number of methods to explore text to find clusters and groups (‘topics’).
QUESTIONS?
REFERENCES
• TextBlob:
  ◦ Introducing TextBlob (https://towardsdatascience.com/having-fun-with-textblob-7e9eed783d3f)
  ◦ Tutorial: QuickStart (https://textblob.readthedocs.io/en/dev/)
• Sentiment Scoring:
  ◦ Statistical Sentiment-Analysis for Survey Data using Python (https://towardsdatascience.com/statistical-sentiment-analysis-for-survey-data-using-python-9c824ef0c9b0)
  ◦ Opinion Mining Of Survey Comments (https://towardsdatascience.com/https-medium-com-sacharath-opinion-mining-of-survey-comments-14e3fc902b10)
• A comparison of methods:
  ◦ NLP Pipeline: Word Tokenization (Part 1) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
  ◦ NLP Pipeline: Part of Speech (Part 2) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-part-of-speech-part-2-b683c90e327d)
  ◦ NLP Pipeline: Lemmatization (Part 3) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-lemmatization-part-3-4bfd7304957)
  ◦ NLP Pipeline: Stemming (Part 4) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stemming-part-4-b60a319fd52)
  ◦ NLP Pipeline: Stop words (Part 5) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936)
  ◦ NLP Pipeline: Sentence Tokenization (Part 6) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-sentence-tokenization-part-6-86ed55b185e6)
• NLTK, Tokenizing, etc.:
  ◦ NLTK documentation (https://www.nltk.org/)
  ◦ Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn (https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.Xp9NsZl7mUl)
  ◦ Tf-idf (https://en.wikipedia.org/wiki/Tf-idf)
  ◦ Scikit-learn site, ‘Working With Text Data’ (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
