Intro 2 text analytics | Ben Taylor @bentaylordata
Text Analytics Are Awesome!
Thank you to our Sponsors!
HIREVUE | TALENT INTERACTION
Agenda
1. Text handling, introduction
2. Sentiment
3. SPAM
4. Levenshtein distance (word, sentence, cloud)
5. Map Reduce / Clustering
6. Interview text analytics
Text handling
Input not expected?
[Diagram, "Model ?": Input → M → Output]
[Diagram, "Model ?": Input → M, no Output]
[Diagram, "Model ?": Input → M → Output, plus Stderr: "You’re an idiot & I don’t like you anymore"]
Need to map unstructured text to summary metric
Sentiment
How are you feeling?
Let’s make this easy.
Problem statement:
Expletives + @skullcandy mention?
Good or bad?
Negative Sentiment
• 1048940088: "I've got two pairs of Ink'd earbuds by @Skullcandy and they both broke in two weeks. I $#@&ing hate @Skullcandy! #$#@&You"
• 1054044204: "$#@& only one headphone stopped working stupid $#@&ing headphones y is it only one headphone i blame you @skullcandy"
• 1376767884: "@skullcandy never buyin another pair of skull candy headphones this is the fourth pair in the last 2 months that $#@&ed up"
• 141343855: "My headphones blew $#@& you skullcandy -___-"
• 16352011: "BAHHHHH My SkullCandys are $#@&ing up AGAIN!"
• 1376767884: "@skullcandy $#@& skullcandy"
Positive Sentiment
• 161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"
• 1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"
• 1117713458: "@skullcandy $#@&in bass is badass"
• 1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"
• 1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"
• 132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."
Neutral Sentiment
• 1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"
Conclusion
Sentiment classification   Count
Negative                   6
Positive                   6
Neutral                    1
46% chance tweet is negative, now what?
Welcome to the majority of the sentiment solutions on the market:
Single-word naïve Bayesian classification
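A minimal sketch of single-word naive Bayes, to show why it stalls on these tweets. The toy training data, the `EXPLETIVE` placeholder token, and the function names are assumptions for illustration; they are not the data or code behind the slides. Because the expletive appears equally often in both classes, it carries no signal on its own.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns per-label word counts and label counts."""
    word_counts, label_counts = {}, Counter()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    return word_counts, label_counts

def posterior(tokens, word_counts, label_counts):
    """Naive Bayes posterior over labels, with add-one smoothing on single words."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)
        for t in tokens:
            score += math.log((counts[t] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the log scores into probabilities.
    top = max(log_scores.values())
    unnorm = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

# Hypothetical toy training set: the expletive occurs in both classes.
docs = [
    (["EXPLETIVE", "broke"], "negative"),
    (["EXPLETIVE", "hate"], "negative"),
    (["EXPLETIVE", "awesome"], "positive"),
    (["EXPLETIVE", "bass"], "positive"),
]
word_counts, label_counts = train(docs)
p_expletive_only = posterior(["EXPLETIVE"], word_counts, label_counts)
p_with_context = posterior(["EXPLETIVE", "broke"], word_counts, label_counts)
```

On this toy set the expletive alone gives an even split between the two labels; only the accompanying word ("broke") tips the estimate toward negative.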
Positive Sentiment (second pass)
• 161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"
• 1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"
• 1117713458: "@skullcandy $#@&in bass is badass"
• 1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"
• 1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"
• 132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."
• 1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"
Conclusion
Sentiment classification   Count
Negative                   6
Positive                   ~0
Neutral                    ~0
~100% chance tweet is negative with tuple assistance. How to find complex tuples automatically!?
[Figure: Bayesian bootstrap matrix; both axes are the unique words in the training cloud]
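The slides do not say how the tuples were built. One simple reading, assumed here, is adjacent word pairs (bigrams) counted per class. The sketch below uses hypothetical toy tweets and an `EXPLETIVE` placeholder token; the function names are illustrative only.

```python
from collections import Counter

def word_pairs(tokens):
    """Adjacent word pairs (bigrams), used here as the simplest kind of tuple."""
    return list(zip(tokens, tokens[1:]))

def pair_counts_by_label(docs):
    """Count how often each word pair occurs under each label."""
    counts = {}
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(word_pairs(tokens))
    return counts

# Hypothetical toy data: the same expletive, but in different company.
docs = [
    (["EXPLETIVE", "you", "skullcandy"], "negative"),
    (["i", "EXPLETIVE", "hate", "skullcandy"], "negative"),
    (["EXPLETIVE", "awesome", "bass"], "positive"),
    (["EXPLETIVE", "yeah"], "positive"),
]
counts = pair_counts_by_label(docs)
```

Here the pair ("EXPLETIVE", "you") occurs only under the negative label and ("EXPLETIVE", "awesome") only under the positive one, so the pair separates the classes where the single word could not.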
Basic sentiment output
Credit: Ben Peters
Keyword    Negative   Positive
warranty   28.7       1
cant       11.8       1
back       11.8       1
break      11.8       1
after      11.1       1
what       9.1        1
never      9.1        1
Don’t      9.1        1
second     8.4        1
side       8.4        1
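The slide does not state how these keyword scores were computed. One plausible reading, assumed here, is a ratio of smoothed relative word frequencies between the two classes, scaled so that the weaker side is 1. The function name, the smoothing constant, and the toy tokens below are hypothetical.

```python
from collections import Counter

def keyword_ratios(neg_tokens, pos_tokens, smoothing=1.0):
    """Return {word: (negative_score, positive_score)} with the smaller side fixed at 1."""
    neg, pos = Counter(neg_tokens), Counter(pos_tokens)
    n_neg, n_pos = sum(neg.values()), sum(pos.values())
    vocab = set(neg) | set(pos)
    rows = {}
    for word in vocab:
        # Smoothed relative frequency of the word in each class.
        p_neg = (neg[word] + smoothing) / (n_neg + smoothing * len(vocab))
        p_pos = (pos[word] + smoothing) / (n_pos + smoothing * len(vocab))
        if p_neg >= p_pos:
            rows[word] = (p_neg / p_pos, 1.0)
        else:
            rows[word] = (1.0, p_pos / p_neg)
    return rows

# Hypothetical toy tokens pooled from negative and positive tweets.
rows = keyword_ratios(
    neg_tokens=["warranty", "warranty", "warranty", "bass"],
    pos_tokens=["bass", "bass", "bass", "warranty"],
)
```

Sorting `rows` by the negative score would give a list shaped like the table above, with the strongest negative keywords first.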
SPAM
I can’t handle this
Lost future customer
SPAM examples:
>80%
SPAM list
Keyword           Spam   Good
@nikesb           52.0   1
@lrgskate         52.0   1
live              34.0   1
know              1      28.8
have              1      22.3
pair              1      16.3
earbud            16.1   1
Non-ascii-chars   12.4   1
some              1      11.9
check             1      11.6
Credit: Ben Peters
Training…
Where do you get your training set?
What about @#tags? Misspellings? SPAM?
Manual trainer
http://54.186.199.209/
Credit: Ben Peters
Levenshtein
Now things are getting interesting
The things we take for granted
You type: Awsome
Computer: It’s actually spelled Awesome
① kitten → sitten (substitution of "s" for "k")
② sitten → sittin (substitution of "i" for "e")
③ sittin → sitting (insertion of "g" at the end)
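The three edits above are what the Levenshtein distance counts. A minimal sketch of the standard dynamic-programming computation follows; the function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Awsome", "Awesome"))  # 1
```

A spell checker can use this to suggest the dictionary word with the smallest distance to what was typed.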
Levenshtein word level
Ref:
I am going skiing tomorrow
Hyp:
I am going skiing on Saturday
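The same distance works on words instead of characters: split each sentence into tokens and count word insertions, deletions and substitutions. A minimal sketch, assuming whitespace tokenization (the slides do not specify a tokenizer); the function names are illustrative.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences of comparable items."""
    prev = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        curr = [i]
        for j, item_b in enumerate(seq_b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def word_level_distance(ref, hyp):
    """Edit distance counted in whole words rather than characters."""
    return edit_distance(ref.split(), hyp.split())

ref = "I am going skiing tomorrow"
hyp = "I am going skiing on Saturday"
print(word_level_distance(ref, hyp))  # 2: substitute "tomorrow" -> "on", insert "Saturday"
```

Dividing the distance by the number of words in the reference gives the usual word error rate, 2/5 for this pair.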
Levenshtein word-cloud level
Ref:
alphanumeric_sort(word_cloud_1)
alphanumeric_sort(unique(word_cloud_1))
Hyp:
alphanumeric_sort(word_cloud_2)
alphanumeric_sort(unique(word_cloud_2))
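% str1 is any sentence; wer is assumed to be a user-supplied word-level Levenshtein function, not a MATLAB built-in
>> wer(str1,str1)
ans = 0
% Sorting the words of the same sentence changes their order, so the distance is no longer 0
>> wer(strjoin(sort(strsplit(str1,' ')),' '),str1)
ans = 15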
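Since word order alone changes the distance, the word-cloud comparison sorts both sides first. A minimal Python sketch of that idea follows, assuming whitespace tokenization and Python's default string ordering in place of `alphanumeric_sort`; the function names and example strings are hypothetical.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences of words."""
    prev = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        curr = [i]
        for j, item_b in enumerate(seq_b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cloud_distance(text_1, text_2):
    """Compare two word clouds: unique words, sorted, then word-level edit distance."""
    cloud_1 = sorted(set(text_1.split()))
    cloud_2 = sorted(set(text_2.split()))
    return edit_distance(cloud_1, cloud_2)

same_words = cloud_distance("b a c", "c a b")
one_swapped = cloud_distance("skiing winter snow", "skiing winter camping")
```

With sorting and de-duplication, two texts that use the same vocabulary in a different order score 0, and the distance grows with the number of words the clouds do not share.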
MapReduce
Great for text processing,
e.g. word counts
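A word count is the usual first MapReduce example. The sketch below imitates the map, shuffle and reduce phases in a single Python process; a real job would run these phases on a cluster framework, and the function names and sample lines are illustrative only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["my headphones broke", "My headphones rock"])
```

Each line can be mapped independently and each word reduced independently, which is why the pattern scales to large tweet collections.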
CLUSTERING
Now things are getting interesting
Group of tweets?
• Once we have categorized tweets we can build word clouds!!!
Category A (could be negative sentiment, low selling areas, etc.)
Category B (could be positive sentiment, high selling areas, etc.)
[Figure: a word cloud for each category]
Levenshtein wordcloud similarity
Cluster 1 example: Camping, Virgin Gaming, Battlefield
Cluster 2 example: Skiing, winter, Stringray
Cluster 3 example: MMA, Boxing, Skateboarding
Twitter Surgery
[Figure: image − image = image]
Training a blacklist filter
• Acting…
• Getting…
• Holding…
• Going…
• Brings…
• Turning…
Blacklist dictionary
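The slides do not describe how the blacklist was trained. One possible approach, assumed here, is to blacklist words that show up in every category's word cloud, since they cannot tell the categories apart. The function names, the `min_fraction` parameter, and the toy clouds are hypothetical.

```python
from collections import Counter

def train_blacklist(clouds, min_fraction=1.0):
    """Blacklist words that occur in at least min_fraction of the word clouds."""
    cloud_freq = Counter()
    for words in clouds:
        cloud_freq.update(set(words))  # count each word once per cloud
    return {word for word, n in cloud_freq.items() if n / len(clouds) >= min_fraction}

def apply_blacklist(words, blacklist):
    """Drop blacklisted words from a word cloud."""
    return [word for word in words if word not in blacklist]

# Hypothetical toy clouds for three categories.
clouds = [
    ["getting", "skiing", "going"],
    ["getting", "camping", "going"],
    ["going", "boxing", "getting"],
]
blacklist = train_blacklist(clouds)
cleaned = apply_blacklist(clouds[0], blacklist)
```

In this toy case "getting" and "going" are shared by all three clouds and are removed, leaving only the topical word in each.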
