A version of my OpenVisConf talk "Bones of a Bestseller" that gives more detail on topic analysis and adds Python code. Blog post and ipynb code here: http://blogger.ghostweather.com/2013/08/pydata-boston-2013-more-on-fiction.html
Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)
4. BASED ON A PREVIOUS TALK:
http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html
THE VIDEO OF THAT TALK:
http://www.youtube.com/watch?v=f41U936WqPM
This talk focuses on some more technical details and more on topic analysis.
The IPython notebook of code samples for this lives here:
http://ghostweather.com/essays/talks/openvisconf/Pydata_Code.ipynb
6. Text Classification (Commonly)
• Bag of words: each document is considered a collection of words, independent of order
• Frequencies of certain words are used to identify the texts
Seems like this should work with sex scenes,
right? Only so many body parts and behaviors,
right?!
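To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The function name and the sample sentences are mine, for illustration only; they are not from the talk's notebook.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

doc_a = "He waves me toward the couch"
doc_b = "Toward the couch he waves me"

# Same words in a different order produce the same bag.
print(bag_of_words(doc_a) == bag_of_words(doc_b))   # True
# Frequencies are what a classifier would later look at.
print(bag_of_words("the cat and the hat")["the"])   # 2
```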
7.
Data | Label
Estdsgfd fdsatreatret dfds | Yes
Dsrdsf drerear ewrewtrew | No
Reret retdrtd rewrewrtew | Yes
Dsfgdg fdsfd | Yes
Algorithm
Train
Test
New data in the wild
8. Sex Scene Detection First Steps
1. Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file.
2. Cut up a doc into 500 word chunks using
Python
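The chunking step might look roughly like this. It is a sketch, assuming plain whitespace splitting is good enough (it throws away paragraph breaks); the function name and the synthetic text are hypothetical, not the code from the notebook.

```python
def chunk_words(text, chunk_size=500):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Tiny demonstration with a synthetic "book" of 1,200 words.
sample_text = " ".join("word%d" % i for i in range(1200))
chunks = chunk_words(sample_text)
print(len(chunks))               # 3 chunks: 500 + 500 + 200 words
print(len(chunks[-1].split()))   # 200
```

The last chunk is usually shorter than the rest; whether to keep or drop it is a judgment call.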
10. "Would you like to sit?" He waves me toward an L-shaped white leather couch.
His office is way too big for just one man. In front of the floor-to-ceiling windows, there's a
modern dark wood desk that six people could comfortably eat around. It matches the
coffee table by the couch. Everything else is white - ceiling, floors, and walls, except for the
wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a
square. They are exquisite - a series of mundane, forgotten objects painted in such precise
detail they look like photographs. Displayed together, they are breathtaking.
"A local artist. Trouton," says Grey when he catches my gaze.
"They're lovely. Raising the ordinary to extraordinary," I murmur, distracted both by him
and the paintings. He cocks his head to one side and regards me intently.
"I couldn't agree more, Miss Steele," he replies, his voice soft, and for some inexplicable
reason I find myself blushing.
Sample of 50 Shades of Grey
21. On to the learning algorithm
So, the training data:
-The text chunks
-The score the raters gave it (averaged) as truth
I started with Python's NLTK (Natural Language Toolkit) and Naïve Bayes for classifying (working
in an IPython notebook).
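Here is a rough sketch of how chunks and averaged rater scores could be turned into the (features, label) pairs that NLTK's classifiers train on. The 0.5 threshold, the "yes"/"no" labels, and the sample ratings are assumptions of mine for illustration; the talk does not say how the averaged score was cut into classes.

```python
def word_feats(words):
    """Bag-of-words features: each word present maps to True."""
    return {word: True for word in words}

def build_training_set(chunks, rater_scores, threshold=0.5):
    """Pair each chunk's features with a label from its averaged score.

    rater_scores[i] is the list of scores raters gave chunks[i].
    The threshold is a hypothetical cutoff, not a value from the talk.
    """
    labeled = []
    for chunk, scores in zip(chunks, rater_scores):
        average = sum(scores) / len(scores)
        label = "yes" if average >= threshold else "no"
        labeled.append((word_feats(chunk.lower().split()), label))
    return labeled

chunks = ["he kissed her slowly", "she filed the quarterly report"]
rater_scores = [[1, 1, 0], [0, 0, 0]]   # made-up ratings, 1 = sex scene
training = build_training_set(chunks, rater_scores)
print([label for _, label in training])   # ['yes', 'no']
```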
22. Resources on NLTK Naïve Bayes
• The NLTK book chapter:
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
• Jacob Perkins' example of sentiment analysis with NLTK:
http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
23. Perkins' NLTK code for this
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag-of-words features: every word present maps to True
    return dict([(word, True) for word in words])

# File ids of the negative and positive movie reviews
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (features, label) pair per review
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use the first 3/4 of each class for training, the rest for testing
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
26. Previously, with less pos data: not so great at 68%
packet (they use a lot of condoms)
27. Python's sklearn (scikit-learn)
Lots of classifiers for sparse data like text!
http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
28. Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my
later samples was in present tense)
Pipelines in sklearn make it incredibly easy to run lots of experiments.
Fit the model, using training data and target answers (in this case, 50 Shades of Grey).
Test the model on new data (in this case, 50 Shades Darker). Check how it did against the
answers.
Now we're at 88%
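A hedged sketch of what such a pipeline can look like, assuming scikit-learn is installed. The crude suffix-stripping tokenizer is my stand-in for a real lemmatizer, and the step names and toy texts are made up; this is not the notebook's actual pipeline or data.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

def crude_stem_tokenizer(text):
    """Lowercase, split into words, and strip a few common endings.

    A stand-in for a real lemmatizer, used here only to keep the
    sketch free of extra downloads.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [re.sub(r"(ing|ed|es|s)$", "", t) if len(t) > 4 else t
            for t in tokens]

model = Pipeline([
    ("counts", CountVectorizer(tokenizer=crude_stem_tokenizer,
                               token_pattern=None)),
    ("clf", SGDClassifier(random_state=0)),
])

# Made-up stand-ins for chunks of the training and test books.
train_texts = ["he kisses her neck", "she signs the contract",
               "they kissed again", "he drives to the office"]
train_labels = ["yes", "no", "yes", "no"]
test_texts = ["she kisses him", "the contract is signed"]
test_labels = ["yes", "no"]

model.fit(train_texts, train_labels)              # fit on one book
accuracy = model.score(test_texts, test_labels)   # check on another
print("accuracy: %.2f" % accuracy)
```

Swapping the classifier or the tokenizer is a one-line change to the step list, which is what makes running many experiments cheap.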
29. Interpreting the results
Let's make a tool!
Demo:
http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
30. Really amazing P.S. here
I paid for coding of a bunch of fan-fiction for sex
scenes too, and fed them into the sklearn SGD
classifier.
(Note that 50 Shades started life as Twilight
fanfic.)
97% accuracy achieved!*
*cross-validating with entire set, not just 50 Shades books.
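For readers new to cross-validation, here is a bare-bones sketch of the idea in plain Python: every item is held out for testing exactly once, and the fold accuracies are averaged. The majority-class "model" and the toy data are placeholders of mine; in practice scikit-learn's own cross-validation helpers would do this for a real classifier.

```python
from collections import Counter

def k_fold_splits(items, k):
    """Yield (train, test) lists so each item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(labeled, k, train_fn):
    """Average accuracy of train_fn's model over k held-out folds."""
    scores = []
    for train, test in k_fold_splits(labeled, k):
        predict = train_fn(train)
        correct = sum(1 for text, label in test if predict(text) == label)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

def train_majority(train):
    """Placeholder model: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: majority

# Made-up labeled chunks from several books.
labeled = [("chunk a", "yes"), ("chunk b", "no"), ("chunk c", "no"),
           ("chunk d", "no"), ("chunk e", "yes"), ("chunk f", "no")]
score = cross_validate(labeled, k=3, train_fn=train_majority)
print("mean accuracy over folds: %.2f" % score)
```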
32. Almost naked, Silas hurled his pale body down the staircase. He knew he
had been betrayed, but by whom? When he reached the foyer, more
officers were surging through the front door. Silas turned the other way
and dashed deeper into the residence hall. The women's entrance. Every
Opus Dei building has one. Winding down narrow hallways, Silas snaked
through a kitchen, past terrified workers, who left to avoid the naked
albino as he knocked over bowls and silverware, bursting into a dark
hallway near the boiler room. He now saw the door he sought, an exit light
gleaming at the end.
Running full speed through the door out into the rain, Silas leapt off the
low landing, not seeing the officer coming the other way until it was too
late. The two men collided, Silas's broad, naked shoulder grinding into the
man's sternum with crushing force. He drove the officer backward onto the
pavement, landing hard on top of him. The officer's gun clattered away.
Silas could hear men running down the hall shouting. Rolling, he grabbed
the loose gun just as the officers emerged. A shot rang out on the stairs,
and Silas felt a searing pain below his ribs. Filled with rage, he opened
fire at all three officers, their blood spraying.
A dark shadow loomed behind, coming out of nowhere. The angry
hands that grabbed at his bare shoulders felt as if they were infused with
the power of the devil himself. The man roared in his ear. "SILAS, NO!"
Silas spun and fired. Their eyes met. Silas was already screaming in
Chapter 96
DaVinci Code
34. Resources for Topic Analysis
• David Mimno's Java Mallet is the one everyone uses:
-http://mallet.cs.umass.edu/index.php
-The R mallet package is rather nice, too:
http://www.cs.princeton.edu/~mimno/R/
-This is a GUI wrapper for mallet that outputs nice csv and html pages:
https://code.google.com/p/topic-modeling-tool/
• Some pure Python (and C) implementations (toy code, primarily) are listed on Blei's website:
http://www.cs.princeton.edu/~blei/topicmodeling.html
37. Pros/Cons vs CMD-Line Mallet
Pros
• Allows specifying a stopword file
• Produces csv and html output in a neat dir structure
• Has a GUI (simpler to just get going)
Cons
• Runs with defaults, so no optimize-interval or other cmd line options
• No diagnostic output (a command-line option)
• Not super-well doc'd
Tutorial on cmd line usage:
http://programminghistorian.org/lessons/topic-modeling-and-mallet
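For comparison, a command-line run can be scripted from Python. The sketch below only builds the two argument lists (import, then train) and does not execute anything. The flags are the ones I believe the Mallet command line accepts; check `bin/mallet train-topics --help` for your version. All file names, the topic count, and the path to the mallet script are placeholders.

```python
def mallet_commands(text_dir, num_topics=20, optimize_interval=10,
                    mallet_bin="bin/mallet"):
    """Return argv lists for importing a corpus and training topics.

    File names and default values here are placeholders.
    """
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", text_dir,
        "--output", "corpus.mallet",
        "--keep-sequence",
        "--remove-stopwords",
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", "corpus.mallet",
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        "--output-topic-keys", "topic_keys.txt",
        "--output-doc-topics", "doc_topics.txt",
        "--diagnostics-file", "diagnostics.xml",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("chapters/", num_topics=48)
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once Mallet is installed.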
48. Maybe I need One More Tool. Any word relations of interest?
Let's try another hairball
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
50. Another tool:
DaVinci Code topics to
chapters mapping
Excitement rating color scale
avg by chapter, ordered
(obviously)
Topics (48ish) per
chapter (108)
Chapter 1 to Chapter 108
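The "avg by chapter" numbers behind a chart like this can be prepared in Python and handed to D3 as JSON. A small sketch, assuming the ratings arrive as (chapter, rating) pairs; the field names and the values are invented for illustration.

```python
import json
from collections import defaultdict

def average_by_chapter(chunk_ratings):
    """chunk_ratings: iterable of (chapter_number, rating) pairs."""
    by_chapter = defaultdict(list)
    for chapter, rating in chunk_ratings:
        by_chapter[chapter].append(rating)
    # Sorted by chapter number so the chart reads first to last.
    return [
        {"chapter": chapter, "avg_rating": sum(r) / len(r)}
        for chapter, r in sorted(by_chapter.items())
    ]

ratings = [(1, 2), (1, 4), (2, 5), (3, 1), (3, 2)]   # made-up values
rows = average_by_chapter(ratings)
print(json.dumps(rows))
```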
51. Ah, but since it's svg/d3
var chart = chart.append("g").attr("translate","0," + y).attr("transform","rotate(90 600 600)");
But, maybe I need chapter
summaries. So I can relate
them to the topics?
52. Add some topic-tooltips
and fade-outs.
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
53. But what did this show?
Some topics are just neither exciting nor dull: topic clustering (as I did it) had little
to do with action scenes. It's slightly helpful for topics, though :)
These nodes are shaded from
gray (dull) to red (exciting)
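A gray-to-red shade can be computed ahead of time in Python and stored with each node. This is a sketch using plain linear interpolation in RGB; the exact gray and red endpoints and the 0 to 1 score range are my assumptions, not the demo's values.

```python
def excitement_color(score, low=0.0, high=1.0):
    """Interpolate from gray (128,128,128) at low to red (255,0,0) at high."""
    t = (score - low) / (high - low)
    t = max(0.0, min(1.0, t))          # clamp out-of-range scores
    gray, red = (128, 128, 128), (255, 0, 0)
    rgb = [round(g + t * (r - g)) for g, r in zip(gray, red)]
    return "#%02x%02x%02x" % tuple(rgb)

print(excitement_color(0.0))   # #808080
print(excitement_color(1.0))   # #ff0000
```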
54. Coming soon
Color words in texts by topic assignment, to help
tune the stopwords and set up next steps:
• Pre-process text for just the verbs?
• Clean out a class of proper names
• Extract sentences containing the topic words
to help describe the topics/texts better
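The last idea could start from something as simple as the sketch below. It assumes a naive split on sentence-ending punctuation, which will stumble on abbreviations and dialogue; the function name and the sample text are hypothetical.

```python
import re

def sentences_with_topic_words(text, topic_words):
    """Return sentences that contain at least one of the topic words."""
    wanted = {w.lower() for w in topic_words}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        s for s in sentences
        if wanted & set(re.findall(r"[a-z']+", s.lower()))
    ]

text = "Silas grabbed the gun. The rain fell. He opened fire!"
print(sentences_with_topic_words(text, ["gun", "fire"]))
# ['Silas grabbed the gun.', 'He opened fire!']
```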
55. Wrapping up
• Python is great for the data munging and analysis
• Some analysis needs serious vis support
• Save yourself some work in JavaScript using Python before you get into js
• D3 is a great tool for iterative interactive exploration of your analysis results
56. THANKS!
@arnicas, Lynn@ghostweather.com
My thanks to:
Luminosity for help with Dan Brown summaries, Jim Vallandingham (@vlandham)
for network parameter and CoffeeScript help.
Hey, I am a consultant for data analysis and visualization. Look me up!
57. A Few More References
• Applied Machine Learning with Scikit-Learn:
http://scikit-learn.github.io/scikit-learn-tutorial/index.html
• Naïve Bayes for text in Scikit-Learn:
http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
• Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
• Nice tutorial overview of working with text data:
http://scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
• Bearcart by Rob Story: Rickshaw timeseries graphs from a Python pandas data structure in 4
lines (https://github.com/wrobstory/bearcart)
• LDA topic modeling tool with UI: https://code.google.com/p/topic-modeling-tool/
• Scott Weingart's nice overview of LDA Topic Modeling in Digital Humanities:
http://www.scottbot.net/HIAL/?p=221
• Elijah Meeks' lovely set of articles on LDA & Digital Humanities vis:
https://dhs.stanford.edu/comprehending-the-digital-humanities/
• Jim Vallandingham's tooltip code and a great demo/tutorial:
http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
• Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw