These slides cover natural language processing techniques for analyzing blogs and other documents: sentence boundary detection, part-of-speech tagging, and text summarization. In particular, they walk through Luhn's summarization algorithm, which scores each sentence as the square of the number of important (frequent) words it contains, divided by the total number of words in the sentence.
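The scoring rule described above can be sketched in a few lines. The `luhn_score` helper below is a simplified, hypothetical illustration (the function name and the naive whitespace tokenization are assumptions, not the book's code):

```python
def luhn_score(sentence, important_words):
    """Luhn-style score: (number of important-word hits) squared,
    divided by the sentence's total word count."""
    words = sentence.lower().split()  # assumption: naive whitespace tokenization
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in important_words)
    return hits ** 2 / float(len(words))

# 2 hits out of 4 words scores 2**2 / 4 = 1.0
score = luhn_score('data mining is fun', set(['data', 'mining']))
```

Sentences dense in significant words score high; long sentences with few hits are penalized by the denominator.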
1. Mining The Social Web
Ch8 Blogs et al.: Natural Language
Processing (and Beyond)
http://Cafe.naver.com/architect1
12. Natural Language Processing

import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

def cleanHtml(html):
    # Strip markup and decode HTML entities, keeping only the text
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})
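The next slide uses `sentences`, `fdist`, and `stop_words` built from each post's content. A self-contained approximation, using `collections.Counter` as a stand-in for NLTK's `FreqDist` (the sample text and the punctuation-stripping tokenization are assumptions):

```python
from collections import Counter

content = 'The fox jumps. The fox sleeps. Dogs bark.'
stop_words = set(['the', 'a', 'an'])  # assumption: tiny stop list for illustration

# Naive tokenization; NLTK's sentence and word tokenizers would be used in practice
sentences = [s.strip() for s in content.split('.') if s.strip()]
words = content.lower().replace('.', '').split()

fdist = Counter(words)               # stands in for nltk.FreqDist(words)
num_words = sum(fdist.values())
num_unique_words = len(fdist)
hapaxes = [w for w, c in fdist.items() if c == 1]  # words appearing exactly once
```

`Counter` supports the same frequency lookups the stats slide needs; `FreqDist` adds conveniences such as `hapaxes()` and frequency-ordered `items()`.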
13. Natural Language Processing

# Basic stats
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

top_10_words_sans_stop_words = [w for w in fdist.items()
                                if w[0] not in stop_words][:10]

print post['title']
print '\tNum Sentences:'.ljust(25), len(sentences)
print '\tNum Words:'.ljust(25), num_words
print '\tNum Unique Words:'.ljust(25), num_unique_words
print '\tNum Hapaxes:'.ljust(25), num_hapaxes
print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
    '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                   for w in top_10_words_sans_stop_words])
print
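The summarization slide that follows consumes a `scored_sentences` list of `(sentence_index, score)` pairs. A minimal sketch of how it might be produced, assuming a precomputed `important_words` set and whitespace tokenization (both assumptions for illustration):

```python
important_words = set(['fox', 'dogs'])   # assumed: top frequent non-stop words
sentences = ['The fox jumps', 'The fox sleeps', 'Dogs bark', 'Nothing here']

scored_sentences = []
for idx, sent in enumerate(sentences):
    words = sent.lower().split()
    hits = sum(1 for w in words if w in important_words)
    # Luhn-style score: important-word hits squared over sentence length
    scored_sentences.append((idx, hits ** 2 / float(len(words))))
```

Keeping the index alongside the score lets the final summary re-emit sentences in their original document order.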
15. Natural Language Processing

# Summarization Approach 1:
# Filter out non-significant sentences by using the average score
# plus a fraction of the std dev as a filter
avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
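A worked example of Approach 1's threshold without NumPy, using the standard-library `statistics` module (the sample scores are made up; `pstdev` is the population standard deviation, matching `numpy.std`'s default):

```python
import statistics

scored_sentences = [(0, 0.5), (1, 2.0), (2, 0.7), (3, 1.8)]  # assumed sample scores

avg = statistics.mean([s[1] for s in scored_sentences])      # 1.25
std = statistics.pstdev([s[1] for s in scored_sentences])    # ~0.658

# Keep sentences scoring above avg + half a standard deviation (~1.579)
mean_scored = [(idx, score) for (idx, score) in scored_sentences
               if score > avg + 0.5 * std]
```

Here only the sentences at indexes 1 and 3 survive the filter. Approach 2 would instead take the N highest-scoring pairs and re-sort them by index so the summary preserves document order.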