Mining The Social Web
Ch8 Blogs et al.: Natural Language Processing (and Beyond)


Presenter: 蟾郁鍵
Aspiring Architects study community
http://Cafe.naver.com/architect1
Natural Language Processing
Processing sentences by machine!
NLP Pipeline With NLTK
EOS detection (finding sentence ends)
Tokenization
POS tagging
Chunking
Extraction
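
These five steps map directly onto NLTK calls. A minimal sketch of the whole pipeline (an illustration, not the slides' code; assumes NLTK with its pretrained Punkt and tagger models downloaded):

import nltk

txt = "Mining the Social Web is a fun read. NLTK does the heavy lifting."

sentences = nltk.tokenize.sent_tokenize(txt)                   # 1. EOS detection
tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]   # 2. Tokenization
pos_tagged = [nltk.pos_tag(t) for t in tokens]                 # 3. POS tagging
chunked = [nltk.ne_chunk(p) for p in pos_tagged]               # 4. Chunking
# 5. Extraction: read named-entity subtrees out of the chunked trees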
Natural Language Processing
Finding sentence ends (EOS Detection)
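
EOS detection is harder than splitting on periods, because abbreviations end with periods too. A small illustration (assuming NLTK's pretrained Punkt sentence tokenizer; the sample text is the book's running example):

import nltk

txt = ("Mr. Green killed Colonel Mustard in the study with the candlestick. "
       "Mr. Green is not a very nice fellow.")

# A naive split on '.' would cut after 'Mr.'; Punkt keeps the
# abbreviation inside its sentence and finds exactly two sentences.
print nltk.tokenize.sent_tokenize(txt)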
Natural Language Processing
Tagging tokens with parts of speech (POS Tagging)
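
For reference, nltk.pos_tag takes a token list and returns (token, Penn Treebank tag) pairs; a small sketch (exact tags depend on the tagger model):

import nltk

tokens = nltk.tokenize.word_tokenize("Mr. Green is not a very nice fellow.")
print nltk.pos_tag(tokens)
# e.g. [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ...]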
Natural Language Processing
Natural Language Processing
Entity extraction (Extraction)
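
Extraction walks the chunked trees looking for named-entity subtrees. A minimal sketch in the NLTK 2 idiom of these slides (.node became .label() in NLTK 3):

import nltk

tree = nltk.ne_chunk(nltk.pos_tag(nltk.tokenize.word_tokenize(
    "Mr. Green killed Colonel Mustard in the study.")))

# Named entities come back as subtrees labeled PERSON, GPE, etc.;
# 'S' is the root node, so everything else is an entity chunk.
entities = [' '.join(word for (word, tag) in subtree.leaves())
            for subtree in tree.subtrees() if subtree.node != 'S']
print entities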
Natural Language Processing
Natural Language Processing

# Harvesting blog posts from a feed and cleaning the HTML
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
            convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})
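
Note this snippet is from the NLTK 2 / BeautifulSoup 3 era: nltk.clean_html and BeautifulStoneSoup have since been removed. A rough modern equivalent of cleanHtml (an assumption, not the slides' code) using bs4:

from bs4 import BeautifulSoup

def clean_html_text(html):
    # get_text() strips tags and resolves HTML entities in one pass
    return BeautifulSoup(html, 'html.parser').get_text()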
Natural Language Processing

# Basic stats
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

top_10_words_sans_stop_words = [w for w in fdist.items()
                                if w[0] not in stop_words][:10]

print post['title']
print '\tNum Sentences:'.ljust(25), len(sentences)
print '\tNum Words:'.ljust(25), num_words
print '\tNum Unique Words:'.ljust(25), num_unique_words
print '\tNum Hapaxes:'.ljust(25), num_hapaxes
print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
    '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                   for w in top_10_words_sans_stop_words])
print
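
The stats assume sentences, fdist, and stop_words were already built for the current post; a sketch of those assumed definitions, following the book's approach (post is one entry of blog_posts):

import nltk

sentences = nltk.tokenize.sent_tokenize(post['content'])
words = [w.lower() for sentence in sentences
         for w in nltk.tokenize.word_tokenize(sentence)]
fdist = nltk.FreqDist(words)
stop_words = nltk.corpus.stopwords.words('english') + ['.', ',', '!', '?']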
Natural Language Processing
Natural Language Processing

import numpy

# Summarization Approach 1:
# Filter out non-significant sentences by using the average score
# plus a fraction of the std dev as a filter

avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences

top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
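
Both approaches keep (sentence index, score) pairs, so the surviving sentences can be re-sorted into document order and joined back into a summary. A sketch (sentences is the post's original sentence list):

mean_scored_summary = ' '.join([sentences[idx]
                                for (idx, score) in mean_scored])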
Natural Language Processing
Natural Language Processing
Luhn's Summarization Algorithm
Score = (number of significant words in the sentence)^2 / (total number of words in the sentence)
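
A simplified per-sentence sketch of that score (the book's actual implementation scores clusters of significant words inside each sentence; here important_words is assumed to be the top frequent non-stopwords, and sentences comes from EOS detection):

import nltk

def score_sentence(sentence, important_words):
    # Score = (significant words in the sentence)^2 / (total words)
    words = [w.lower() for w in nltk.tokenize.word_tokenize(sentence)]
    hits = len([w for w in words if w in important_words])
    return float(hits ** 2) / len(words) if words else 0.0

scored_sentences = [(idx, score_sentence(s, important_words))
                    for (idx, s) in enumerate(sentences)]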
