These slides cover natural language processing techniques for analyzing blogs and other documents: sentence boundary detection, part-of-speech tagging, and text summarization. In particular, they walk through Luhn's summarization algorithm, which scores each sentence as the square of the number of important (frequent) words it contains, divided by the total number of words in the sentence.
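The scoring rule described above can be sketched in a few lines. The `luhn_score` helper below is a simplified, hypothetical illustration (the function name and the naive whitespace tokenization are assumptions, not the book's code):

```python
def luhn_score(sentence, important_words):
    """Luhn-style score: (number of important-word hits) squared,
    divided by the sentence's total word count."""
    words = sentence.lower().split()  # assumption: naive whitespace tokenization
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in important_words)
    return hits ** 2 / float(len(words))

# 2 hits out of 4 words scores 2**2 / 4 = 1.0
score = luhn_score('data mining is fun', set(['data', 'mining']))
```

Sentences dense in significant words score high; long sentences with few hits are penalized by the denominator.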
1. Mining The Social Web
Ch8 Blogs et al.: Natural Language
Processing (and Beyond)
http://Cafe.naver.com/architect1
12. Natural Language Processing

import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

def cleanHtml(html):
    # Strip markup and decode HTML entities, keeping only the text
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})
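The next slide uses `sentences`, `fdist`, and `stop_words` built from each post's content. A self-contained approximation, using `collections.Counter` as a stand-in for NLTK's `FreqDist` (the sample text and the punctuation-stripping tokenization are assumptions):

```python
from collections import Counter

content = 'The fox jumps. The fox sleeps. Dogs bark.'
stop_words = set(['the', 'a', 'an'])  # assumption: tiny stop list for illustration

# Naive tokenization; NLTK's sentence and word tokenizers would be used in practice
sentences = [s.strip() for s in content.split('.') if s.strip()]
words = content.lower().replace('.', '').split()

fdist = Counter(words)               # stands in for nltk.FreqDist(words)
num_words = sum(fdist.values())
num_unique_words = len(fdist)
hapaxes = [w for w, c in fdist.items() if c == 1]  # words appearing exactly once
```

`Counter` supports the same frequency lookups the stats slide needs; `FreqDist` adds conveniences such as `hapaxes()` and frequency-ordered `items()`.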
13. Natural Language Processing

# Basic stats
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

top_10_words_sans_stop_words = [w for w in fdist.items()
                                if w[0] not in stop_words][:10]

print post['title']
print '\tNum Sentences:'.ljust(25), len(sentences)
print '\tNum Words:'.ljust(25), num_words
print '\tNum Unique Words:'.ljust(25), num_unique_words
print '\tNum Hapaxes:'.ljust(25), num_hapaxes
print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
    '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                   for w in top_10_words_sans_stop_words])
print
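The summarization slide that follows consumes a `scored_sentences` list of `(sentence_index, score)` pairs. A minimal sketch of how it might be produced, assuming a precomputed `important_words` set and whitespace tokenization (both assumptions for illustration):

```python
important_words = set(['fox', 'dogs'])   # assumed: top frequent non-stop words
sentences = ['The fox jumps', 'The fox sleeps', 'Dogs bark', 'Nothing here']

scored_sentences = []
for idx, sent in enumerate(sentences):
    words = sent.lower().split()
    hits = sum(1 for w in words if w in important_words)
    # Luhn-style score: important-word hits squared over sentence length
    scored_sentences.append((idx, hits ** 2 / float(len(words))))
```

Keeping the index alongside the score lets the final summary re-emit sentences in their original document order.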
15. Natural Language Processing

# Summarization Approach 1:
# Filter out non-significant sentences by using the average score
# plus a fraction of the std dev as a filter
avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
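A worked example of Approach 1's threshold without NumPy, using the standard-library `statistics` module (the sample scores are made up; `pstdev` is the population standard deviation, matching `numpy.std`'s default):

```python
import statistics

scored_sentences = [(0, 0.5), (1, 2.0), (2, 0.7), (3, 1.8)]  # assumed sample scores

avg = statistics.mean([s[1] for s in scored_sentences])      # 1.25
std = statistics.pstdev([s[1] for s in scored_sentences])    # ~0.658

# Keep sentences scoring above avg + half a standard deviation (~1.579)
mean_scored = [(idx, score) for (idx, score) in scored_sentences
               if score > avg + 0.5 * std]
```

Here only the sentences at indexes 1 and 3 survive the filter. Approach 2 would instead take the N highest-scoring pairs and re-sort them by index so the summary preserves document order.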