Understanding Natural
Language Processing
Lakshya Sivaramakrishnan
Program Coordinator, Women Techmakers
@lakshyas90
What is Natural Language Processing?
Enabling computers to derive meaning from natural language
Source-https://ontotext.com/top-5-semantic-technology-trends-2017/
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Why is NLU so hard?
● Ambiguity
● We don't know how meaning arises from words in our brains
● Huge diversity of languages
● Understanding sarcasm
Example Ambiguities
PROSTITUTES APPEAL TO POPE
Example Ambiguities
KASHMIR HEAD SEEKS ARMS
Example Ambiguities
ASHA WAS FOUND BY
THE RIVER HEAD
Example Ambiguities
ENRAGED COW INJURES FARMER
WITH AXE
Example Ambiguities
SQUAD HELPS DOG BITE VICTIM
NLP Processing Pipeline
Preprocessing: Text extraction → Tokenization → Stopword Removal
Syntax: Morphology → PoS Tagging → Syntactic Parsing
Semantics: Mention Chunking → Entity Type Tagging → Coreference → Entity Resolution
NLP Preprocessing: Text Extraction
#Text Extraction Function
import urllib.request
from bs4 import BeautifulSoup

def getTextFromWebsite(url):
    page = urllib.request.urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, "lxml")
    # Join the text of every <article> element on the page
    text = ' '.join(p.text for p in soup.find_all('article'))
    return text.encode('ascii', errors='replace').decode('ascii').replace("?", " ")
NLP Preprocessing: Tokenization
Breaking text into pieces
#Tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Knowledge is knowing a tomato is a fruit. Wisdom is not putting it in a fruit salad"
sents = sent_tokenize(text)
print(sents)
words = [word_tokenize(sent) for sent in sents]
print(words)
NLP Preprocessing: Stopword Removal
#Stopword Removal
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

# 'text' is the sentence defined on the Tokenization slide
customStopWords = set(stopwords.words('english') + list(punctuation))
wordsWOStopwords = [word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)
Removing punctuation and frequently occurring, less relevant words
NLP Syntax: Morphology
#Stemming
from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import word_tokenize

text = "Mary closed on closing night when she was in the mood to close."
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text)]
print(stemmedWords)
Word: standalone unit of meaning
Lemma: canonical (dictionary) form of a word
Morpheme: minimal grammatical unit
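The snippet above stems; since the slide also defines a lemma, here is a small companion sketch (assuming NLTK's WordNet data has been downloaded) that contrasts stemming with lemmatization on the same sentence.
#Lemmatization (companion sketch)
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "Mary closed on closing night when she was in the mood to close."
lemmatizer = WordNetLemmatizer()
# Treating every token as a verb maps "closed"/"closing" back to the lemma "close"
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokenize(text)]
print(lemmas)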
NLP Syntax: POS Tagging
#POS Tagging
import nltk
from nltk.tokenize import word_tokenize

text = "Lee Sedol cannot defeat AlphaGo."
print(nltk.pos_tag(word_tokenize(text)))
Parts of Speech Tagging
MD: modal - could, will
NNP: proper noun, singular - 'Harish'
RB: adverb - very, silently
VB: verb, base form - take
NLP Syntax: Syntactic Parsing
#Defining Grammar
from nltk import CFG
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "work" | "defeat" | "walked" | "chased"
NP -> "Lee" | "Sedol" | "AlphaGo" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "with"
""")
#Syntactic Parsing
from nltk.parse import RecursiveDescentParser

rd = RecursiveDescentParser(grammar)
sentence = 'the cat chased the dog'.split()
for t in rd.parse(sentence):
    print(t)
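For reference, the single parse this grammar assigns to 'the cat chased the dog' (shown flattened here; NLTK prints it over several lines) is:
(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det the) (N dog))))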
NLP Semantics: Entities
[Lee Sedol]PER can't defeat [AlphaGo]ORG.
#Named Entity Recognition Chunker
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Lee Sedol cannot defeat AlphaGo"
print(ne_chunk(pos_tag(word_tokenize(sentence))))
● geo = Geographical Entity
● org = Organization
● per = Person
● gpe = Geopolitical Entity
● tim = Time indicator
● art = Artifact
● eve = Event
● nat = Natural Phenomenon
NLP Semantics: Entity Resolution
#Entity Resolution
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

for ss in wn.synsets('cool'):
    print(ss, ss.definition())

sense1 = lesk(word_tokenize("The movie has really cool effects"), 'cool')
print(sense1, sense1.definition())
sense2 = lesk(word_tokenize("Dhoni is a cool head on the field"), 'cool')
print(sense2, sense2.definition())
Also called Word Sense Disambiguation
NLP Semantics: Coreference
#Try Stanford CoreNLP
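NLTK ships no coreference resolver, so one common route is to run a local Stanford CoreNLP server and query it over HTTP. The sketch below is illustrative only: it assumes a CoreNLP server is already running on localhost:9000 with the coref annotator available.
#Coreference via a local Stanford CoreNLP server (illustrative sketch)
import json
import requests

text = "Lee Sedol cannot defeat AlphaGo. He has lost four games."
props = {"annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref", "outputFormat": "json"}
response = requests.post("http://localhost:9000/", params={"properties": json.dumps(props)}, data=text.encode("utf-8"))
annotations = response.json()
# Each coreference chain lists the mentions that refer to the same entity
for chain in annotations.get("corefs", {}).values():
    print([mention["text"] for mention in chain])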
NLP Use Cases
Search | Question Answering | Dialog/Assistant | Translation | Other NLP problems
Search
1. Understand query intent.
2. After recognizing the intent,
provide useful links.
Question Answering
● Is the user seeking an answer?
● Map the question to a logical form
● Evaluate it against a database
Dialog/Assistant
● Search + QA + ...
● Understand context
● Ability to request clarification
● Language generation
Translation
● Translation is obviously very challenging.
● Context mistakes can get multiplied.
● Grammar sometimes has no easy mapping from one language to another.
Other NLP Problems
● Predictive keyboard
● Choosing AdWords
● Query suggest
● Language identification
● News clustering
● Sentiment analysis for reviews
● Email/web snippets
● Smart reply
● Document classification
● Query classification
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Sentiment Analysis (Opinion Mining)
WHAT? Analyzing unstructured text to extract subjective information and capture the mood of the person.
HOW? Algorithms scan keywords to categorize a statement as negative or positive based on a simple binary analysis. Ex: enjoyed = good, miserable = bad
Source: brandwatch.com, analyticstraining.com
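As a toy illustration of the keyword-scanning idea described above (the workshop itself uses TextBlob later), a minimal sketch with made-up word lists might look like this:
#Toy keyword-based sentiment (illustrative only)
POSITIVE = {"enjoyed", "good", "great", "love"}
NEGATIVE = {"miserable", "bad", "awful", "hate"}

def keyword_sentiment(statement):
    words = statement.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(keyword_sentiment("I enjoyed the movie"))        # positive
print(keyword_sentiment("The service was miserable"))  # negative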
What are we going to code?
1. Extract twitter data using tweepy and learn how to handle it using
pandas.
i. Extract based on user handles
ii. Extract based on hashtags
iii. Extract with filters
2. Do sentiment analysis of extracted tweets using textblob.
Outcome: % of positive, negative and neutral tweets for a given extraction.
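A condensed sketch of the flow above, assuming Tweepy 3.x-style authentication (OAuthHandler/API with placeholder credentials) and TextBlob's default sentiment analyzer; the workshop notebook adds pandas handling and the extraction filters listed above.
#Tweet sentiment sketch (tweepy 3.x assumed; credentials are placeholders)
import tweepy
from textblob import TextBlob

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

tweets = api.user_timeline(screen_name="lakshyas90", count=50)
labels = []
for tweet in tweets:
    polarity = TextBlob(tweet.text).sentiment.polarity
    labels.append("positive" if polarity > 0 else "negative" if polarity < 0 else "neutral")

# Percentage of positive, negative and neutral tweets for this extraction
for label in ("positive", "negative", "neutral"):
    print(label, round(100 * labels.count(label) / len(labels), 1), "%")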
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Gender Prediction of Indian Names
WHAT?
● Given a name, predict whether it is male or female
● Supervised Learning
● Classification Problem
HOW?
● Exploratory Data Analysis
● Visualizations
● Prediction using Naive Bayes and Support Vector Machines
Source: shutterstock.com
What are we going to code?
1. Exploratory data analysis using numpy and pandas.
2. Visualizations using matplotlib
3. Feature Extraction based on insights gained
4. Gender Prediction given any name
5. Accuracy with train and test datasets.
Outcome: Gender Prediction, Accuracy and Most Informative Features
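A compressed sketch of the classification step, using NLTK's Naive Bayes classifier with simple suffix features; the tiny name list here is only a stand-in for the workshop's Indian-names dataset.
#Gender prediction sketch (toy data in place of the workshop dataset)
import random
import nltk

labeled_names = [("Lakshya", "female"), ("Harish", "male"), ("Asha", "female"),
                 ("Ravi", "male"), ("Priya", "female"), ("Arjun", "male"),
                 ("Meera", "female"), ("Rahul", "male")]

def gender_features(name):
    # Last letters are often informative features for gender
    return {"last_letter": name[-1].lower(), "last_two": name[-2:].lower()}

random.shuffle(labeled_names)
featuresets = [(gender_features(n), g) for n, g in labeled_names]
train_set, test_set = featuresets[2:], featuresets[:2]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features("Lakshmi")))
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)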
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Google Cloud Natural Language API
The Cloud Natural Language API lets you extract entities from text, perform sentiment and syntactic analysis, and classify text into categories. Used for sentiment analysis and entity recognition in a piece of text.
Syntax Analysis: Extract sentences, identify parts of speech and create dependency parse trees for each sentence.
Entity Recognition: Identify entities and label them by types such as person, organization, location, events, products and media.
Sentiment Analysis: Understand the overall sentiment of a block of text.
Integrated REST API: Access via REST API. Text can be uploaded in the request or integrated with Google Cloud Storage.
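A minimal sketch of calling the API from Python, assuming the google-cloud-language client library (language_v1) is installed and application-default credentials are configured; the REST endpoint can also be called directly.
#Cloud Natural Language API sketch (google-cloud-language assumed)
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(content="Lee Sedol cannot defeat AlphaGo.",
                                type_=language_v1.Document.Type.PLAIN_TEXT)

# Overall sentiment of the text block
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)

# Entities with their types
entities = client.analyze_entities(request={"document": document}).entities
for entity in entities:
    print(entity.name, language_v1.Entity.Type(entity.type_).name)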
Cloud ML Natural
Language Demo
Thank you
https://goo.gl/R8vbCT
Lakshya Sivaramakrishnan
@lakshyas90