1. The document discusses an introduction to natural language processing (NLP) including definitions of key NLP concepts and techniques.
2. It provides examples of common NLP tasks such as sentiment analysis, entity recognition, and gender prediction, and shows code for performing these tasks.
3. The document concludes with an overview of the Google Cloud Natural Language API for applying NLP techniques through a REST API.
Pycon India 2018 Natural Language Processing Workshop
2. What is Natural Language Processing?
Enable computers to derive meaning from natural
language
Source-https://ontotext.com/top-5-semantic-technology-trends-2017/
3. Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
5. Why is NLU so hard?
Ambiguity
We don't know how meaning arises from words in our brains
Huge diversity of languages
Understanding sarcasm
13. NLP Preprocessing: Text Extraction
Preprocessing:
Text extraction
Tokenization
Stopword Removal
#Text Extraction Function (updated for Python 3: urllib2 was split into urllib.request)
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTextFromWebsite(url):
    page = urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, "lxml")
    text = ' '.join(p.text for p in soup.find_all('article'))
    # replace non-ASCII characters with spaces
    return text.encode('ascii', errors='replace').decode().replace("?", " ")
14. NLP Preprocessing: Tokenization
Breaking text into pieces
#Tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = ("Knowledge is knowing a tomato is a fruit. "
        "Wisdom is not putting it in a fruit salad")
sents = sent_tokenize(text)
print(sents)
words = [word_tokenize(sent) for sent in sents]
print(words)
15. NLP Preprocessing: Stopword Removal
#Stopword Removal
from nltk.corpus import stopwords
from string import punctuation

customStopWords = set(stopwords.words('english') + list(punctuation))
wordsWOStopwords = [word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)
Removing punctuation and frequently occurring, less informative words
16. NLP Syntax: Morphology
Syntax:
Morphology
PoS Tagging
Syntactic Parsing
#Stemming
from nltk.stem.lancaster import LancasterStemmer

text = "Mary closed on closing night when she was in the mood to close."
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text)]
print(stemmedWords)
Word: Standalone unit of meaning
Lemma: Canonical form of word
Morpheme: minimal grammatical unit
17. NLP Syntax: POS Tagging
#POS Tagging
text = "Lee Sedol cannot defeat AlphaGo."
print(nltk.pos_tag(word_tokenize(text)))
Parts of Speech Tagging
MD: modal - could, will
NNP: proper noun, singular - 'Harish'
RB: adverb - very, silently,
VB: verb, base form - take
18. NLP Syntax: Syntactic Parsing
#Defining Grammar
from nltk import CFG
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "work" | "defeat" | "walked" | "chased"
NP -> "Lee" | "Sedol" | "AlphaGo" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "with"
""")
#Syntactic Parsing
from nltk.parse import RecursiveDescentParser

rd = RecursiveDescentParser(grammar)
sentence = 'the cat chased the dog'.split()
for t in rd.parse(sentence):
    print(t)
19. NLP Semantics: Entities
Mention Chunking
Entity Type Tagging
Coreference
Entity Resolution
Semantics
[Lee Sedol] can't defeat [AlphaGo].
PER ORG
#Named Entity Recognition Chunker
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Lee Sedol cannot defeat AlphaGo"
print(ne_chunk(pos_tag(word_tokenize(sentence))))
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon
20. NLP Semantics: Entity Resolution
#Entity Resolution
from nltk.corpus import wordnet as wn
for ss in wn.synsets('cool'):
    print(ss, ss.definition())

from nltk.wsd import lesk
sense1 = lesk(word_tokenize("The movie has really cool effects"), 'cool')
print(sense1, sense1.definition())
sense2 = lesk(word_tokenize("Dhoni is a cool head on the field"), 'cool')
print(sense2, sense2.definition())
Also called Word Sense Disambiguation
24. Question Answering
Is the user seeking an answer?
Map to logical form
Evaluate against a database
25. Dialog/ Assistant
Search + QA + ...
Understand context
Ability to request clarification
Language generation
26. Translation
Translation is obviously very challenging.
Context mistakes can get multiplied.
Grammar sometimes has no easy mapping from one language to another.
27. Other NLP Problems
Predictive keyboard
Choosing AdWords
Query suggest
Language identification
News clustering
Sentiment analysis for reviews
Email/web snippets
Smart reply
Document classification
Query classification
28. Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
29. Sentiment Analysis
Opinion Mining
WHAT? Analyzing unstructured text to extract subjective information and capture the mood of the person.
HOW? Algorithms scan keywords to categorize a statement as negative or positive based on a simple binary analysis. Ex: enjoyed = good, miserable = bad
Source : brandwatch.com, analyticstraining.com
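The keyword-scanning approach described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop's actual implementation (which uses textblob); the positive/negative word lists here are tiny and hand-picked for the example.

```python
# Minimal keyword-based sentiment sketch: count positive vs negative
# keywords and classify by the sign of the score.
POSITIVE = {"enjoyed", "good", "great", "love", "excellent"}
NEGATIVE = {"miserable", "bad", "terrible", "hate", "awful"}

def keyword_sentiment(text):
    """Classify text as positive/negative/neutral by keyword counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I enjoyed the workshop"))  # positive
print(keyword_sentiment("What a miserable day"))    # negative
```

Real libraries like textblob go beyond this binary counting (e.g. polarity scores and negation handling), but the core idea is the same.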
30. What are we going to code?
1. Extract twitter data using tweepy and learn how to handle it using
pandas.
i. Extract based on user handles
ii. Extract based on hashtags
iii. Extract with filters
2. Do sentiment analysis of extracted tweets using textblob.
Outcome: % of positive, negative and neutral tweets for a given extraction.
Use case
31. Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
32. Gender Prediction of Indian Names
WHAT? Given a name, predict whether it is male or female: a supervised learning classification problem.
HOW? Exploratory data analysis, visualizations, and prediction using Naive Bayes and Support Vector Machines.
Source : shuttershock.com
33. What are we going to code?
1. Exploratory data analysis using numpy and pandas.
2. Visualizations using matplotlib
3. Feature Extraction based on insights gained
4. Gender Prediction given any name
5. Accuracy with train and test datasets.
Outcome: Gender Prediction, Accuracy and Most Informative Features
Use case
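The Naive Bayes step above can be sketched by hand. This toy version uses a single feature, the last letter of the name (many Indian female names end in "a" or "i"); the training list and feature choice are illustrative assumptions, not the workshop's dataset or code.

```python
from collections import defaultdict
import math

# Tiny illustrative training set: (name, gender) pairs.
TRAIN = [("Harish", "M"), ("Ramesh", "M"), ("Arjun", "M"), ("Vikram", "M"),
         ("Priya", "F"), ("Anita", "F"), ("Lakshmi", "F"), ("Kavita", "F")]

def train(data):
    """Count class priors and last-letter frequencies per class."""
    prior = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for name, gender in data:
        prior[gender] += 1
        counts[gender][name[-1].lower()] += 1
    return prior, counts

def predict(name, prior, counts, alphabet=26):
    """Pick the class maximizing log P(class) + log P(last letter | class)."""
    last = name[-1].lower()
    total = sum(prior.values())
    best, best_lp = None, -math.inf
    for g in prior:
        lp = math.log(prior[g] / total)
        # add-one (Laplace) smoothing over the 26-letter alphabet
        lp += math.log((counts[g][last] + 1) / (prior[g] + alphabet))
        if lp > best_lp:
            best, best_lp = g, lp
    return best

prior, counts = train(TRAIN)
print(predict("Sunita", prior, counts))  # F
```

NLTK's `NaiveBayesClassifier` does the same computation over arbitrary feature dicts and also reports the most informative features, which is the outcome listed above.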
34. Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
35. Google Cloud Natural Language API
The Cloud Natural Language API lets you extract entities from text, perform sentiment and syntactic analysis, and classify text into categories. Used for sentiment analysis and entity recognition in a piece of text.
Syntax Analysis: Extract sentences, identify parts of speech and create dependency parse trees for each sentence.
Entity Recognition: Identify entities and label them by type, such as person, organization, location, event, product and media.
Sentiment Analysis: Understand the overall sentiment of a block of text.
Integrated REST API: Access via REST API. Text can be uploaded in the request or integrated with Google Cloud Storage.
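As a sketch of the REST interface, a sentiment request is a POST to the v1 `documents:analyzeSentiment` method (`https://language.googleapis.com/v1/documents:analyzeSentiment`, authenticated with an API key or OAuth token) with a JSON body like the following; the sample text is illustrative.

```json
{
  "document": {
    "type": "PLAIN_TEXT",
    "content": "I enjoyed the workshop."
  },
  "encodingType": "UTF8"
}
```

The response contains a document-level sentiment score and magnitude, plus per-sentence sentiment. To use text stored in Google Cloud Storage, replace `content` with a `gcsContentUri` field.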