Understanding Natural
Language Processing
Lakshya Sivaramakrishnan
Program Coordinator, Women Techmakers
@lakshyas90
What is Natural Language Processing?
Enabling computers to derive meaning from natural language
Source-https://ontotext.com/top-5-semantic-technology-trends-2017/
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Why is NLU so hard?
● Ambiguity
● We don't know how meaning arises from words in our brains
● Huge diversity of languages
● Understanding sarcasm
Example Ambiguities
PROSTITUTES APPEAL TO POPE
Example Ambiguities
KASHMIR HEAD SEEKS ARMS
Example Ambiguities
ASHA WAS FOUND BY
THE RIVER HEAD
Example Ambiguities
ENRAGED COW INJURES FARMER
WITH AXE
Example Ambiguities
SQUAD HELPS DOG BITE VICTIM
NLP Processing Pipeline
Preprocessing: Text extraction → Tokenization → Stopword Removal
Syntax: Morphology → PoS Tagging → Syntactic Parsing
Semantics: Mention Chunking → Entity Type Tagging → Coreference → Entity Resolution
NLP Preprocessing: Text Extraction
#Text Extraction Function
import urllib.request
from bs4 import BeautifulSoup

def getTextFromWebsite(url):
    page = urllib.request.urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, "lxml")
    # Join the text of every <article> element on the page
    text = ' '.join(p.text for p in soup.find_all('article'))
    return text.encode('ascii', errors='replace').decode('ascii').replace("?", " ")
NLP Preprocessing: Tokenization
Breaking text into pieces
#Tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Knowledge is knowing a tomato is a fruit. Wisdom is not putting it in a fruit salad"
sents = sent_tokenize(text)
print(sents)
words = [word_tokenize(sent) for sent in sents]
print(words)
NLP Preprocessing: Stopword Removal
#Stopword Removal
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

# 'text' is the sentence defined on the Tokenization slide
customStopWords = set(stopwords.words('english') + list(punctuation))
wordsWOStopwords = [word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)
Removing punctuation and frequently occurring, less relevant words
NLP Syntax: Morphology
#Stemming
from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import word_tokenize

text = "Mary closed on closing night when she was in the mood to close."
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text)]
print(stemmedWords)
Word: standalone unit of meaning
Lemma: canonical (dictionary) form of a word
Morpheme: minimal grammatical unit
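The snippet above stems; since the slide also defines a lemma, here is a small companion sketch (assuming NLTK's WordNet data has been downloaded) that contrasts stemming with lemmatization on the same sentence.
#Lemmatization (companion sketch)
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "Mary closed on closing night when she was in the mood to close."
lemmatizer = WordNetLemmatizer()
# Treating every token as a verb maps "closed"/"closing" back to the lemma "close"
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokenize(text)]
print(lemmas)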
NLP Syntax: POS Tagging
#POS Tagging
import nltk
from nltk.tokenize import word_tokenize

text = "Lee Sedol cannot defeat AlphaGo."
print(nltk.pos_tag(word_tokenize(text)))
Parts of Speech Tagging
MD: modal - could, will
NNP: proper noun, singular - 'Harish'
RB: adverb - very, silently
VB: verb, base form - take
NLP Syntax: Syntactic Parsing
#Defining Grammar
from nltk import CFG
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "work" | "defeat" | "walked" | "chased"
NP -> "Lee" | "Sedol" | "AlphaGo" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "with"
""")
#Syntactic Parsing
from nltk.parse import RecursiveDescentParser

rd = RecursiveDescentParser(grammar)
sentence = 'the cat chased the dog'.split()
for t in rd.parse(sentence):
    print(t)
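For reference, the single parse this grammar assigns to 'the cat chased the dog' (shown flattened here; NLTK prints it over several lines) is:
(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det the) (N dog))))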
NLP Semantics: Entities
[Lee Sedol]PER can't defeat [AlphaGo]ORG.
#Named Entity Recognition Chunker
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Lee Sedol cannot defeat AlphaGo"
print(ne_chunk(pos_tag(word_tokenize(sentence))))
● geo = Geographical Entity
● org = Organization
● per = Person
● gpe = Geopolitical Entity
● tim = Time indicator
● art = Artifact
● eve = Event
● nat = Natural Phenomenon
NLP Semantics: Entity Resolution
#Entity Resolution
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

for ss in wn.synsets('cool'):
    print(ss, ss.definition())

sense1 = lesk(word_tokenize("The movie has really cool effects"), 'cool')
print(sense1, sense1.definition())
sense2 = lesk(word_tokenize("Dhoni is a cool head on the field"), 'cool')
print(sense2, sense2.definition())
Also called Word Sense Disambiguation
NLP Semantics: Coreference
#Try Stanford CoreNLP
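NLTK ships no coreference resolver, so one common route is to run a local Stanford CoreNLP server and query it over HTTP. The sketch below is illustrative only: it assumes a CoreNLP server is already running on localhost:9000 with the coref annotator available.
#Coreference via a local Stanford CoreNLP server (illustrative sketch)
import json
import requests

text = "Lee Sedol cannot defeat AlphaGo. He has lost four games."
props = {"annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref", "outputFormat": "json"}
response = requests.post("http://localhost:9000/", params={"properties": json.dumps(props)}, data=text.encode("utf-8"))
annotations = response.json()
# Each coreference chain lists the mentions that refer to the same entity
for chain in annotations.get("corefs", {}).values():
    print([mention["text"] for mention in chain])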
NLP Use Cases
Search | Question Answering | Dialog/Assistant | Translation | Other NLP problems
Search
1. Understand query intent.
2. After recognizing the intent,
provide useful links.
Question Answering
● Is the user seeking an answer?
● Map the question to a logical form
● Evaluate it against a database
Dialog/Assistant
● Search + QA + ...
● Understand context
● Ability to request clarification
● Language generation
Translation
● Translation is obviously very challenging.
● Context mistakes can get multiplied.
● Grammar sometimes has no easy mapping from one language to another.
Other NLP Problems
● Predictive keyboard
● Choosing AdWords
● Query suggest
● Language identification
● News clustering
● Sentiment analysis for reviews
● Email/web snippets
● Smart reply
● Document classification
● Query classification
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Sentiment Analysis (Opinion Mining)
WHAT? Analyzing unstructured text to extract subjective information and capture the mood of the person.
HOW? Algorithms scan keywords to categorize a statement as negative or positive based on a simple binary analysis. Ex: enjoyed = good, miserable = bad
Source: brandwatch.com, analyticstraining.com
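As a toy illustration of the keyword-scanning idea described above (the workshop itself uses TextBlob later), a minimal sketch with made-up word lists might look like this:
#Toy keyword-based sentiment (illustrative only)
POSITIVE = {"enjoyed", "good", "great", "love"}
NEGATIVE = {"miserable", "bad", "awful", "hate"}

def keyword_sentiment(statement):
    words = statement.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(keyword_sentiment("I enjoyed the movie"))        # positive
print(keyword_sentiment("The service was miserable"))  # negative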
What are we going to code?
1. Extract twitter data using tweepy and learn how to handle it using
pandas.
i. Extract based on user handles
ii. Extract based on hashtags
iii. Extract with filters
2. Do sentiment analysis of extracted tweets using textblob.
Outcome: % of positive, negative and neutral tweets for a given extraction.
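A condensed sketch of the flow above, assuming Tweepy 3.x-style authentication (OAuthHandler/API with placeholder credentials) and TextBlob's default sentiment analyzer; the workshop notebook adds pandas handling and the extraction filters listed above.
#Tweet sentiment sketch (tweepy 3.x assumed; credentials are placeholders)
import tweepy
from textblob import TextBlob

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

tweets = api.user_timeline(screen_name="lakshyas90", count=50)
labels = []
for tweet in tweets:
    polarity = TextBlob(tweet.text).sentiment.polarity
    labels.append("positive" if polarity > 0 else "negative" if polarity < 0 else "neutral")

# Percentage of positive, negative and neutral tweets for this extraction
for label in ("positive", "negative", "neutral"):
    print(label, round(100 * labels.count(label) / len(labels), 1), "%")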
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Gender Prediction of Indian Names
WHAT?
● Given a name, predict whether it is male or female
● Supervised Learning
● Classification Problem
HOW?
● Exploratory Data Analysis
● Visualizations
● Prediction using Naive Bayes and Support Vector Machines
Source: shutterstock.com
What are we going to code?
1. Exploratory data analysis using numpy and pandas.
2. Visualizations using matplotlib
3. Feature Extraction based on insights gained
4. Gender Prediction given any name
5. Accuracy with train and test datasets.
Outcome: Gender Prediction, Accuracy and Most Informative Features
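A compressed sketch of the classification step, using NLTK's Naive Bayes classifier with simple suffix features; the tiny name list here is only a stand-in for the workshop's Indian-names dataset.
#Gender prediction sketch (toy data in place of the workshop dataset)
import random
import nltk

labeled_names = [("Lakshya", "female"), ("Harish", "male"), ("Asha", "female"),
                 ("Ravi", "male"), ("Priya", "female"), ("Arjun", "male"),
                 ("Meera", "female"), ("Rahul", "male")]

def gender_features(name):
    # Last letters are often informative features for gender
    return {"last_letter": name[-1].lower(), "last_two": name[-2:].lower()}

random.shuffle(labeled_names)
featuresets = [(gender_features(n), g) for n, g in labeled_names]
train_set, test_set = featuresets[2:], featuresets[:2]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features("Lakshmi")))
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)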
Aim of the workshop
1. Understand NLP jargon and use cases
2. Use case on Sentiment Analysis
3. Use case on Gender Prediction
Google Cloud Natural Language API*
Google Cloud Natural Language API
The Cloud Natural Language API lets you extract entities from text, perform sentiment and syntactic analysis, and classify text into categories. Used for sentiment analysis and entity recognition in a piece of text.
Syntax Analysis: Extract sentences, identify parts of speech and create dependency parse trees for each sentence.
Entity Recognition: Identify entities and label them by types such as person, organization, location, events, products and media.
Sentiment Analysis: Understand the overall sentiment of a block of text.
Integrated REST API: Access via REST API. Text can be uploaded in the request or integrated with Google Cloud Storage.
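A minimal sketch of calling the API from Python, assuming the google-cloud-language client library (language_v1) is installed and application-default credentials are configured; the REST endpoint can also be called directly.
#Cloud Natural Language API sketch (google-cloud-language assumed)
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(content="Lee Sedol cannot defeat AlphaGo.",
                                type_=language_v1.Document.Type.PLAIN_TEXT)

# Overall sentiment of the text block
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)

# Entities with their types
entities = client.analyze_entities(request={"document": document}).entities
for entity in entities:
    print(entity.name, language_v1.Entity.Type(entity.type_).name)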
Cloud ML Natural
Language Demo
Thank you
https://goo.gl/R8vbCT
Lakshya Sivaramakrishnan
@lakshyas90