TOPIC MODELING IN NLP & EASY OCR
Presenters:
AMNA BIBI (19-CP-04)
IBRAHIM AHMED (19-CP-24)
CONTENTS OF THIS PRESENTATION
"TOPIC MODELING IN NLP" PRESENTED BY AMNA BIBI (19-CP-04)
"EASY OCR" PRESENTED BY IBRAHIM AHMED (19-CP-24)
WHAT IS TOPIC MODELING IN NLP
In Natural Language Processing, the term "topic" means a set of
words that "go together".
Topic modeling is a technique that assigns topics to the documents
in a corpus based on the groups of related words present in them.
It acts as a form of feature reduction: it lets us focus on the
relevant material rather than sifting through all of the text.
WHY IS TOPIC MODELING IMPORTANT
Topic modeling is important because, in a world full of data,
it has become increasingly important to categorize documents.
For example, when a company receives hundreds of reviews,
it needs to know which categories of reviews matter most
and which matter less.
SUPPOSE:
There are 1000 documents and each document has 500 words, so processing them
word by word means handling 500 * 1000 = 500,000 words. If the documents are
instead grouped into 5 topics, each described by up to 500 words, the
processing covers just 5 * 500 = 2,500 words.
Data is processed through the following steps:
• Tokenization: split the text into sentences and the sentences into
  words. Lowercase the words and remove punctuation.
• Words that have fewer than 3 characters are removed.
• All stop words are removed.
• Words are stemmed: suffixes are stripped to reach a root form, which
  may not be a real word.
• Words are lemmatized: reduced to their dictionary base form, which is
  a real word with actual meaning.
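The steps above can be sketched in plain Python. This is a minimal illustration only: the stop-word list and the suffix-stripping `naive_stem` helper are hypothetical stand-ins for the fuller resources a library such as NLTK provides, and lemmatization is left out because it needs a dictionary.

```python
import re

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller list).
STOPWORDS = {"the", "and", "are", "was", "were", "for", "with", "this", "that", "into"}

def naive_stem(word):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenize: lowercase, then keep only runs of letters (drops punctuation and digits).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove words with fewer than 3 characters.
    tokens = [t for t in tokens if len(t) >= 3]
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Reduce each remaining word to a crude root form.
    return [naive_stem(t) for t in tokens]

print(preprocess("The cats were chasing the mice, and the dogs barked!"))
# ['cat', 'chas', 'mice', 'dog', 'bark']
```

Note that "chasing" becomes "chas", which is not a real word. That is the difference between stemming and lemmatization that the last two steps describe.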
TOPIC MODELING IS AN UNSUPERVISED MACHINE LEARNING ALGORITHM!
Topic modeling is a machine learning technique that
automatically analyzes text data to determine clusters
of words for a set of documents.
This is known as "unsupervised" machine learning
because it doesn't require a predefined list of tags or
training data that has been previously classified by humans.
TOPIC MODELING TECHNIQUES
Some of the well-known topic modeling techniques are:
1. Latent Semantic Analysis (LSA)
2. Probabilistic Latent Semantic Analysis (PLSA)
3. Latent Dirichlet Allocation (LDA)
4. Correlated Topic Model (CTM)
LATENT DIRICHLET ALLOCATION (LDA)
LDA, short for Latent Dirichlet Allocation, is a technique
used for topic modeling.
• Latent means hidden: the topics exist in the text but
  are not observed directly.
• Dirichlet refers to the Dirichlet distribution, which the
  model assumes as the prior over the mix of topics in each
  document and the mix of words in each topic.
• Allocation means assigning something, which in this
  case is topics to the words of each document.
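To make the "Dirichlet" part concrete, here is a minimal sketch that draws a document's topic proportions from a Dirichlet distribution using only the standard library. The helper name `sample_dirichlet`, the number of topics, and the alpha values are illustrative assumptions, not settings taken from the slides.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so runs are repeatable
num_topics = 3

# Small alpha: a document tends to concentrate on a few topics.
sparse_mix = sample_dirichlet([0.1] * num_topics, rng)
# Large alpha: topics tend to be mixed more evenly.
even_mix = sample_dirichlet([10.0] * num_topics, rng)

print("sparse:", [round(p, 3) for p in sparse_mix])
print("even:  ", [round(p, 3) for p in even_mix])
```

Each sample is a list of non-negative proportions that sum to 1, which is what lets it serve as "how much of each topic is in this document".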
TOPIC EXTRACTION
• The algorithm was first introduced in 2003
  and treats topics as probability
  distributions over the occurrence of
  different words.
• Topics are extracted from the corpus using
  two probabilities: the probability that a
  document contains a certain topic, and the
  probability that a word belongs to that
  specific topic.
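The two probabilities combine as p(word | document) = sum over topics of p(word | topic) * p(topic | document). A minimal sketch follows, in which every number, topic name, and function name is a made-up illustration rather than output from a trained model.

```python
# Hypothetical per-topic word probabilities, p(word | topic).
topic_word = {
    "sports":  {"match": 0.5, "team": 0.4, "bank": 0.1},
    "finance": {"match": 0.1, "team": 0.1, "bank": 0.8},
}
# Hypothetical topic proportions for one document, p(topic | document).
doc_topic = {"sports": 0.25, "finance": 0.75}

def word_probability(word):
    """p(word | document): mix the per-topic word probabilities by the topic proportions."""
    return sum(doc_topic[t] * topic_word[t].get(word, 0.0) for t in doc_topic)

def most_likely_topic(word):
    """Topic that contributes most to this word's probability in the document."""
    return max(doc_topic, key=lambda t: doc_topic[t] * topic_word[t].get(word, 0.0))

print(round(word_probability("bank"), 3))   # 0.625
print(most_likely_topic("team"))            # sports
print(most_likely_topic("bank"))            # finance
```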
TOPIC EXTRACTION STEPS
1. Pre-process the text: lemmatize it and remove
   stop words and punctuation.
2. Remove contextually less relevant words.
3. Perform batch-wise LDA, which provides
   topics in batches.
MAIN LIBRARIES
• Gensim ("Generate Similar") is a popular open-source
  natural language processing (NLP) library used for
  unsupervised topic modeling.
• NLTK is a toolkit built for working with NLP in
  Python, that is, with human language data. It provides
  various text processing libraries (tokenization,
  lemmatization, parsing, etc.) along with many test
  datasets.
• pyLDAvis is an open-source Python library that
  helps in analyzing the clusters created by LDA
  through highly interactive visualizations.
USAGE / BENEFITS
• Extracting: extracting the words from a document takes more
  time and is much more complex than extracting them from the
  topics present in the document.
• Discovering: discovering hidden topical patterns that are
  present across the collection.
• Annotating: annotating documents according to these topics.
• Using: using these annotations to organize, search, and
  summarize texts.
APPLICATIONS
• Chatbots
• Question answering
• Health care
• Recommendation systems
• Similarity detection
• Sentiment analysis
• Text categorization
• SEO
WHAT IS OCR?
• OCR stands for OPTICAL CHARACTER
  RECOGNITION.
• OCR is a technology that analyzes the text of a
  page and turns the letters into character codes that
  can be used to process information.
• OCR systems combine hardware and software
  to turn physical documents into machine-readable
  text.
HOW DOES OCR WORK?
• Image Pre-Processing
• AI Character Recognition
• Post-Processing
PRE-PROCESSING
The document is converted from its physical form
to a digital form, such as an image.
The purpose of this stage is to make the machine's
representation accurate while also
removing any unwanted artifacts.
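The slides do not name a specific pre-processing technique. One common choice is binarization, sketched here under that assumption: each grayscale pixel is mapped to ink or background with a fixed threshold. The function name, the threshold of 128, and the sample patch are all illustrative.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 for dark 'ink' or 0 for light background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

# A hypothetical 3x4 grayscale patch: low values are dark ink, high values are paper.
patch = [
    [250, 30, 40, 245],
    [240, 35, 200, 250],
    [235, 20, 25, 240],
]
for row in binarize(patch):
    print(row)
# [0, 1, 1, 0]
# [0, 1, 0, 0]
# [0, 1, 1, 0]
```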
AI CHARACTER RECOGNITION
AI analyzes the dark portions of the image to recognize letters and
numerals. Typically, AI uses one of the following approaches to target one
letter, phrase, or paragraph at a time:
• Pattern Recognition: the AI system is trained on a range of languages,
  text formats, and handwriting. The program compares each detected
  character image with the patterns it has already learned to find
  matches.
• Feature Recognition: the algorithm uses rules based on specific
  character properties to recognize new characters. The number of angled,
  crossing, or curved lines in a letter is one example of a feature.
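Pattern recognition can be illustrated with a toy template matcher: a detected glyph is compared pixel by pixel with stored templates and the closest one wins. The 3x3 templates and function names are hypothetical, and real OCR engines use far richer learned models.

```python
# Hypothetical 3x3 binary templates (1 = ink) for two characters.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
}

def similarity(a, b):
    """Fraction of pixels on which two equally sized bitmaps agree."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in pairs) / len(pairs)

def recognize(glyph):
    """Return the label of the best-matching template together with its score."""
    best = max(TEMPLATES, key=lambda label: similarity(glyph, TEMPLATES[label]))
    return best, similarity(glyph, TEMPLATES[best])

noisy_l = ["100", "100", "110"]  # an 'L' with one missing pixel
label, score = recognize(noisy_l)
print(label, round(score, 3))  # L 0.889
```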
POST-PROCESSING
AI corrects errors in the final file during
post-processing. One approach is to teach the AI a
glossary of terms that will appear in the document,
then limit the AI's output to those words/formats
so that no interpretation falls outside the
vocabulary.
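A minimal sketch of the glossary idea, using `difflib.get_close_matches` from the Python standard library to snap each recognized token to its nearest glossary term. The glossary, the sample tokens, the function name, and the 0.6 cutoff are illustrative assumptions.

```python
import difflib

def constrain_to_glossary(tokens, glossary, cutoff=0.6):
    """Replace each OCR token with its closest glossary term, if one is similar enough."""
    corrected = []
    for token in tokens:
        matches = difflib.get_close_matches(token, glossary, n=1, cutoff=cutoff)
        # Keep the original token when nothing in the glossary is close enough.
        corrected.append(matches[0] if matches else token)
    return corrected

glossary = ["invoice", "total", "amount", "date"]
print(constrain_to_glossary(["inv0ice", "tota1", "amount", "xyz"], glossary))
# ['invoice', 'total', 'amount', 'xyz']
```

This sketch leaves unmatched tokens unchanged. A stricter system could reject them instead, which is closer to limiting output to the vocabulary.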
IMPEDIMENTS TO OCR PERFORMANCE
• The image can be skewed or non-oriented
• Colored and varying background patterns
• Text in glared or blurry images
OCR USE CASES BY INDUSTRY
• Banking
• Insurance
• Legal
• Healthcare
• Tourism
• Retail
SOME COMPANIES PROVIDING OCR SERVICES
COMPANY                  AREA OF FOCUS                                               CUSTOMERS                      TYPES OF SOLUTION
Google Cloud Vision API  Document recognition, data capture                          Chevron, Texas A&M University  Continuously trained ML
ABBYY FineReader         Document recognition, data capture, language processing     Dell, Fujitsu, HP, Siemens     Continuously trained ML
PDFelement               Document data extraction                                    Hitachi, Deloitte              Template based
Rossum                   Document data extraction                                    Bloomberg, IBM, Nvidia         Continuously trained ML
EASYOCR
• EasyOCR was created by Jaided AI.
• Jaided (pronounced jai-ded) means "courageous
  mind" in Thai.
• Jaided AI was founded in 2020.
• Its first project is an open-source OCR library
  called EasyOCR.
• EasyOCR detects text in 80+ languages across scripts
  including Latin, Chinese, Arabic, Devanagari, and Cyrillic.
EASYOCR
EasyOCR's main interface consists of one class and one method:
1. Reader class
2. readtext method
EASYOCR
• Reader class: base class for EasyOCR.
  It defines which languages the model will
  detect, whether the model uses the GPU, and other
  useful options, such as how models are downloaded.
• readtext method: main method of a Reader object.
  It runs text detection on an image and controls
  how and in what form the results are
  returned.
INSTALLING EASYOCR LIBRARY
!pip install easyocr
USING EASYOCR
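Detecting Text in Image
import cv2
import easyocr

# Load the Traditional Chinese and English models.
reader = easyocr.Reader(['ch_tra', 'en'])

# readtext accepts a file path...
result = reader.readtext('chinese_tra.jpg')
# Each entry is (bounding box, text, confidence), for example:
# [([[448, 111], [917, 111], [917, 243], [448, 243]], '高鐵左營站', 0.9247),
#  ([[454, 214], [629, 214], [629, 290], [454, 290]], 'HSR', 0.9931)]

# ...or an image already loaded with OpenCV.
img = cv2.imread('chinese_tra.jpg')
result = reader.readtext(img)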
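Because each detection is a (bounding box, text, confidence) tuple, results can be post-filtered with ordinary Python. The sketch below keeps only confident detections. `keep_confident` and its threshold are hypothetical helpers, not part of EasyOCR, and the sample list is hard-coded so the snippet runs without the library or the image.

```python
def keep_confident(results, min_conf=0.95):
    """Return the text of detections whose confidence is at least min_conf."""
    return [text for _box, text, conf in results if conf >= min_conf]

# Detections in the (bounding box, text, confidence) form that readtext returns.
results = [
    ([[448, 111], [917, 111], [917, 243], [448, 243]], "高鐵左營站", 0.9247),
    ([[454, 214], [629, 214], [629, 290], [454, 290]], "HSR", 0.9931),
]
print(keep_confident(results))       # ['HSR']
print(keep_confident(results, 0.9))  # ['高鐵左營站', 'HSR']
```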
USING EASYOCR
The standard output may look too complicated for
many uses. You can get simpler output by passing the
optional argument detail, like this:
reader.readtext('chinese_tra.jpg', detail = 0)
This is what you will get:
['高鐵左營站', 'HSR', 'Station', '汽車臨停接送區',
'Kiss', 'Car', 'and', 'Ride']
USING EASYOCR
PARAGRAPH PARAMETER
Another useful optional argument for the readtext method is
paragraph. With paragraph=True, EasyOCR will try to
combine the raw results into an easy-to-read paragraph. Here is the
result of reader.readtext('chinese_tra.jpg', detail = 0,
paragraph=True):
['高鐵左營站 HSR Station 汽車臨停接送區 Car Kiss and
Ride']
THANK YOU!!!
