Steps involved in Preprocessing:

1. Tokenization:

● Tokenization: the process of breaking a stream of text into words.
● Remove punctuation marks and numbers.
● Replace newline characters ('\n') with spaces.
● Split the string using a space as the delimiter.
● Output: a list of tokens.

Graphical view of the steps in Tokenization:

Stream of text -> Removal of punctuation marks -> Replacing '\n' by spaces -> Using spaces as delimiter -> Tokens (words)

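The tokenization steps above can be sketched in a few lines of Python. This is a minimal illustration and not the deck's actual code: the function name `tokenize` is hypothetical, and "punctuation" is assumed to mean the ASCII punctuation in `string.punctuation`.

```python
import string


def tokenize(text):
    """Break a stream of text into a list of word tokens."""
    # Step 1: remove punctuation marks and numbers (ASCII only, by assumption).
    drop_table = str.maketrans("", "", string.punctuation + string.digits)
    cleaned = text.translate(drop_table)
    # Step 2: replace newline characters with spaces.
    cleaned = cleaned.replace("\n", " ")
    # Step 3: split on a space; skip empty strings left by repeated spaces.
    return [token for token in cleaned.split(" ") if token]


tokens = tokenize("Hello, world!\nThis is test 42.")
print(tokens)
```

Because punctuation is deleted rather than replaced by a space, a word such as "can't" becomes the single token "cant" in this sketch.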
2. Removal of stop words:

● Input: the list of tokens.
● Remove unnecessary words such as "the", "an", "so", "after", "all", etc. (stop words).
● Output: a list of meaningful words.

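A minimal sketch of stop-word removal, assuming a small hand-written stop list. Only "the", "an", "so", "after" and "all" come from the slide. The remaining entries, the name `remove_stop_words` and the case-insensitive comparison are illustrative assumptions.

```python
# Illustrative stop list: the first five words are from the slide,
# the rest are assumed additions for the example.
STOP_WORDS = {"the", "an", "so", "after", "all", "a", "is", "of", "and", "to"}


def remove_stop_words(tokens):
    """Return the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


meaningful = remove_stop_words(["The", "cat", "sat", "after", "all", "an", "hour"])
print(meaningful)
```

A set is used for the stop list so that each membership check takes constant time on average.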
3. Stemming:

● Stemming: the process of reducing inflected words to their stem, base, or root form.
  For example, a stemming algorithm reduces "fishing", "fished", "fish", and "fisher" to the root word "fish".
● Stemmer used: Porter Stemmer algorithm.
● It removes suffixes such as '-ee', '-ed', '-ing', '-ence', '-er', etc., and adds 'y' or 'i' as required.
● It doesn't always give accurate (dictionary) roots.
  Example: stem(flying) = fli
           stem(fly) = fli
● All inflected forms get the same root, which serves our purpose.

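To illustrate the idea of suffix stripping, here is a deliberately simplified toy stemmer. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem. The function name `simple_stem`, the suffix list and the minimum stem length of three letters are all assumptions made for this sketch. In practice a library implementation of the Porter stemmer would be used instead.

```python
# Toy suffix list (longest first), taken from the suffixes named on the slide.
SUFFIXES = ("ence", "ing", "ed", "er", "ee")


def simple_stem(word):
    """Strip at most one known suffix, then map a trailing 'y' to 'i'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words like "red" are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Replace a final 'y' with 'i' so "fly" and "flying" share one stem.
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word


for w in ["fishing", "fished", "fish", "fisher", "flying", "fly"]:
    print(w, "->", simple_stem(w))
```

As on the slide, the result "fli" is not a dictionary word. What matters for the later steps is that the inflected forms collapse to the same string.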
4. Vocabulary creation:

● Vocabulary: generally, a vocabulary is a set of words.
● Vocabulary = union of the words from all files.
● For each document: convert the list obtained after stemming into a set and take the union.
● The vocabulary is processed further for tf-idf evaluation.
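
The vocabulary step can be sketched as a running set union over the stemmed documents. The name `build_vocabulary` and the sample stem lists are hypothetical. Each document is assumed to be a list of stems as produced by the previous step.

```python
def build_vocabulary(stemmed_documents):
    """Return the union of the stems from all documents as a set."""
    vocabulary = set()
    for stems in stemmed_documents:
        # Convert each document's stem list to a set and merge it in.
        vocabulary |= set(stems)
    return vocabulary


docs = [["fish", "fli", "fish"], ["fli", "bird"]]
print(sorted(build_vocabulary(docs)))
```

Sorting the result before use gives each word a stable position, which is convenient when the vocabulary later indexes the tf-idf vectors.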