This document presents a stemming algorithm for Bengali verbs. It uses a two module approach: 1) suffix stripping using a predefined list of suffixes, and 2) applying rules to derive the root form. The algorithm achieves an efficiency of 99.36% on 450 inflections of 14 selected root verbs. While the results are satisfactory, the authors acknowledge that testing on a larger dataset could yield better performance. They plan to expand testing to more Bengali verbs in future work.
2. CONTENTS
Introduction
Difficulty of the language
Overview of stemming
Related Work
Pictorial Representation of proposed approach
Example of each step
Module 1(Suffix Stripping)
Module 2(Applying Rules)
Explanation of module 2 with example
Algorithm
Partial view of input file
List of suffixes
Partial view of output file
Efficiency and time complexity
Conclusion and future work
2
3. INTRODUCTION
Stemming is an operation that splits a word into the
constituent root part and affix without doing
complete morphological analysis.
For example, Eating = Eat + -ing
Worked = Work + -ed
The main purpose of stemming is to reduce different
grammatical forms / word forms of a word like its
noun, adjective, verb, adverb etc. to its root form.
We can say that the goal of stemming is to reduce
inflectional forms and sometimes derivationally
related forms of a word to a common base form.
3
4. DIFFICULTY OF THE LANGUAGE
Bengali is one of the most morphologically rich language
and stemming of Bengali verb is the most problematic
area for Stemming.
Sometimes, nearly 10x5 forms for a certain verb in
Bengali may appear in different contexts.
4
5. OVERVIEW OF STEMMING
A typical simple stemmer algorithm involves
removing suffixes using a list of frequent suffixes,
while a more complex one would use morphological
knowledge to derive a stem from the words.
5
6. RELATED WORK
In 1980 Martin Porter developed the Porter Stemmer.It
uses the fact that English language suffixes are mostly a
combination of smaller and simpler suffixes.
Ramanathan and Rao (2003) proposed a lightweight
stemmer for Hindi which has used a hand crafted suffix list
and has performed longest match stripping
Dasgupta and Ng (2006) proposed unsupervised
morphological parsing of Bengali. When evaluated on a
set of 4,110 human-segmented Bengali words, the
algorithm achieves 83% success.
Majgaonker and Siddiqui (2010) developed an
unsupervised approach for Marathi stemmer. The
maximum accuracy observed is 82.5% for the statistical
suffix stripping approach.
Suba et al. (2011) proposed two stemmers for Gujarati, with
an average accuracy of about 90.7%.
6
12. ALGORITHM
STEP1: Start of algorithm.
STEP 2: Create 4 new string[] namely splits1[], splits2[ ] and splits3[ ].
STEP 3: Read the contents of the doc files and split the words by space ( )
separator.
3.1. Store the words of each sentence in splits1[ ].
3.2. Store the inflexions in splits2[ ].
3.3. Store the desired root words in splits3[ ].
STEP 4: Declare and initialize variables l1=length of splits1[ ] , 12=length of
splits2[ ] .
STEP 5: Fetch the inflected verb forms in input1[] from splits1[i] if /verb is
contained by the currently fetched word. This step is repeated 11
times.
5.1. Determine the subroot from input1[i] by repeating the steps
12 times.
5.1.1. if splits2[j] in contained in input1[i] then,
5.1.1.a. Declare variable index which stores the index
of last occurrence of splits2[j] in input1[i]. 12
13. 13
5.1.1.b. If index is greater than equal to 2 then,
5.1.1.b.i. Store the substring of input1[i] from
begindex=0 to endindex=index in input1[i].
5.1.1.b.ii. Break the loop.
5.2. Determine the actual root input1[i] by repeating the steps l1 times.
5.2.1. Check the ending kar of input1[i].
5.2.1.a. I f input1[i] ends with e-kar(爨 ), o-kar(爭 ),
a-kar(爭 ) or aa-kar( ) then, replace it
with aa-kar( ).
5.2.1.b. if length of input1[i] is less than 3, concate it
with aa-kar( ).
5.2.2. Check the starting kar of input1[i].
5.2.2.a. if input1[i] starts with e-kar (爨 ), then
replace it with a-kar(爭 ).
5.2.2.b. if input1[i] starts with u-kar ( ), then
replace it with o-kar(爭 ).
5.2.2.c. if input1[i] starts with a-kar(爭 ), then
replace it with aa-kar( ).
14. 14
STEP 6: Generate the output doc file by copying the contents of
splits1[] and concatenating it with their obtained root
words from input1[] wherever the word contains /verb.
STEP 7: Compare the obtained sentences in splits1[ ] with the
desired sentences in splits3[ ] and calculate the efficiency.
STEP 8: End of algorithm.
18. EFFICIENCY:
Dealing with 450 inflections of 14 selected root verbs, the
proposed approach gives an efficiency of 99.36%.
TIME COMPLEXITY:
The time complexity of the proposed algorithm in worst
case is O(n2).
EFFICIENCY & TIME COMPLEXITY
18
19. CONCLUSION AND FUTURE WORK
In this project, we present a lightweight stemmer for
14 selected Bengali Verbs that strips the suffixes
using a predefined suffix list, on a longest match
basis, and then finds root on basis of some rules.
Except a few cases, the result obtained from our
algorithm is quite satisfactory according to our
expectation.
We argue that a stronger and populated learning set
would invariably yield better result. In future , we
plan to test our algorithm with more sets of Bengali
verbs.
19