Gender Detection in Blogs
(Project Number - 17)
A Project Report
Submitted by
Group Number - 37
Subba Reddy 201406632
Rashmi Sharma 201405581
Abhijeet Thakur 201264203
Guided by
Dr. Vasudev Verma
Mentored by
Vishrut Mehta
For the course
Information Retrieval and Extraction
IIIT, Hyderabad
April, 2015
1. Abstract
The question addressed in this paper is: given a short text document, can we identify
whether the author is a man or a woman? This question is motivated by recent events in which
people faked their gender on the Internet. Note that this is different from the authorship
attribution problem.
Three machine learning algorithms (support vector machine, Bayesian logistic regression
and AdaBoost decision tree) are designed for gender identification based on 545
psycho-linguistic and gender-preferential cues along with stylometric features.
Of these three, the support vector machine gives the highest accuracy, 85.1%, in
gender identification.
2. Project Scope
The goal of this project is, given a blog, to analyze the specific features in
the text that differentiate whether it was written by a male or a female.
The features can be anything. For example, if a blog is about dresses or cats, it
may be written by a female, and if a blog is about sports, suits, etc., it may be
written by a male. In this project, however, the analysis should also cover the salient features that
differentiate the text content itself, and not rely merely on the topic of the text.
3. Related Systems
• Authorship identification: Authorship is determined by checking whether one piece
of text contains significantly longer words than another. Histograms of word-length
distribution were also used for the same purpose.
• Gender Guesser: This tool attempts to determine an author's gender based on
the words used. Submitted text is evaluated based on two types of writing: formal
and informal. Formal writing includes fiction and non-fiction stories, articles, and
news reports. Informal writing includes blog and chat-room text.
• Author gender identification from text: Researchers presented
a group of lexical, syntactic and pragmatic features that distinguish the
language style of women, namely the use of specialized vocabulary, expletives, tag.
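The word-length histograms mentioned under authorship identification can be sketched in a few lines of Python. This is only an illustration, not the method of any system listed above: the function names and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def _words(text):
    # Assumed tokenizer: runs of letters and apostrophes count as words.
    return re.findall(r"[A-Za-z']+", text)


def word_length_histogram(text):
    """Return the relative frequency of each word length in `text`."""
    counts = Counter(len(w) for w in _words(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: counts[length] / total for length in sorted(counts)}


def mean_word_length(text):
    """Return the average word length, or 0.0 for text without words."""
    words = _words(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0
```

Two texts can then be compared either through `mean_word_length` or by looking at how their histograms differ at each word length.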
4. Proposed System / Approach
• Collecting a suitable corpus of text messages to be the dataset.
• Identifying features that are significant indicators of gender.
• Extracting feature values from each message automatically.
• Building a classification model to identify the author's gender of a candidate text
message.
Figure 4.1: Gender Identification Process
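The feature extraction step of the approach could look roughly like the following Python sketch. It is not the report's actual feature set: the tiny function-word list, the feature names and the `extract_features` helper are all hypothetical and chosen only to show how a text is turned into numeric values that a classifier can consume.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch; the report does not list the function words it used).
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "i", "you", "we")


def extract_features(text):
    """Map a text to a dict of simple stylometric feature values."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty text
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "exclamation_rate": text.count("!") / n_words,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features
```

Each blog post would be passed through such a function, and the resulting feature vectors, together with the known gender labels, would form the training data for the classification model.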
5. Dataset
We will be using the dataset from the proceedings of PAN 2013 and 2014. The 2013
dataset consists of blog posts, while the 2014 dataset also includes tweets. The original
use of this dataset was for the problem of Author Profiling, more specifically determining
the author's age and gender.
Dataset link: http://pan.webis.de/
6. Evaluation and Analysis
• Training Phase: The classifier was trained with 4 different numbers of blogs:
50, 100, 200 and 500.
• Testing Phase: In each case, 70% of the blogs were used for training and 30% for
testing.
Corpus    Training    Testing    Accuracy
100       70          30         70.37%
200       140         60         70%
260       184         76         68.94%
500       350         150        69.76%
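The 70/30 split and the accuracy figure can be illustrated with a short Python sketch. The helper names, the fixed random seed and the use of shuffling before the split are assumptions for the example; the report does not say how the blogs were assigned to the training and testing sets.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle `items` and return (training, testing) with a 70/30 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = len(shuffled) * 70 // 100        # integer arithmetic, no rounding drift
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Return the percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

For a corpus of 100 blogs, `split_70_30` yields 70 training and 30 testing items, and `accuracy` is then computed only on the predictions for the testing items.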
7. Conclusion and Future Work
By designing appropriate psycho-linguistic and gender-linked features, we observe
that word-based features, function words and structural features play important roles in
gender identification. Experimental results indicate that identification performance
improves with the number of text documents in the training dataset as well
as with the number of words in each document (e-mail). We find that there are significant
differences between men and women in personal writings such as e-mails, and that gender
differences also exist between authors of news articles, even though neutral language is
dominant there.