This document summarizes a project on gender detection in blogs. The project aimed to identify whether a blog post was written by a male or female author based on linguistic features. Three machine learning algorithms were tested on a dataset of blog posts and tweets, with support vector machines achieving the highest accuracy of 85.1% at gender identification. The proposed approach involved collecting a dataset, identifying gender-indicative features, extracting feature values, and building a classification model. Evaluation on different sized training sets showed accuracy improved with more training data. The conclusion was that word-based, functional, and structural features help identify gender, and performance increases with more training documents and words per document.
Gender Detection in Blogs
(Project Number - 17)
A Project Report
Submitted by
Group Number - 37
Subba Reddy 201406632
Rashmi Sharma 201405581
Abhijeet Thakur 201264203
Guided by
Dr. Vasudev Verma
Mentored by
Vishrut Mehta
For the course
Information Retrieval and Extraction
IIIT, Hyderabad
April, 2015
1. Abstract
The question addressed in this paper is: given a short text document, can we identify
whether the author is a man or a woman? This question is motivated by recent events in which
people faked their gender on the Internet. Note that this is different from the authorship
attribution problem.
Three machine learning algorithms (support vector machine, Bayesian logistic regression
and AdaBoost decision tree) are designed for gender identification based on 545
psycho-linguistic and gender-preferential cues along with stylometric features.
Of these three, the support vector machine gives the highest accuracy, 85.1%, in
gender identification.
2. Project Scope
The goal of this project is, given a blog, to analyze the specific features in
the text that indicate whether it was written by a male or a female.
The features can be anything. For example, a blog about dresses or cats
may have been written by a female, and a blog about sports, suits, etc. may have been
written by a male. In this project, however, the analysis should also cover the salient features that
differentiate the text content itself, and not rely merely on the topic of the text.
3. Related Systems
Authorship identification : Authorship is calculated by determining if one piece
of text contained significantly longer words than another. Histograms of word-
length distribution were also used for the same purpose.
Gender Guesser : This tool attempts to determine an author's gender based on
the words used. Submitted text is evaluated based on two types of writing: formal
and informal. Formal writing includes fiction and non-fiction stories, articles, and
news reports. Informal writing includes blog and chat-room text.
Author gender identification from text : In one study, researchers presented
a group of lexical, syntactic and pragmatic features that distinguish the
language style of women, namely, the use of specialized vocabulary, expletives, tag.
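As an illustration of the word-length idea mentioned under authorship identification, the following is a minimal sketch, not the method of any of the systems above. The function names and the simple regular-expression tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into alphabetic word tokens (a simplifying assumption)."""
    return re.findall(r"[A-Za-z']+", text)


def word_length_histogram(text):
    """Return the relative frequency of each word length found in `text`."""
    counts = Counter(len(word) for word in tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in sorted(counts.items())}


def mean_word_length(text):
    """Return the average word length in characters, or 0.0 for empty text."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)
```

Two texts can then be compared by their mean word length or by the shape of their histograms.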
4. Proposed System / Approach
Collecting a suitable corpus of text messages to be the dataset.
Identifying features that are significant indicators of gender.
Extracting feature values from each message automatically.
Building a classification model to identify the gender of the author of a candidate text
message.
Figure 4.1: Gender Identification Process
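The feature extraction step of this process can be sketched as follows. This is only an illustrative example: the report does not list its 545 features, so the short `FUNCTION_WORDS` list, the function name `extract_features` and the three extra features are hypothetical stand-ins for word-based, function-word and structural cues.

```python
import re

# Hypothetical, deliberately tiny function-word list (not the report's feature set).
FUNCTION_WORDS = ("the", "a", "of", "and", "i", "you", "we", "not")


def extract_features(text):
    """Map a text to a fixed-length list of numeric feature values."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    if n_words == 0:
        return [0.0] * (len(FUNCTION_WORDS) + 3)
    # Function-word features: relative frequency of each listed word.
    features = [words.count(fw) / n_words for fw in FUNCTION_WORDS]
    # Word-based feature: average word length in characters.
    features.append(sum(len(w) for w in words) / n_words)
    # Structural features: average words per sentence and sentence count.
    features.append(n_words / len(sentences))
    features.append(float(len(sentences)))
    return features
```

Each blog post then becomes one numeric vector, which could be passed together with its gender label to a classifier such as a support vector machine.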
5. Dataset
We use the dataset from the proceedings of PAN 2013 and 2014. The 2013
dataset comprises blog posts, while the 2014 dataset also includes tweets. The original
use of this dataset was for the problem of Author Profiling; more specifically, determining
the author's age and gender.
Dataset link: http://pan.webis.de/
6. Evaluation and Analysis
Training Phase : The classifier was trained with 4 different numbers of blogs :
50, 100, 200 and 500.
Testing Phase : In each case, 70% of the blogs were used for training and 30% for
testing.
Corpus   Training   Testing   Accuracy
100      70         30        70.37%
200      140        60        70%
260      184        76        68.94%
500      350        150       69.76%
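The 70/30 split and the accuracy figure can be computed along the following lines. This is a hedged sketch: the report does not say how the blogs were shuffled or split, so the fixed seed, the integer rounding and the function names are assumptions.

```python
import random


def train_test_split(items, train_percent=70, seed=0):
    """Shuffle `items` and split them into training and testing lists."""
    items = list(items)
    # A fixed seed keeps the split reproducible (an assumption, not from the report).
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_percent // 100
    return items[:n_train], items[n_train:]


def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For corpus sizes of 100, 200 and 500 this split yields 70/30, 140/60 and 350/150 documents, the same counts as the corresponding rows of the table.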
7. Conclusion and Future Work
By designing appropriate psycho-linguistic and gender-linked features, we observe
that word-based features, function words and structural features play important roles in
gender identification. Experimental results indicate that the identification performance
is improved by increasing the number of text documents in the training dataset as well
as the number of words in each document (e-mail). We find that there are significant
differences between men and women in personal writings such as e-mails, and gender
differences also exist between authors of news articles even though neutral language is
dominant there.