These are the slides of the overview of the seventh Author Profiling task at PAN (CLEF 2019), presented in Lugano. This year's task aimed at discriminating bots from humans in Twitter accounts and, in the case of humans, between males and females.
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling in Twitter
1. 7th Author Profiling task at PAN
Bots and Gender Profiling in Twitter
PAN-AP-2019 CLEF 2019
Lugano, 09-12 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
2. Introduction
Author profiling aims at identifying
personal traits such as age, gender,
personality, native language, and
language variety from a person's writings.
This is crucial for:
- Marketing.
- Security.
- Forensics.
Author
Profiling
PAN19
3. Task goal
Given a Twitter feed, determine whether its author is a bot
or a human; in the case of a human, identify his or her gender.
Two languages: English and Spanish.
5. Corpus
Each author (bot or human) comprises exactly 100 tweets.
             (EN) English                    (ES) Spanish
             Bots   Humans         Total    Bots   Humans         Total
                    F      M                       F      M
Training     1,440    720    720   2,880    1,040    520    520   2,080
Development    620    310    310   1,240      460    230    230     920
Total        2,060  1,030  1,030   4,120    1,500    750    750   3,000
Test         1,320    660    660   2,640      900    450    450   1,800
Total        3,380  1,690  1,690   6,760    2,400  1,200  1,200   4,800
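For readers who want to work with a corpus like this, a loader can be sketched. It assumes the typical PAN distribution layout (one XML file per author holding its 100 tweets, plus a truth.txt mapping author ids to labels); the `document` tag name and the `:::`-separated truth format are assumptions, not details stated on the slide:

```python
# Sketch of a corpus loader, assuming the typical PAN layout:
# one XML file per author (100 tweets each) plus a truth file.
# The <document> tag and the ":::" separator are assumptions.
import xml.etree.ElementTree as ET

def parse_author(xml_text):
    """Return the list of tweets from one author's XML file."""
    root = ET.fromstring(xml_text)
    return [doc.text for doc in root.iter("document")]

def parse_truth(truth_text):
    """Map author id -> (bot/human, gender) from a truth.txt string."""
    labels = {}
    for line in truth_text.strip().splitlines():
        author_id, kind, gender = line.split(":::")
        labels[author_id] = (kind, gender)
    return labels
```

The two functions together yield, per author, a list of tweets and a pair of labels, which is all the task needs as input.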
6. Corpus
Four classes of bots:
TEMPLATE  The Twitter feed follows a predefined structure or template, for example
          an account reporting the state of earthquakes in a region or job offers
          in a sector.
FEED      The Twitter feed retweets or shares news about a predefined topic, for
          example Trump's policies.
QUOTE     The Twitter feed reproduces quotes from famous books or songs, quotes
          from celebrities or historical figures, or jokes.
ADVANCED  Twitter feeds whose language is generated with more elaborate
          technologies such as Markov chains or metaphors, or, in some cases, by
          randomly choosing and merging texts from large corpora.
9. Baselines
MAJORITY      A statistical baseline that always predicts the majority class in the
              training set. With balanced classes, it predicts one of them.
RANDOM        A baseline that generates predictions at random among the different
              classes.
CHAR N-GRAMS  Character n-grams with n from 1 to 10, keeping the 100, 200, 500,
              1,000, 2,000, 5,000, and 10,000 most frequent ones.
WORD N-GRAMS  Word n-grams with n from 1 to 10, keeping the 100, 200, 500, 1,000,
              2,000, 5,000, and 10,000 most frequent ones.
W2V           Texts are represented with two word-embedding models: (i) Continuous
              Bag of Words (CBOW); and (ii) Skip-grams.
LDSE          This method represents documents on the basis of the probability
              distribution of occurrence of their words in the different classes.
              The key concept of LDSE is a weight representing the probability that
              a term belongs to one of the categories: human/bot, male/female. The
              distribution of weights for a given document should be closer to the
              weights of its corresponding category. LDSE takes advantage of the
              whole vocabulary.
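The LDSE weight described above can be written compactly. This is a sketch reconstructed from the prose (the notation is ours, not necessarily the authors' exact formulation):

```latex
W(t, c) \;=\; \frac{\mathrm{tf}(t, c)}{\sum_{c' \in \mathcal{C}} \mathrm{tf}(t, c')}
```

where tf(t, c) is the frequency of term t in the training documents of category c, and C is the set of categories (human/bot, or male/female). A document is then assigned to the category whose weight distribution its own terms' weights most resemble.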
33. Best results at PAN'19
Johansson
- Stylistic features
- Random Forest
Valencia
- n-grams
- Logistic Regression
Pizarro
- n-grams
- Support Vector Machines
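The prevailing recipe above, n-grams fed to a linear SVM, can be sketched with scikit-learn. This is a minimal illustration, not any participant's actual system; the analyzer choice, n-gram range, feature cap, and C value are assumed for demonstration:

```python
# Minimal sketch of the prevailing n-grams + SVM recipe.
# All hyperparameters here are illustrative assumptions, not the
# settings used by Pizarro or the other participants.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # Character n-grams are robust to Twitter's noisy spelling.
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 5),
                              max_features=10_000)),
    ("svm", LinearSVC(C=1.0)),
])

# Usage: one string per author (its 100 tweets concatenated), with one
# "bot"/"human" (or "male"/"female") label per author:
#   pipeline.fit(train_feeds, train_labels)
#   predictions = pipeline.predict(test_feeds)
```

The same pipeline serves both subtasks by swapping the label set, which may explain why n-grams + SVM appeared so often among the top runs.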
34. Conclusions
Several approaches to tackle the task:
n-Grams + SVM prevailing.
Best results in bots vs. humans:
Over 84% on average (EN 86.15%; ES 84.08%).
English (95.95%): Johansson - Stylistic features + Random Forest.
Spanish (93.33%): Pizarro - n-Grams + SVM.
Best results in gender identification:
Over 70% on average (EN: 72.79%; ES: 70.17%).
English (84.17%): Valencia - n-Grams + Logistic Regression.
Spanish (81.72%): Pizarro - n-Grams + SVM.
Error analysis:
Highest confusion from bots to humans (17.15% vs. 7.86% EN; 14.45% vs. 14.08% ES).
...mainly towards males (9.83% vs. 7.53% EN; 8.50% vs. 5.02% ES).
...males more confused with bots (8.85% vs. 3.55% EN; 18.93% vs. 11.61% ES).
Within genders:
EN: males to females (27.56%) vs. females to males (26.67%).
ES: males to females (21.03%) vs. females to males (11.61%).
Error per bot type:
Advanced bots: 30.11% EN; 32.38% ES.
EN: quote (12.64%); template (17.94%); feed (27.89%).
ES: quote (26.51%); template (13.20%); feed (14.28%).
Mainly towards males, except quote bots in ES (6.75% vs. 15.29% towards males).
35. Conclusions
Looking at the results, we can conclude:
It is feasible to automatically identify bots on Twitter with
high precision,
...even when only textual features are used.
There are specific cases where the task is difficult due to:
...the language used by the bots (e.g., advanced bots).
...the way humans use the platform (e.g., to share news).
In both cases, although the precision is high, a major effort
needs to be made to take false positives into account.