7th Author Profiling task at PAN
Bots and Gender Profiling
in Twitter
PAN-AP-2019 CLEF 2019
Lugano, 09-12 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
Introduction
Author profiling aims at identifying
personal traits such as age, gender,
personality traits, native language, or
language variety from writings.
This is crucial for:
- Marketing.
- Security.
- Forensics.
2
Author
Profiling
PAN19
Task goal
Given a Twitter feed, determine whether
its author is a bot or a human. In case of
human, identify her/his gender.
Two languages: English and Spanish
Corpus
BOTS
- Existent datasets: Varol, Cresci...
- Newly discovered: "I'm a bot"
- Still exists? NO: DISCARDED. YES: manual annotation.
- Manual annotation: NO: DISCARDED. YES: INCLUDED.
HUMANS
- Selected from PAN-AP'17 + manual annotation
Corpus
Each author (bot or human) is represented by exactly 100 tweets.

                            (EN) English                      (ES) Spanish
                      Bots  Humans F  Humans M  Total   Bots  Humans F  Humans M  Total
Training             1,440       720       720  2,880  1,040       520       520  2,080
Development            620       310       310  1,240    460       230       230    920
Total (train + dev)  2,060     1,030     1,030  4,120  1,500       750       750  3,000
Test                 1,320       660       660  2,640    900       450       450  1,800
Total (all)          3,380     1,690     1,690  6,760  2,400     1,200     1,200  4,800
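The counts in the corpus table can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the dictionary layout and the function names (`check_split`, `language_total`) are illustrative choices, not part of the task material. It confirms that every split is balanced (bots = humans, females = males) and that the splits add up to the overall totals.

```python
# Counts transcribed from the corpus table: (bots, female, male, total).
CORPUS = {
    "EN": {
        "training": (1440, 720, 720, 2880),
        "development": (620, 310, 310, 1240),
        "test": (1320, 660, 660, 2640),
    },
    "ES": {
        "training": (1040, 520, 520, 2080),
        "development": (460, 230, 230, 920),
        "test": (900, 450, 450, 1800),
    },
}


def check_split(bots, female, male, total):
    """Return True if a split is balanced and its total adds up."""
    return (
        bots == female + male        # as many bots as humans
        and female == male           # as many females as males
        and total == bots + female + male
    )


def language_total(lang):
    """Sum the three splits of one language, column by column."""
    return tuple(sum(column) for column in zip(*CORPUS[lang].values()))


for lang, splits in CORPUS.items():
    assert all(check_split(*row) for row in splits.values())
    print(lang, language_total(lang))
```

Because all classes are balanced, a majority-class predictor cannot do better than chance on either subtask.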
Corpus
Four classes of bots:
TEMPLATE: The Twitter feed follows a predefined structure or template, for
example an account reporting the state of the earthquakes in a region or
job offers in a sector.
FEED: The Twitter feed retweets or shares news about a predefined topic, for
example Trump's policies.
QUOTE: The Twitter feed reproduces quotes from famous books or songs, quotes
from celebrities or historical figures, or jokes.
ADVANCED: Twitter feeds whose language is generated with more elaborate
techniques such as Markov chains, metaphors, or, in some cases, randomly
choosing and merging texts from large corpora.
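To give an idea of what a Markov-chain text generator of the ADVANCED kind might look like, here is a minimal first-order word-level sketch. It is not the code of any bot in the corpus; the function names, the toy sentences, and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, seed=0):
    """Walk the chain from `start`, picking a random observed successor each step."""
    rng = random.Random(seed)
    words = [start]
    # Stop at the length limit or when the last word has no known successor.
    while len(words) < max_words and chain.get(words[-1]):
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


# Toy training texts, invented for this example.
toy_texts = [
    "the bot writes a tweet",
    "the human reads a tweet",
    "a tweet about the news",
]
print(generate(build_chain(toy_texts), "the"))
```

Every adjacent word pair in the output was seen in the training texts, which is why such feeds can look locally fluent while drifting globally.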
Corpus
For example, the bot
@metaphormagnet
was developed by
Tony Veale and Guofu Li
to automatically generate
metaphorical language.
Evaluation measures
The accuracy is calculated per language and task:
- Bot or human? -> acc
- If human: female or male? -> acc
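The two accuracies could be computed along these lines for one language. This is a sketch of one possible reading of the diagram, not the official evaluation script: the label names (`bot`, `female`, `male`) and the choice to score gender only over authors whose true label is human (so a human predicted as a bot counts as a gender error) are assumptions.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def author_type(label):
    """Collapse 'female' and 'male' into 'human'; keep 'bot' as is."""
    return "bot" if label == "bot" else "human"


def evaluate(gold, predicted):
    """Return (bot-vs-human accuracy, gender accuracy) for one language."""
    type_acc = accuracy(
        [author_type(g) for g in gold],
        [author_type(p) for p in predicted],
    )
    # Assumption: gender is scored only on authors who are truly human.
    human_pairs = [(g, p) for g, p in zip(gold, predicted) if g != "bot"]
    gender_acc = accuracy(
        [g for g, _ in human_pairs],
        [p for _, p in human_pairs],
    )
    return type_acc, gender_acc


gold = ["bot", "bot", "female", "male"]
predicted = ["bot", "female", "female", "bot"]
print(evaluate(gold, predicted))
```

The function would be called once per language (EN, ES), giving four accuracy values in total.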
Baselines
MAJORITY: A statistical baseline that always predicts the majority class in the training set. In
case of balanced classes, it predicts one of them.
RANDOM: A baseline that randomly generates the predictions among the different classes.
CHAR N-GRAMS: With values for n from 1 to 10, and selecting the 100, 200, 500, 1,000, 2,000,
5,000 and 10,000 most frequent ones.
WORD N-GRAMS: With values for n from 1 to 10, and selecting the 100, 200, 500, 1,000, 2,000,
5,000 and 10,000 most frequent ones.
W2V: Texts are represented with two word embedding models: (i) Continuous
Bag of Words (CBOW); and (ii) Skip-Grams.
LDSE: This method represents documents on the basis of the probability distribution of
occurrence of their words in the different classes. The key concept of LDSE is a
weight, representing the probability of a term to belong to one of the different
categories: human / bot, male / female. The distribution of weights for a given
document should be closer to the weights of its corresponding category. LDSE
takes advantage of the whole vocabulary.
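The feature extraction behind the CHAR N-GRAMS baseline can be sketched as follows: count all character n-grams in the training texts, keep the k most frequent, and represent each text by its counts over that vocabulary. The function names and the toy inputs are illustrative assumptions; the slides do not say which classifier or weighting the baseline used, so none is shown.

```python
from collections import Counter


def char_ngrams(text, n_min, n_max):
    """Yield every character n-gram of `text` for n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def top_k_vocabulary(documents, k, n_min=1, n_max=3):
    """Return the k most frequent character n-grams over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n_min, n_max))
    return [gram for gram, _ in counts.most_common(k)]


def vectorize(text, vocabulary, n_min=1, n_max=3):
    """Represent `text` as raw counts of each vocabulary n-gram."""
    counts = Counter(char_ngrams(text, n_min, n_max))
    return [counts[gram] for gram in vocabulary]


# Toy documents, invented for this example.
vocabulary = top_k_vocabulary(["aaa", "ab"], k=2, n_min=1, n_max=2)
print(vocabulary, vectorize("aab", vocabulary, n_min=1, n_max=2))
```

For the settings on the slide, `n_max` would go up to 10 and `k` would range over 100 to 10,000; the word n-gram variant works the same way over token sequences instead of characters.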
56 participants
46 working notes
26 countries
https://mapchart.net/world.html
Approaches
Approaches - Preprocessing
Twitter elements (URLs, users, hashtags, ...): Van Halteren; Vogel; Polignano; Giachanou; Gishamer; Puertas; Saeed; Petrik; Valencia; Onose; Babaei; Yacob; Zhechev; Mahmood
Word segmentation: Gishamer; Joo
Tokenisation: Van Halteren; Polignano; Gishamer; Joo; Bacciu; Petrik; Goubin; Zhechev; Mahmood
Stemming / lemmatisation: Ikae; Joo; Saeed; Bacciu; Basile; Petrik; Babaei; Goubin; Zhechev
Punctuation marks: Vogel; Saeed; Onose; Ribeiro; Goubin; Yacob; Zhechev
Lowercase: Van Halteren; Vogel; Giachanou; Saeed; Ribeiro
Stopwords: Joo; Saeed; Babaei; Zhechev
Character flooding: Vogel; Gishamer; Goubin
LSA: Rakesh
Short words: Vogel
Infrequent words: Ikae; Gishamer
Contractions and acronyms: Joo; Saeed
Approaches - Features
Stylistic features: Joo; Goubin; Ashraf; Cimino; Oliveira; Ikae; De la Peña; Johansson; Giachanou; Martinc; Przybyła; Van Halteren; Fernquist
- Number of occurrences
- Verbs, adjs, pronouns
- Number of hashtags, mentions, URLs...
- Capital vs. lower letters
- Punctuation marks
- ...
N-gram models: Ispas; Bounaama; Rakesh; Valencia; Mahmood; Fahim; Espinosa; Pizarro; Martinc; Dias; Vogel; Giachanou; De la Peña; Babaei; Saeed; Joo; Bacciu; Johansson; Fernquist; HaCohen; Gishamer
Emotional features: Cimino; Giachanou; Oliveira
Lexicon-based features: Gamallo
Compression algorithms: Fernquist
DNA-based approach: Kosmajac
Embeddings: Polignano; Fagni; Halvani; Onose; López-Santillán; Staykovski; Joo
Approaches - Methods
SVM: Vogel; Cimino; Fagni; Pizarro; Jimenez; HaCohen; Bacciu; Goubin; Srinivasarao; Mahmood; Yacob; Ribeiro; Babaei; Rakesh; Gishamer; Moryossef; Giachanou
Logistic regression: Gishamer; Moryossef; Valencia; Bolonyai; Przybyła
CatBoost: Fernquist
SpaCy: Moryossef
kNN: Ikae
Random Forest: Moryossef; Johansson
Multilayer Perceptron: Staykovski
SGD: Giachanou; Bounaama
RNN: Dias; Petrik; Bolonyai; Onose
Decision Trees: Saeed
CNN: Dias; Petrik; Polignano; Farber
Multinomial BayesNet: Saeed
BERT: Joo
Naive Bayes: Gamallo
Feedforward NN: Halvani; De la Peña
AdaBoost: Bacciu
LSTM: Zhechev
Global ranking
[Tables: global ranking (3 slides)]
Confusion matrices
[Figures: confusion matrices for ENGLISH and SPANISH (2 slides)]
Errors per bot type
[Figures: errors per bot type (5 slides)]
Bot to Human per Gender Errors
[Figures: bot to human per gender errors (4 slides)]
Human to Bot Errors
[Figures: human to bot errors (4 slides)]
Best results at PAN'19
Johansson: stylistic features + Random Forest
Valencia: n-grams + Logistic Regression
Pizarro: n-grams + Support Vector Machines
Conclusions
- Several approaches to tackle the task:
  - n-Grams + SVM prevailing.
- Best results in bots vs. human:
  - Over 84% on average (EN: 86.15%; ES: 84.08%).
  - English (95.95%): Johansson - Stylistic features + Random Forest.
  - Spanish (93.33%): Pizarro - n-Grams + SVM.
- Best results in gender identification:
  - Over 70% on average (EN: 72.79%; ES: 70.17%).
  - English (84.17%): Valencia - n-Grams + Logistic Regression.
  - Spanish (81.72%): Pizarro - n-Grams + SVM.
- Error analysis:
  - Highest confusion from bots to humans (17.15% vs. 7.86% EN; 14.45% vs. 14.08% ES).
  - ...mainly towards males (9.83% vs. 7.53% EN; 8.50% vs. 5.02% ES).
  - ...males more confused with bots (8.85% vs. 3.55% EN; 18.93% vs. 11.61% ES).
  - Within genders:
    - EN: males to females (27.56%) vs. females to males (26.67%).
    - ES: males to females (21.03%) vs. females to males (11.61%).
- Error per bot type:
  - Advanced bots: 30.11% EN; 32.38% ES.
  - EN: quote (12.64%); template (17.94%); feed (27.89%).
  - ES: quote (26.51%); template (13.20%); feed (14.28%).
  - Mainly towards males, except quote bots in ES (6.75% vs. 15.29% towards males).
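Directional error rates such as "bots to humans" or "males to bots" can be read off a list of gold and predicted labels. The sketch below assumes each rate is normalised by the number of authors in the source class; the slides do not state the normalisation used, and the function name, label names, and toy data are invented for illustration.

```python
def confusion_rate(gold, predicted, sources, targets):
    """Share of authors whose gold label is in `sources`
    and whose predicted label is in `targets`."""
    source_predictions = [p for g, p in zip(gold, predicted) if g in sources]
    if not source_predictions:
        return 0.0
    hits = sum(p in targets for p in source_predictions)
    return hits / len(source_predictions)


# Toy labels, invented for this example.
gold = ["bot"] * 4 + ["female"] * 2 + ["male"] * 2
predicted = ["bot", "bot", "male", "female", "female", "bot", "male", "male"]

HUMAN = {"female", "male"}
print("bots -> humans:", confusion_rate(gold, predicted, {"bot"}, HUMAN))
print("bots -> males: ", confusion_rate(gold, predicted, {"bot"}, {"male"}))
print("humans -> bots:", confusion_rate(gold, predicted, HUMAN, {"bot"}))
```

Splitting the source class further, for example by bot type, would give a per-type error breakdown in the same way.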
Conclusions
Looking at the results, we can conclude:
- It is feasible to automatically identify bots in Twitter with high precision...
  - ...even when only textual features are used.
- There are specific cases where the task is difficult due to:
  - ...the language used by the bots (e.g., advanced bots).
  - ...the way the humans use the platform (e.g., to share news).
In both cases, although the precision is high, a major effort
needs to be made to take into account false positives.
Industry at PAN (Author Profiling)
Organisation
Sponsors
2020 -> FAKE news spreadeRS
On behalf of the author profiling task organisers:
Thank you very much for participating
and hope to see you next year!!