8th Author Profiling task at PAN
Profiling Fake News Spreaders on Twitter
PAN-AP-2020 CLEF 2020
Online, 22-25 September
Francisco Rangel
Symanto Research
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de Valencia
Bilal Ghanem
Symanto Research
Anastasia Giachanou
PRHLT Research Center
Universitat Politècnica de Valencia
Introduction
Author profiling aims at identifying personal traits such as age, gender, personality traits, native language, language variety… from writings.
This is crucial for:
- Marketing.
- Security.
- Forensics.
Author
Profiling
PAN’20
Task goal
Given a Twitter feed, determine whether its author is keen to spread fake news or not.
Two languages: English and Spanish.
Corpus
(EN) English
            Keen to spread fake news   Not keen to spread fake news   Total
Training    150                        150                            300
Test        100                        100                            200
Total       250                        250                            500

(ES) Spanish
            Keen to spread fake news   Not keen to spread fake news   Total
Training    150                        150                            300
Test        100                        100                            200
Total       250                        250                            500
Methodology
1. Selection of fake news from Politifact and Snopes related sites (+ manual review).
2. Collection of tweets responding to the previous news:
2.1. Manual inspection to ensure that the tweet refers to the news.
2.2. Manual annotation of those tweets supporting vs. rejecting the news.
3. Timeline collection
3.1. Manual review of the tweets to label the fake ones.
3.2. Users with one or more fake tweets are labelled as keen to spread fake news. Otherwise, they are not.
3.3. Removal of tweets referring explicitly to the fake news (to avoid bias).
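Steps 3.1 to 3.3 can be sketched in a few lines. This is only an illustration under stated assumptions, not the organisers' tooling: the `Tweet` record, its `is_fake` and `refers_to_seed_news` flags and the `label_user` helper are hypothetical names standing in for the manual annotations described above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tweet:
    text: str
    is_fake: bool              # hypothetical flag: outcome of the manual review (step 3.1)
    refers_to_seed_news: bool  # hypothetical flag: tweet explicitly refers to the selected fake news


def label_user(timeline: List[Tweet]) -> Tuple[bool, List[str]]:
    """Return (keen_to_spread, texts kept in the corpus) for one user."""
    # Step 3.2: one or more fake tweets makes the user "keen to spread".
    keen_to_spread = any(t.is_fake for t in timeline)
    # Step 3.3: drop tweets that explicitly refer to the fake news, to avoid bias.
    kept = [t.text for t in timeline if not t.refers_to_seed_news]
    return keen_to_spread, kept


timeline = [
    Tweet("Nice weather today", False, False),
    Tweet("Shocking: <seed fake claim>", True, True),
    Tweet("Another unverified rumour", True, False),
]
label, texts = label_user(timeline)
```

In this toy timeline the user is labelled as keen to spread fake news, and the tweet that refers to the selected news item is left out of the texts kept for the corpus.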
Evaluation measures
The accuracy is calculated per language, and the two per-language accuracies are then averaged.
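A minimal sketch of this measure, assuming one gold label and one predicted label per author. The function names and the toy labels are illustrative and not part of the task's official evaluation script.

```python
from typing import Sequence


def accuracy(truth: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of authors whose predicted label matches the gold label."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


def averaged_accuracy(acc_en: float, acc_es: float) -> float:
    """Average of the English and Spanish accuracies."""
    return (acc_en + acc_es) / 2


# Toy labels, invented for the example.
acc_en = accuracy(["fake", "real", "fake", "real"], ["fake", "real", "real", "real"])
acc_es = accuracy(["fake", "real"], ["fake", "real"])
final_score = averaged_accuracy(acc_en, acc_es)
```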
Baselines
RANDOM: A baseline that randomly generates the predictions among the different classes.
LSTM: A Long Short-Term Memory neural network that uses fastText embeddings to represent texts.
CHAR N-GRAMS: Character n-grams with values for n from 2 to 6, with an SVM.
WORD N-GRAMS: Word n-grams with values for n from 1 to 3, with a neural network.
EIN: The Emotionally-Infused Neural (EIN) network, with word embeddings and emotional features as the input of an LSTM.
Symanto (LDSE): This method represents documents on the basis of the probability distribution of occurrence of their words in the different classes. The key concept of LDSE is a weight representing the probability of a term belonging to one of the categories: fake news spreader / non-spreader. The distribution of weights for a given document should be closer to the weights of its corresponding category. LDSE takes advantage of the whole vocabulary.
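The idea behind the LDSE weights can be illustrated with a simplified sketch. It is an assumption-laden toy, not the Symanto implementation: here a term's weight for a class is simply the share of its occurrences that fall in that class (raw counts), and a document is summarised by the mean weight of its known terms per class. The names `term_weights` and `represent` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def term_weights(docs: List[List[str]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """weights[term][cls] = share of the term's occurrences observed in class cls."""
    per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        per_class[label].update(tokens)
    classes = sorted(per_class)
    vocab = set().union(*(per_class[c].keys() for c in classes))
    weights = {}
    for term in vocab:
        total = sum(per_class[c][term] for c in classes)
        weights[term] = {c: per_class[c][term] / total for c in classes}
    return weights


def represent(tokens: List[str], weights: Dict[str, Dict[str, float]],
              classes: List[str]) -> Dict[str, float]:
    """Mean class weight over the document's known terms (simplified summary)."""
    known = [t for t in tokens if t in weights]
    if not known:
        return {c: 0.0 for c in classes}
    return {c: sum(weights[t][c] for t in known) / len(known) for c in classes}


# Toy training data, invented for the example.
docs = [["breaking", "shocking", "truth"], ["shocking", "hoax"],
        ["report", "truth"], ["report", "official"]]
labels = ["spreader", "spreader", "non-spreader", "non-spreader"]
weights = term_weights(docs, labels)
vector = represent(["shocking", "truth", "unseen"], weights, ["non-spreader", "spreader"])
```

In this toy run the document leans towards the spreader class because "shocking" only occurs in spreader documents, which mirrors the statement that a document's weights should be closer to those of its own category.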
Participation
66 participants
33 working notes
22 countries
https://mapchart.net/world.html
Approaches
Approaches - Preprocessing
Twitter elements (RT, VIA, FAV): Giglou; Hashemi; Pinnaparaju
Emojis and other non-alphanumeric chars: Buda; Pinnaparaju; Vogel; Giglou; Espinosa; Majumder; Lichouri; Shashirekha
Lemmatisation: Giglou; Hashemi; Lichouri; Shashirekha
Tokenisation: Vogel; Labadie; Fernández; Espinosa; Lichouri; Shashirekha; Baruah
Punctuation signs: Vogel; Koloski; Giglou; Espinosa; Hashemi; Lichouri; Shashirekha
Numbers: Pizarro; Vogel; Giglou; Espinosa; Hashemi; Shashirekha
Lowercase: Buda; Pizarro; Vogel; Pinnaparaju
Stopwords: Vogel; Koloski; Giglou; Espinosa; Hashemi; Lichouri; Shashirekha
Character flooding: Vogel; Labadie
Infrequent terms: Ikade
Short texts: Vogel
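Several of the preprocessing steps listed for this slide can be combined into one small function. This is a sketch under assumptions: the slides do not say whether teams removed or normalised each element, nor in which order, so the rules, their order and the `preprocess` name are illustrative choices only.

```python
import re
import string


def preprocess(tweet: str) -> str:
    """Apply a few of the listed steps to one tweet (assumed rules and order)."""
    text = tweet.lower()                                   # lowercase
    text = re.sub(r"\b(rt|via|fav)\b", " ", text)          # Twitter elements (RT, VIA, FAV)
    text = re.sub(r"\d+", " ", text)                       # numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)             # character flooding: 3+ repeats become 2
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation signs
    return " ".join(text.split())                          # collapse whitespace


cleaned = preprocess("RT Sooooo TRUE!!! 100 times via news")
```

Stopword removal, lemmatisation and emoji handling are left out because they need language-specific resources.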
Approaches - Features
Stylistic features:
- Number of occurrences
- Verbs, adjectives, pronouns
- Number of hashtags, mentions, URLs...
- Capital vs. lower-case letters
- Punctuation marks
- ...
Teams: Manna; Buda; Lichouri; Justin; Niven; Russo; Hörtenhuemer; Cardaioli; Spezanno; Ogaltsov; Labadie; Hashemi; Moreno-Sandoval
N-gram models: Pizarro; Espinosa; Vogel; Koloski; López-Fernández; Vijayasaradhi; Buda; Lichouri; Justin; Hörtenhuemer; Spezanno; Aguirrezabal; Shashirekha; Babaei; Labadie; Hashemi
Emotional and personality features: Justin; Niven; Russo; Hörtenhuemer; Espinosa; Cardaioli; Spezanno; Moreno-Sandoval
Embeddings: Justin; Hörtenhuemer; Aguirrezabal; Ogaltsov; Shashirekha; Babaei; Labadie; Hashemi; Chilet; Majumder
...BERT: Spezanno; Kaushik; Baruah; Chien
* 9 teams used the Symanto API to obtain psycholinguistic and/or emotional features.
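A few of the stylistic features named on this slide (hashtags, mentions, URLs, capital vs. lower-case letters, punctuation marks) can be counted with the standard library alone. The regular expressions and the `stylistic_features` name are assumptions made for illustration; individual teams may have defined these features differently.

```python
import re
import string
from typing import Dict


def stylistic_features(tweet: str) -> Dict[str, float]:
    """Count simple surface features of one tweet."""
    letters = [ch for ch in tweet if ch.isalpha()]
    upper = sum(ch.isupper() for ch in letters)
    return {
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "mentions": len(re.findall(r"@\w+", tweet)),
        "urls": len(re.findall(r"https?://\S+", tweet)),
        # Share of alphabetic characters that are upper case.
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "punctuation": sum(ch in string.punctuation for ch in tweet),
    }


features = stylistic_features("BREAKING: @user says #fake news!!! https://example.com/x")
```

Note that the punctuation count here also includes the `@`, `#` and URL characters. Whether to exclude them is a design choice the slides do not settle.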
Approaches - Methods
SVM: Pizarro; Vogel; Koloski; Espinosa; Fernández; Hashemi; Lichouri; Aguirrezabal; Fersini
Logistic regression: Buda; Vogel; Koloski; Hörtenhuemer; Pinnaparaju; Aguirrezabal; Manna
Random Forest: Cardaioli; Espinosa; Hashemi; Aguirrezabal; Sandoval; Manna
Ensembles: Ikade; Shrestha; Shashirekha; Niven
Multilayer Perceptron: Aguirrezabal
NN with Dense Layer: Baruah
Fully-Connected NN: Giglou
CNN: Chilet
LSTM: Majumder; Labadie
bi-LSTM: Saeed
Ensemble (GRU + CNN): Bakhteev
Global ranking
Confusion matrices
ENGLISH
SPANISH
Best results at PAN'20
Buda and Bolonyai
- n-Grams
- Stylistic features
- Logistic Regression ensemble
Pizarro
- word and char n-grams
- SVM
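Both winning systems rely on n-gram features. The sketch below shows how character and word n-gram counts can be extracted with the standard library. The n ranges in the example are arbitrary, since the slides do not state the ranges the winners used, and the classifier that would consume these counts (for example an SVM) is not shown.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int, n_max: int) -> Counter:
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text: str, n_min: int, n_max: int) -> Counter:
    """Count all whitespace-token n-grams with n_min <= n <= n_max."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


chars = char_ngrams("fake", 2, 3)
words = word_ngrams("fake news spreads fast", 1, 2)
```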
Conclusions
● Several approaches to tackle the task:
○ n-Grams + SVM prevailing.
● Results in English:
○ Over 67% accuracy on average.
○ Best (75%): Buda and Bolonyai - n-Grams + Stylistic features + Logistic Regression ensemble
● Results in Spanish:
○ Over 73% accuracy on average.
○ Best (82%): Pizarro - char & word n-Grams + SVM.
● Error analysis:
○ English:
■ False positives (real news spreaders as fake news spreaders): 35.50%
■ False negatives (fake news spreaders as real news spreaders): 30.03%
○ Spanish:
■ False positives (real news spreaders as fake news spreaders): 20.23%
■ False negatives (fake news spreaders as real news spreaders): 35.09%
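The two error types can be computed from gold and predicted labels as sketched below. One assumption is made explicit: each rate is taken relative to the size of the true class (false positives over real news spreaders, false negatives over fake news spreaders), which matches the wording in parentheses above but is not confirmed by the slides. The labels and the `error_rates` name are illustrative.

```python
from typing import Sequence, Tuple


def error_rates(truth: Sequence[str], predicted: Sequence[str],
                positive: str = "fake") -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for the positive class."""
    pairs = list(zip(truth, predicted))
    negatives = sum(t != positive for t, _ in pairs)
    positives = sum(t == positive for t, _ in pairs)
    if negatives == 0 or positives == 0:
        raise ValueError("both classes must be present in the gold labels")
    # Real news spreaders predicted as fake news spreaders.
    fp = sum(t != positive and p == positive for t, p in pairs)
    # Fake news spreaders predicted as real news spreaders.
    fn = sum(t == positive and p != positive for t, p in pairs)
    return fp / negatives, fn / positives


# Toy labels, invented for the example.
truth = ["real"] * 4 + ["fake"] * 4
predicted = ["fake", "real", "real", "real", "fake", "fake", "real", "real"]
fp_rate, fn_rate = error_rates(truth, predicted)
```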
Looking at the results, we can conclude:
● It is feasible to automatically identify Fake News Spreaders with high precision
○ ...even when only textual features are used.
● We have to bear in mind false positives since, especially in English, they amount to about one-third of the total predictions, and misclassification might have ethical or legal implications.
Industry at PAN (Author Profiling)
Organisation
Sponsors
This year, the winners of the task are (ex aequo):
● Jakab Buda and Flora Bolonyai, Eötvös Loránd University, Hungary
● Juan Pizarro, Chile
2021 -> Hate speech spreaders
On behalf of the author profiling task organisers:
Thank you very much for participating
and hope to see you next year!!
