Be Nice, Be Respectful:
Protecting Online Spaces with Applied
Machine Learning
Quora ML Workshop: Sock Puppets and Hoaxes on the Web
Sockpuppets and Hoaxes on the Web
An Army of Me: Sockpuppets in Online Discussion Communities. S. Kumar, J.
Cheng, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web
Conference, 2017 (WWW 2017). Best Paper Award Honorable Mention.
Disinformation on the Web: Impact, Characteristics, and Detection of
Wikipedia Hoaxes. S. Kumar, R. West, J. Leskovec and V.S. Subrahmanian.
Proceedings of World Wide Web Conference, 2016 (WWW 2016)
Srijan Kumar
@srijankedia
Computer Science, Stanford University
Joint work with Robert West, Justin Cheng, Jure Leskovec, V.S. Subrahmanian
Web: A platform for everyone
 Web enables social interaction
 Web is no longer a static library that
people passively browse
 Web is a place where people:
 Act as prosumers, i.e., content producers and content
consumers
 Interact with other people:
 Internet forums, Blogs, Social networks, Twitter,
Wikis, Podcasts, Slide sharing, Bookmark sharing, Product
reviews, Comments, 
Web allows...
but there's also a dark side
to the web
Time (2016); The Atlantic (2016); BBC (2015), Vanity Fair (2017), Digital Trends (2017)
Not everyone has good intentions
Web: Source of information
Web: Source of false information
This talk:
Modeling and Detection of
Misbehavior and Misinformation on
the Web
Challenges in analyzing malicious behavior
Data imbalance: a small proportion of behavior (< 10%) is malicious
Limited labels: little is known about malicious behavior
Deceptive behavior: malicious behavior tends to masquerade as benign
VS Subrahmanian and Srijan Kumar. Predicting Human Behavior: The Next Frontiers. Science 2017.
Sockpuppets in online discussions
An Army of Me: Sockpuppets in Online Discussion Communities. S. Kumar, J.
Cheng, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web
Conference, 2017 (WWW 2017). Best Paper Award Honorable Mention.
Eric_17 April 28 2013, 12AM
Thanks. I knew Marvel fans would try to flame me, but they
have nothing other than "oh that's your opinion" instead of
coming up with their own argument
Fellstrike April 29 2013, 6PM
Quit talking to yourself, **. Get back on your
meds if you're going to do that
bdiaz209 April 28 2013, 11PM
Possibly the best blog I've ever read major props to you
bdiaz209 posts only on this discussion to
support and defend Eric_17
Sockpuppets in Discussions
Data: Sockpuppets
2.9M users
2.1M articles
62M posts
Defining sockpuppets
No ground truth sockpuppet labels! (Surprise?!)
We adopt the definition currently used by Wikipedia, after
statistically validating it for our task:
Sockpuppets are accounts that post from the
same IP address in the same discussion very close
in time (15 min), in at least 3 different instances.
Note: we use the IP addresses for definition, but not detection
3,656
Sockpuppets
1,623
Puppetmasters
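The definition above can be sketched in code. This is a minimal illustration rather than the authors' pipeline: the `(user, ip, discussion_id, timestamp)` post schema and the name `find_sockpuppet_pairs` are assumptions, and an "instance" is interpreted here as a distinct discussion.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations

WINDOW = timedelta(minutes=15)  # "very close in time" threshold from the slide
MIN_INSTANCES = 3               # required number of co-posting instances


def find_sockpuppet_pairs(posts):
    """posts: iterable of (user, ip, discussion_id, timestamp); hypothetical schema."""
    # Group posts that share both an IP address and a discussion.
    groups = defaultdict(list)
    for user, ip, discussion, ts in posts:
        groups[(ip, discussion)].append((ts, user))

    # For each pair of accounts, record the discussions in which they
    # posted within the time window (assumption: one discussion = one instance).
    instances = defaultdict(set)
    for (ip, discussion), entries in groups.items():
        for (t1, u1), (t2, u2) in combinations(entries, 2):
            if u1 != u2 and abs(t1 - t2) <= WINDOW:
                instances[frozenset((u1, u2))].add(discussion)

    return {pair for pair, seen in instances.items() if len(seen) >= MIN_INSTANCES}
```

The function returns pairs of accounts; grouping connected pairs into one puppetmaster would be a separate step that is not shown here.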
Characteristics of sockpuppets
How to compare sockpuppets & ordinary users?
For each sockpuppet, match an
ordinary user who makes a
similar number of posts
on
similar discussions
We have to match!
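One way to build such a matched comparison group is sketched below. The talk does not say how the matching is done, so the greedy strategy, the Jaccard overlap measure, and the `user -> (num_posts, discussions)` schema are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of discussion ids, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_controls(sockpuppets, ordinary):
    """Both arguments map user -> (num_posts, set_of_discussion_ids); hypothetical schema."""
    matches, used = {}, set()
    for sock, (n_posts, discussions) in sockpuppets.items():
        candidates = [u for u in ordinary if u not in used]
        if not candidates:
            break
        # Prefer high discussion overlap, then the closest post count.
        best = max(
            candidates,
            key=lambda u: (
                jaccard(discussions, ordinary[u][1]),
                -abs(n_posts - ordinary[u][0]),
            ),
        )
        matches[sock] = best
        used.add(best)  # each ordinary user serves as a control at most once
    return matches
```

Matching without replacement keeps the two groups the same size, which makes the later comparisons between sockpuppets and ordinary users easier to read.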
Where do sockpuppets post?
Smoothzilla Feb 5 2013, 3PM
Thanks for your support!!!!
Falcon-X32 Feb 5 2013, 3PM
I agree. You are absolutely right!
jakey008 Feb 5 2013, 2PM
should have read the reviews first :(
ricobeans27 Feb 5 2013, 3PM
Couldn't agree more.
Interact more with each other (p < 10^-3)
Upvote each other more (p < 10^-3)
Relation between pair of sockpuppets
Do puppetmasters lead double lives?
Double life hypothesis:
Puppetmaster maintains distinct personality for the two sockpuppets
More similar Less similar
Ordinary Sockpuppet 1 Sockpuppet 2
Similarity is measured as cosine similarity between features of the users' posts: LIWC, sentiment,
number of words, etc.
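For reference, cosine similarity between two feature vectors can be computed as in the sketch below. The function name and the choice to return 0.0 for an all-zero vector are assumptions; building the vectors from LIWC, sentiment and word counts is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty feature vectors
    return dot / (norm_u * norm_v)
```

A value near 1 means the two accounts write in a similar way; a value near 0 means their post features have little in common.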
Alternate hypothesis:
Puppetmaster operates both sockpuppets similarly
Less similar More similar
Ordinary Sockpuppet 1 Sockpuppet 2
Both sockpuppets are more similar to
each other (p < 10^-3)
"Good sock/bad sock" is not common
Non-sockpuppet Sockpuppet 1 Sockpuppet 2
Why are sockpuppets created?
Only for deception?
Deceptiveness
[Figure: histogram of the number of pairs (y-axis) by Levenshtein distance between usernames (x-axis), for sock pairs and random pairs]
Non-pretenders, e.g. srijan and srijan2
Pretenders, e.g. srijan and theRealBatman
2/3 1/3
Hypothesis: Deceptive sockpuppets of the same master have very different usernames.
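The Levenshtein (edit) distance used to compare usernames can be computed with the standard dynamic-programming recurrence, sketched below. The function name is illustrative, and no pretender threshold is applied because the slides do not state one.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]
```

For the example usernames on the slide, `levenshtein("srijan", "srijan2")` is 1, since a single character is appended.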
srijan Feb 5 2013, 3PM
i agree.. these morons don't know a thing
theRealBatman Feb 5 2013, 3PM
YOU ARE STUPID AND A 
srijan Feb 5 2013, 2PM
best article i have read!!!
ricobeans27 Feb 5 2013, 3PM
But this article doesn't make any sense
More opinionated (p < 10^-3)
Swear more (p < 10^-3)
Downvoted and reported more (p < 10^-3)
Pretender vs Non-pretender Sockpuppets
How are sockpuppets used?
Do sockpuppets always support
one another?
Neutral sockpuppets
theRealBatman Feb 5 2013, 3PM
why so?
srijan Feb 5 2013, 3PM
best article ever!
We quantify the amount of support by counting assenting, negation and dissenting words from LIWC
60%
Neutral
Supporter sockpuppets
theRealBatman Feb 5 2013, 3PM
Totally agree!!
srijan Feb 5 2013, 3PM
best article ever!
We quantify the amount of support by counting assenting, negation and dissenting words from LIWC
60%
Neutral
30%
Supporter
Dissenter sockpuppets
60%
Neutral
30%
Supporter
10%
Dissenter
theRealBatman Feb 5 2013, 3PM
I don't think so
srijan Feb 5 2013, 3PM
best article ever!
We quantify the amount of support by counting assenting, negation and dissenting words from LIWC
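The word-counting idea can be illustrated with a toy classifier. LIWC is a licensed lexicon, so the tiny word lists below are stand-ins invented for this sketch, negation and dissent are merged into one list, and the majority-count rule is an assumption rather than the rule used in the study.

```python
import re

# Tiny illustrative word lists; the real LIWC categories are far larger.
ASSENT = {"agree", "yes", "right", "absolutely", "totally"}
DISSENT = {"disagree", "no", "not", "don't", "wrong", "never"}


def stance(reply):
    """Label a reply as 'supporter', 'dissenter' or 'neutral' by word counts."""
    words = re.findall(r"[a-z']+", reply.lower())
    assent = sum(w in ASSENT for w in words)
    dissent = sum(w in DISSENT for w in words)
    if assent > dissent:
        return "supporter"
    if dissent > assent:
        return "dissenter"
    return "neutral"
```

Applied to the three example replies on these slides, the sketch labels "Totally agree!!" as supporter, "I don't think so" as dissenter, and "why so?" as neutral.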
Supportiveness and Deceptiveness
[Figure: probability of being a pretender (y-axis, 0.0 to 1.0) by sockpuppet type; legend: Pretender, Non-pretender]
Dissenter: 0.58, 0.42
Neutral: 0.70, 0.30
Supporter: 0.74, 0.26
Deception is important to create
an illusion of public consensus
Detecting sockpuppets
Features
Post
Number of words,
characters, etc.,
LIWC counts,
Readability,
Sentiment,

Community
Number of upvotes and
downvotes,
Fraction of reported posts,
Is account reported,

Activity
Number of posts,
number of replies,
reciprocity of posts,
age of account,

Note: we are not using the IP based features
Is an account a sockpuppet?
AUC by feature set (baseline: 0.5)
Post: 0.57
Community: 0.54
Activity: 0.59
All: 0.68
Do two accounts belong to the same person?
AUC by feature set (baseline: 0.5)
Post: 0.80
Community: 0.56
Activity: 0.86
All: 0.91
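As a reminder of what the AUC numbers mean, the sketch below computes AUC directly from its definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name and inputs are illustrative, and the classifier that produces the scores is not shown.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (labels are 0 or 1)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

A scorer that ranks at random lands near 0.5, which is why 0.5 is the baseline in the charts; the quadratic loop is fine for a demonstration but too slow for large datasets.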
Sockpuppetry
Benign usage by
non-pretender sockpuppets:
primarily created to separate
interests; they are respectful,
operate similarly, and are
neutral towards each other
Malicious usage by
pretender sockpuppets:
primarily created to give
an illusion of consensus;
they are abusive, and support and
defend each other
Conclusion
Disinformation on the Web
Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia
Hoaxes. S. Kumar, R. West, J. Leskovec and V.S. Subrahmanian. Proceedings
of World Wide Web Conference, 2016 (WWW 2016)
Types of false information
Misinformation: honest mistake
Disinformation: deliberate lie to mislead
Hoax: deliberately fabricated falsehood made to masquerade as truth
Wikipedia
Why Wikipedia?
The free encyclopedia that anyone can edit
Easy to add (false)
information
 Freely accessible
 Large reach
 Major source of
information for
many
Hoaxes on Wikipedia
Why care about false information?
Results
 64 public successful hoaxes
 Pair with similar legitimate articles
 320 random hoax/non-hoax pairs x 10 raters on Mechanical Turk
If a hoax looks like a genuine Wikipedia article, it is
assumed to be credible.
Accurate detection needs non-appearance features.
If humans are good at identifying false information, then we
don't need to worry.
Random: 50%
Human: 66%
Classifier: 86%
Data: Wikipedia Hoaxes
Hoax article vs hoax facts
21,218 hoax articles
Hoax lifecycle:
Disinformation on the Web
Impact of hoaxes: Quantify their impact?
Characteristics of hoaxes: What are the hoaxes like?
Detection of hoaxes: Can we find them?
Impact of hoaxes
The worst hoaxes are those which
(a) last for a long time,
(b) receive significant traffic,
(c) are relied upon by credible news media.
Jimmy Wales on Quora
Most hoaxes are caught soon, but
some hoaxes (~1%) are impactful
along all three axes
Disinformation on the Web
Impact of hoaxes: Most hoaxes are caught soon, but some hoaxes are impactful!
Characteristics of hoaxes: What are the hoaxes like?
Detection of hoaxes: Can we find them?
Hoax:
Successful hoax: passes patrol, survives for a month, viewed frequently
Failed hoax: flagged and deleted during patrol
Non-hoax:
Wrongly flagged: temporarily flagged
Legitimate articles: never flagged
Characteristics of hoaxes
Hoax articles are longer, but they are mostly plain text
and have fewer web and wiki links.
Features:
o Plain-text length
o Plain-text-to-markup ratio
o Wiki-link density
o Web-link density
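The four appearance features could be computed from raw wikitext roughly as follows. This is a sketch under stated assumptions: the regular expressions cover only simple `[[wiki links]]` and `[http://... label]` links, the ratio is taken as plain-text length over total wikitext length, densities are links per word, and the feature names are invented here.

```python
import re

WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")    # [[Target]] or [[Target|label]]
WEB_LINK = re.compile(r"\[https?://[^\s\]]+(?: ([^\]]*))?\]")  # [http://url label]


def appearance_features(wikitext):
    """Approximate appearance features of an article from its wikitext."""
    n_wiki = len(WIKI_LINK.findall(wikitext))
    n_web = len(WEB_LINK.findall(wikitext))
    # Replace links by their visible label to approximate the plain text.
    plain = WIKI_LINK.sub(lambda m: m.group(2) or m.group(1), wikitext)
    plain = WEB_LINK.sub(lambda m: m.group(1) or "", plain)
    plain = re.sub(r"'{2,}", "", plain)  # strip bold/italic quote markup
    n_words = max(len(plain.split()), 1)
    return {
        "plain_text_length": len(plain),
        "plain_text_to_markup_ratio": len(plain) / max(len(wikitext), 1),
        "wiki_link_density": n_wiki / n_words,
        "web_link_density": n_web / n_words,
    }
```

Templates, tables and reference tags are ignored in this sketch; a real wikitext parser would be needed to handle them.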
Appearance: how the article looks
Link-network: how the article connects
Support: how other articles refer to it
Editor: who created the article
CC = 0: incoherent article
CC > 0: coherent article
Hoax articles are less coherent than non-hoax articles.
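The slide does not expand "CC". Assuming it stands for the clustering coefficient of the article's link neighbourhood, a minimal sketch is given below; the `links` dictionary and the function name are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(article, links):
    """links maps each article title to the set of titles it links to (hypothetical data).

    Returns the fraction of pairs of linked-to articles that are themselves
    connected by a link in either direction.
    """
    neighbours = sorted(links.get(article, set()) - {article})
    if len(neighbours) < 2:
        return 0.0
    connected = sum(
        1
        for a, b in combinations(neighbours, 2)
        if b in links.get(a, set()) or a in links.get(b, set())
    )
    possible = len(neighbours) * (len(neighbours) - 1) / 2
    return connected / possible
```

Under this reading, an article whose outgoing links point to unrelated pages scores 0 (incoherent), while one whose linked pages also link to each other scores above 0.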
Appearance: hoaxes mostly have text and few references.
Link-network: how the article connects
Support: how other articles refer to it
Editor: who created the article
Characteristics of hoaxes
Other articles rarely refer to a hoax article, compared
to non-hoax articles. When a reference does exist, it
was made recently, by the hoaxster or by an IP address
Features:
o Number of prior mentions
o Time since first mention
o Creator of first mention
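The three support features can be sketched as below. The input schema (a list of `(timestamp, editor)` mentions found in other articles), the choice to count only mentions made before the article was created, and the feature names are assumptions made for illustration.

```python
from datetime import datetime


def support_features(mentions, created_at):
    """mentions: list of (timestamp, editor) pairs; created_at: article creation time."""
    # Keep only mentions that predate the article, earliest first.
    prior = sorted(m for m in mentions if m[0] < created_at)
    if not prior:
        return {
            "num_prior_mentions": 0,
            "days_since_first_mention": None,
            "first_mention_creator": None,
        }
    first_ts, first_editor = prior[0]
    return {
        "num_prior_mentions": len(prior),
        "days_since_first_mention": (created_at - first_ts).days,
        "first_mention_creator": first_editor,
    }
```

A small count, a short gap, and a first mention added by the article's own creator or by an IP address would match the pattern described for hoaxes.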
Characteristics of hoaxes
Appearance: hoaxes mostly have text and few references.
Link-network: hoaxes have incoherent wikilinks.
Support: how other articles refer to it
Editor: who created the article
Hoax creators are more recently registered and
have less editing experience.
Features:
o Creator's account age
o Creator's experience
Characteristics of hoaxes
Appearance: hoaxes mostly have text and few references.
Link-network: hoaxes have incoherent wikilinks.
Support: hoaxes have few, recent, suspicious mentions.
Editor: who created the article
Disinformation on the Web
Impact of hoaxes: Most hoaxes are caught soon, but some hoaxes are impactful!
Characteristics of hoaxes: Hoaxes are different from non-hoaxes in many respects!
Detection of hoaxes: Can we find them?
Detection of hoaxes
Will a hoax get past patrol? AUC = 71% (appearance features)
Is an article a hoax? AUC = 98% (editor and network features)
Is an article flagged as a hoax really one? AUC = 86% (editor and support features)
We discovered new hoaxes!
Steve Moertel
American popcorn
entrepreneur
6 years 11 months!
Flagged by us, deleted by Wikipedia administrators
Overall Conclusions
 Misbehavior and misinformation are prevalent on the web, and
compromise safety and integrity
 Sockpuppets: Used for both benign and malicious purposes.
Malicious ones attempt to create an illusion of consensus
 Hoaxes: Successfully mislead readers, are impactful, and
have distinct characteristics
 Detection is important and feature engineering works
experimentally
You may also be interested in
 Tutorials on misbehavior and misinformation:
 Data-Driven Approaches towards Malicious Behavior Modeling. Jiang et al.,
SIGKDD 2017
 Antisocial Behavior on the Web: Characterization and Detection. Kumar et al.,
WWW 2017
 Vandals in Wikipedia
 VEWS: A Wikipedia Vandal Early Warning System. Kumar et al., SIGKDD 2015
 Language and deception
 Linguistic Harbingers of Betrayal: A Case Study on an Online Strategy Game.
Niculae et al., ACL 2015
 Social network algorithm for troll detection
 Accurately Detecting Trolls in Slashdot Zoo via Decluttering. Kumar et al.,
ASONAM 2014
More details at: http://cs.stanford.edu/~srijan
MIS2 workshop at WSDM 2018
Submit your awesome papers, tools, demos, and more!
Full research papers, short papers, works in progress, extended
abstracts are welcome!
MIS2: Misinformation and Misbehavior
Mining on the Web
Feb 9, 2018 at Los Angeles, CA
Held in conjunction with WSDM 2018
Submissions due: Nov 20, 2017
Notifications due: Dec 14, 2017
Website: http://snap.stanford.edu/mis2/
Quora ML Workshop: Sock Puppets and Hoaxes on the Web

  • 1. Be Nice, Be Respectful: Protecting Online Spaces with Applied Machine Learning
  • 3. Sockpuppets and Hoaxes on the Web An Army of Me: Sockpuppets in Online Discussion Communities. S. Kumar, J. Cheng, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web Conference, 2017 (WWW 2017). Best Paper Award Honorable Mention. Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes. S. Kumar, R. West, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web Conference, 2016 (WWW 2016) Srijan Kumar @srijankedia Computer Science, Stanford University Joint works with Robert West, Justin Cheng, Jure Leskovec, V.S. Subrahmanian
  • 4. Web: A platform for everyone
  • 5. Web enables social interaction Web is no longer a static library that people passively browse Web is a place where people: Act as prosumers, i.e., content producers and content consumers Interact with other people: Internet forums, Blogs, Social networks, Twitter, Wikis, Podcasts, 際際滷 sharing, Bookmark sharing, Product reviews, Comments, Web allows...
  • 6. but theres also a dark side to the web
  • 7. Time (2016); The Atlantic (2016); BBC (2015), Vanity Fair (2017), Digital Trends (2017) Not everyone has good intentions
  • 8. Web: Source of information
  • 9. Web: Source of false information
  • 10. Modeling and Detection of Misbehavior and Misinformation on the Web This talk:
  • 11. Challenges in analyzing malicious behavior Data imbalance Limited labels Deceptive behavior Smaller propotion of behavior (< 10%) is malicious Little known information about malicious behavior Malicious behavior tends to masquerade as benign VS Subrahmanian and Srijan Kumar. Predicting Human Behavior: The Next Frontiers. Science 2017.
  • 12. Sockpuppets in online discussions An Army of Me: Sockpuppets in Online Discussion Communities. S. Kumar, J. Cheng, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web Conference, 2017 (WWW 2017). Best Paper Award Honorable Mention.
  • 14. Eric_17 April 28 2013, 12AM Thanks. I knew Marvel fans would try to flame me, but they have nothing other than oh thats your opinion instead of coming up with their own argument Fellstrike April 29 2013, 6PM Quit talking to yourself, **. Get back on your meds if youre going to do that bdiaz209 April 28 2013, 11PM Possibly the best blog Ive ever read major props to you bdiaz209 posts only on this discussion to support and defend Eric_17 Sockpuppets in Discussions
  • 17. Defining sockpuppets No ground truth sockpuppet labels! (Surprise?!) We adopt currently used definition from Wikipedia, after statistical validation for our task, as follows: Sockpuppets are accounts that post from the same IP address in the same discussion very close in time (15 min), in at least 3 different instances. Note: we use the IP addresses for definition, but not detection 3,656 Sockpuppets 1,623 Puppetmasters
  • 19. How to compare sockpuppets & ordinary users? For each sockpuppet, match an ordinary user that makes similar number of posts on similar discussions We have to match!
  • 21. Smoothzilla Feb 5 2013, 3PM Thanks for your support!!!! Falcon-X32 Feb 5 2013, 3PM I agree. You are absolutely right! jakey008 Feb 5 2013, 2PM should have read the reviews first :( ricobeans27 Feb 5 2013, 3PM Couldnt agree more. Interact more with each other p < 10-3 Upvote each other more p < 10-3 Relation between pair of sockpuppets
  • 22. Do puppetmasters lead double lives? Double life hypothesis: Puppetmaster maintains distinct personality for the two sockpuppets More simiar Less similar Ordinary Sockpuppet 1 Sockpuppet 2 Similarity is measured as cosine similarity between user posts features: LIWC, sentiment, number of words, etc.
  • 23. Alternate hypothesis: Puppetmaster operates both sockpuppets similarly Less similar More similar Ordinary Sockpuppet 1 Sockpuppet 2 Do puppetmasters lead double lives? Similarity is measured as cosine similarity between user posts features: LIWC, sentiment, number of words, etc.
  • 24. Both sockpuppets are more similar to each other p < 10-3 Good sock/Bad sock not common Non-sockpuppet Sockpuppet 1 Sockpuppet 2 Do puppetmasters lead double lives?
  • 25. Why are sockpuppets created? Only for deception?
  • 26. Deceptiveness Levenshtein distance between usernames Numberofpairs 0 5 10 15 20 0100200300 Non-Pretenders Pretenders Sock pairs Random pairs srijan srijan2 srijan theRealBatman 2/31/3 Hypothesis: Deceptive sockpuppets of the same master have very different usernames.
  • 27. srijan Feb 5 2013, 3PM i agree.. these morons dont know a thing theRealBatman Feb 5 2013, 3PM YOU ARE STUPID AND A srijan Feb 5 2013, 2PM best article i have read!!! ricobeans27 Feb 5 2013, 3PM But this article doesnt make any sense More opinionated p < 10-3 Swear more p < 10-3 Downvoted and reported more p < 10-3 Pretender vs Non-pretender Sockpuppets
  • 28. How are sockpuppets used? Do sockpuppets always support one another?
  • 29. Neutral sockpuppets theRealBatman Feb 5 2013, 3PM why so? srijan Feb 5 2013, 3PM best article ever! We quantify the amount of support by counting assenting, negation and dissenting words from LIWC 60% Neutral
  • 30. Supporter sockpuppets theRealBatman Feb 5 2013, 3PM Totally agree!! srijan Feb 5 2013, 3PM best article ever! We quantify the amount of support by counting assenting, negation and dissenting words from LIWC 60% Neutral 30% Supporter
  • 31. Dissenter sockpuppets 60% Neutral 30% Supporter 10% Dissenter theRealBatman Feb 5 2013, 3PM I dont think so srijan Feb 5 2013, 3PM best article ever! We quantify the amount of support by counting assenting, negation and dissenting words from LIWC
  • 34. Features. Post: number of words, characters, etc.; LIWC counts; readability; sentiment. Community: number of upvotes and downvotes; fraction of reported posts; whether the account is reported. Activity: number of posts; number of replies; reciprocity of posts; age of account. Note: we are not using IP-based features.
  • 35. Is an account a sockpuppet?
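  • 36. Is an account a sockpuppet? [Figure: bar chart of AUC (axis from 0.5 to 1.0, baseline marked) by feature set.] AUC values for Post, Community, Activity, and All features, in the order shown on the slide: 0.57, 0.54, 0.59, 0.68.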
  • 37. Do two accounts belong to the same person?
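  • 38. Do two accounts belong to the same person? [Figure: bar chart of AUC (axis from 0.5 to 1.0, baseline marked) by feature set.] AUC values for Post, Community, Activity, and All features, in the order shown on the slide: 0.80, 0.56, 0.86, 0.91.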
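Both prediction tasks above are scored with AUC. As a reminder of what that number measures, here is a minimal sketch that computes it directly from its pairwise definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The labels and scores in the usage line are made up, and the slides do not describe the classifier or evaluation code that produced the reported values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values (1 = sockpuppet, in this setting).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.2, 0.8, 0.1])
```

An AUC of 0.5 corresponds to random ranking, which is why the charts start their axis there.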
  • 39. Conclusion: Sockpuppetry. Benign usage by non-pretender sockpuppets: primarily created to separate interests; they are respectful, operate similarly, and are neutral towards each other. Malicious usage by pretender sockpuppets: primarily created to give an illusion of consensus; they are abusive, and they support and defend each other.
  • 40. Disinformation on the Web Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes. S. Kumar, R. West, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web Conference, 2016 (WWW 2016)
  • 41. Types of false information. Misinformation: an honest mistake. Disinformation: a deliberate lie to mislead. Hoax: a deliberately fabricated falsehood made to masquerade as truth (Wikipedia).
  • 42. Why Wikipedia? "The free encyclopedia that anyone can edit." Easy to add (false) information. Freely accessible. Large reach. Major source of information for many.
  • 44. Why care about false information? If humans are good at identifying false information, then we don't need to worry. Setup: 64 public successful hoaxes, paired with similar legitimate articles; 320 random hoax/non-hoax pairs x 10 raters on Mechanical Turk. Results: random 50%, human 66%, classifier 86%. If a hoax looks like a genuine Wikipedia article, it is assumed to be credible. Accurate detection needs non-appearance features.
  • 45. Data: Wikipedia Hoaxes Hoax article vs hoax facts
  • 46. Data: Wikipedia Hoaxes Hoax article vs hoax facts 21,218 hoax articles Hoax lifecycle:
  • 47. Disinformation on the Web. Impact of hoaxes: can we quantify their impact? Characteristics of hoaxes: what are the hoaxes like? Detection of hoaxes: can we find them?
  • 48. Impact of hoaxes. "The worst hoaxes are those which (a) last for a long time, (b) receive significant traffic, (c) are relied upon by credible news media." (Jimmy Wales on Quora) Most hoaxes are caught soon, but some hoaxes (~1%) are impactful along all three axes.
  • 49. Disinformation on the Web. Impact of hoaxes: most hoaxes are caught soon, but some hoaxes are impactful! Characteristics of hoaxes: what are the hoaxes like? Detection of hoaxes: can we find them?
  • 50. Hoax: successful hoax (passes patrol, survives for a month, viewed frequently); failed hoax (flagged and deleted during patrol). Non-hoax: wrongly flagged (temporarily flagged); legitimate articles (never flagged).
  • 51. Characteristics of hoaxes. Appearance (how the article looks): hoax articles are longer, but they are mostly plain text and have fewer web and wiki links. Features: plain-text length; plain-text-to-markup ratio; wiki-link density; web-link density. Link-network: how the article connects. Support: how other articles refer to it. Editor: who created the article.
  • 52. Characteristics of hoaxes. Appearance: hoaxes mostly have text and few references. Link-network (how the article connects): hoax articles are less coherent than non-hoax articles; CC = 0 indicates an incoherent article, CC > 0 a coherent article. Support: how other articles refer to it. Editor: who created the article.
  • 53. Characteristics of hoaxes. Appearance: hoaxes mostly have text and few references. Link-network: hoaxes have incoherent wikilinks. Support (how other articles refer to it): other articles rarely refer to a hoax article, compared to non-hoax articles. When a reference does exist, it was made recently, by the hoaxster or by an IP address. Features: number of prior mentions; time since first mention; creator of first mention. Editor: who created the article.
  • 54. Characteristics of hoaxes. Appearance: hoaxes mostly have text and few references. Link-network: hoaxes have incoherent wikilinks. Support: hoaxes have few, recent, suspicious mentions. Editor (who created the article): hoax creators are more recently registered and have less editing experience. Features: creator's account age; creator's experience.
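The four appearance features listed on this slide could be computed from raw wiki markup along the following lines. This is a rough sketch under stated assumptions: the slide does not define the features precisely, so densities are assumed here to be links per plain-text word, the ratio is assumed to be plain-text length over markup length, and the regular expressions handle only simple `[[wiki links]]`, `[http://... external links]`, and quote-based bold/italic markup.

```python
import re

# Simplified patterns for MediaWiki-style links (assumption: no nested markup).
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
WEB_LINK = re.compile(r"\[(https?://[^\s\]]+)(?:\s+([^\]]+))?\]")


def appearance_features(markup):
    """Compute assumed versions of the four appearance features."""
    wiki_links = WIKI_LINK.findall(markup)
    # Replace each wiki link with its display text (or its target if none).
    text = WIKI_LINK.sub(lambda m: m.group(2) or m.group(1), markup)
    web_links = WEB_LINK.findall(text)
    # Replace each external link with its label (or nothing if unlabeled).
    text = WEB_LINK.sub(lambda m: m.group(2) or "", text)
    text = re.sub(r"'{2,}", "", text)  # drop bold/italic quote markup
    words = text.split()
    plain = " ".join(words)
    n_words = len(words)
    return {
        "plain_text_length": len(plain),
        "plain_text_to_markup_ratio": len(plain) / len(markup) if markup else 0.0,
        "wiki_link_density": len(wiki_links) / n_words if n_words else 0.0,
        "web_link_density": len(web_links) / n_words if n_words else 0.0,
    }


# Made-up article text, for illustration only.
sample = "'''Foo''' is a [[town|small town]] in [[Bar]]. See [http://example.com site]."
features = appearance_features(sample)
```

Real wiki markup also contains templates, tables, and references, so a production version would more likely rely on a proper wikitext parser than on regular expressions.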
  • 55. Disinformation on the Web. Impact of hoaxes: most hoaxes are caught soon, but some hoaxes are impactful! Characteristics of hoaxes: hoaxes are different from non-hoaxes in many respects! Detection of hoaxes: can we find them?
  • 56. Detection of hoaxes. Will a hoax get past patrol? AUC = 71% (appearance features). Is an article a hoax? AUC = 98% (editor and network features). Is an article flagged as a hoax really one? AUC = 86% (editor and support features).
  • 57. We discovered new hoaxes! Example: Steve Moertel, American popcorn entrepreneur: 6 years 11 months! Flagged by us, deleted by Wikipedia administrators.
  • 58. Overall conclusions. Misbehavior and misinformation are prevalent on the web and compromise its safety and integrity. Sockpuppets: used for both benign and malicious purposes; malicious ones attempt to create an illusion of consensus. Hoaxes: successfully mislead readers, are impactful, and have distinct characteristics. Detection is important, and feature engineering works experimentally.
  • 59. You may also be interested in. Tutorials on misbehavior and misinformation: Data-Driven Approaches towards Malicious Behavior Modeling. Jiang et al., SIGKDD 2017; Antisocial Behavior on the Web: Characterization and Detection. Kumar et al., WWW 2017. Vandals in Wikipedia: VEWS: A Wikipedia Vandal Early Warning System. Kumar et al., SIGKDD 2015. Language and deception: Linguistic Harbingers of Betrayal: A Case Study on an Online Strategic Game. Niculae et al., ACL 2015. Social network algorithm for troll detection: Accurately Detecting Trolls in Slashdot Zoo via Decluttering. Kumar et al., ASONAM 2014. More details at: http://cs.stanford.edu/~srijan
  • 60. MIS2 workshop at WSDM 2018. MIS2: Misinformation and Misbehavior Mining on the Web. Feb 9, 2018, in Los Angeles, CA, held in conjunction with WSDM 2018. Submit your awesome papers, tools, demos, and more! Full research papers, short papers, works in progress, and extended abstracts are welcome! Submissions due: Nov 20, 2017. Notifications due: Dec 14, 2017. Website: http://snap.stanford.edu/mis2/