Insights Into Socio Politics Using
Data Analytics
A presentation by
About Politweet
• Researching the socio-economic and political
interest of Malaysians
• Developing analytical tools for Twitter
research
• Creating interactive, data-driven sites about
socio-economic and political topics
#bdw2013 #bigdataMY
Today’s Talk
• Overview of our data pipeline
• Building timelines of historical events
• Measuring user opinion
• Measuring political partisanship
• Visualising voter migration
Technical Details
• Runs on PostgreSQL, MySQL and PHP on
Fedora Linux
• Events
– 6.3 million tweets from 1.6 million users
• Politicians’ mentions
– 5.5 million tweets from 385 thousand users
• Tweets related to American elections
– 12 million tweets from 2 million users
BUILDING TIMELINES
Building Timelines
• Tweets as historical record
• Bersih2 rally for electoral reforms
– July 9th 2011
– Goal: to reach Stadium Merdeka
– 85,372 tweets from 19,190 users
– 17,452 mentions of locations collected for
investigative purposes
Methodology
1. Identify most re-tweeted tweet for each hour
2. Identify peak time periods for event
3. Identify peak time periods for locations
4. View tweeted images for each hour
5. Watch videos that are supported by tweet
evidence
6. Combine all this information to establish a
timeline, cross-reference by reading tweets in
sequence to help separate rumour from fact
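Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration, not Politweet's actual pipeline (which the slides describe as running on PostgreSQL, MySQL and PHP). The record layout, the retweet counts and the function names are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet records: (timestamp, text, retweet_count).
# The retweet counts are invented for illustration only.
tweets = [
    (datetime(2011, 7, 9, 10, 5), "Large crowd at Masjid Negara", 40),
    (datetime(2011, 7, 9, 10, 40), "Tear gas at KLCC", 95),
    (datetime(2011, 7, 9, 13, 10), "Tear gas fired near Central Market", 210),
    (datetime(2011, 7, 9, 13, 30), "LRT stations closed", 120),
    (datetime(2011, 7, 9, 13, 55), "Water cannon being used", 80),
]

def most_retweeted_per_hour(records):
    """Step 1: return {hour: (text, retweet_count)} for the top tweet in each hour."""
    best = {}
    for ts, text, retweets in records:
        if ts.hour not in best or retweets > best[ts.hour][1]:
            best[ts.hour] = (text, retweets)
    return best

def peak_hours(records, top_n=1):
    """Step 2: return the top_n hours ranked by tweet volume."""
    volume = Counter(ts.hour for ts, _, _ in records)
    return [hour for hour, _ in volume.most_common(top_n)]

print(most_retweeted_per_hour(tweets))
print(peak_hours(tweets))
```

A high retweet count only flags a tweet for review. The most re-tweeted item in an hour can still be a rumour, which is why step 6 cross-references it against the surrounding tweets.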
#bersih2 Twitter Activity
[Chart: tweets and users per hour on July 9. X-axis: Hour (1 to 24). Y-axis: Tweets/Users (0 to 14,000). Series: July 9 Users, July 9 Tweets.]
#bersih2 Area Activity
#bersih2 Timeline
• 8 AM – People making journey to city; reports
of roadblocks
• 9 AM – Arrests being made; police checking IC
at KTM and LRT
• 10 AM – More arrests being made at KL
Sentral, Masjid Jamek, Sogo; Large crowd
reported at Masjid Negara; False report of tear
gas fire at KLCC
• 11 AM – 236 people arrested so far; police
targeting people in Bersih tees
• 12 PM – More arrests; crowds
gathered/moving at old railway station,
Central Market and Petaling Street
• 1 PM – Tear gas being fired near Central
Market; water cannon being used; massive
crowd gathered at Jalan Sultan, Puduraya; LRT
stations closed
• 2 PM – Police action continues. The crowd at
Puduraya has broken up, 1 section proceeds to
Tung Shin hospital while the remainder heads to
Stadium Merdeka and KLCC.
• The earlier crowd that remained at Jalan Sultan
and Jalan Petaling was spared from similar
police action.
• Bersih and Pakatan leaders were tear-gassed at
KL Sentral, following an attempt to break through
the police blockade
• 2.30 PM – Police action continues. Tear gas is
fired into Tung Shin hospital grounds. Crowd
at Stadium Merdeka remains calm.
• 3 PM – More arrests being made of crowd
members at Tung Shin hospital. Crowd is
scattered.
• 4 PM – Crowd begins to disperse in some
areas. Large crowd reported at KLCC.
#bersih2 Area Activity (revisited)
Crowd Estimation
• Timeline establishes peak period
• Photos determine extents
• Google Maps used to measure area
• Crowd density estimated as average persons
per sq. ft.
• Final estimate: 45,000 to 50,000 people
attended the rally
Puduraya
Crowd Estimation Sample
Area covered: 127,536 sq.ft.
Estimated crowd: 31,884 people
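The arithmetic behind this sample can be checked directly. Dividing the area by the estimated crowd gives 4 sq. ft. per person, or 0.25 persons per sq. ft. The slide does not state that spacing; it is simply what the two figures imply. The function name below is a hypothetical helper for illustration.

```python
def estimate_crowd(area_sq_ft, sq_ft_per_person):
    """Estimate head count from a measured area and an assumed space per person."""
    return round(area_sq_ft / sq_ft_per_person)

puduraya_area = 127_536      # sq. ft., from the slide
puduraya_estimate = 31_884   # people, from the slide

# Spacing implied by the slide's own figures (not stated on the slide).
implied_spacing = puduraya_area / puduraya_estimate   # sq. ft. per person
implied_density = puduraya_estimate / puduraya_area   # persons per sq. ft.

print(implied_spacing, implied_density)
print(estimate_crowd(puduraya_area, implied_spacing))
```

The same function can be applied to each measured area at the peak period, with the totals summed to reach an overall estimate.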
Himpunan Kebangkitan Rakyat
• People’s Uprising Rally
• January 12th 2013
• Applied the same techniques to build a
timeline
Crowd estimation
MEASURING USER OPINION
Measuring User Opinion
• Sentiment analysis on tweets
• Standard approaches
– Classify sentiment based on words or phrases
– Use Support Vector Machine (SVM) technique to
build topic-specific classifiers
• Demonstration: Tweets on #MansuhPTPTN
(Abolish PTPTN)
Word-based Classifier
[Figure: sample tweets, each labelled neutral, positive or negative]
Identify keywords to determine sentiment
Result: 2 positive, 11 neutral, 1 negative
Word-based Classifier
[Figure: sample tweets, each labelled neutral, positive or negative]
Let's add ‘ditahan’ and ‘blacklist’ to the list of negative words
Result: 2 positive, 5 neutral, 6 negative
Word-based Classifier
• Word and phrase-based classifiers are good at
measuring ‘mood’ of a tweet
• Often result in a large percentage of neutral sentiment
• Now we try Support Vector Machine (SVM)
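A word-based classifier of the kind described above can be sketched as follows. The negative words 'ditahan' and 'blacklist' come from the slides. The positive words, the sample tweets and the function name are assumptions chosen for illustration, not Politweet's actual word lists.

```python
import re

NEGATIVE_WORDS = {"ditahan", "blacklist"}   # from the slides
POSITIVE_WORDS = {"sokong", "setuju"}       # hypothetical examples ("support", "agree")

def classify(text, positive_words, negative_words):
    """Label a tweet by counting which keyword list it matches more often."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    positive_hits = len(words & positive_words)
    negative_hits = len(words & negative_words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"   # no keywords matched, or the matches cancel out

# Hypothetical tweet: neutral until 'ditahan' is added to the negative list.
tweet = "Pelajar ditahan polis"
print(classify(tweet, POSITIVE_WORDS, set()))
print(classify(tweet, POSITIVE_WORDS, NEGATIVE_WORDS))
```

Any tweet that matches no keyword falls through to neutral, which is one reason this approach tends to report so much neutral sentiment.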
SVM Approach
[Figure: sample tweets, each labelled neutral, positive or negative]
Certain phrases are used by supporters of the proposal
Keywords influence results
Result: 4 positive, 9 neutral, 1 negative
SVM Approach
• SVM improves results but requires training
sets of data
• Not practical for infrequent topics, such as the
PTPTN issue
• For regular issues, constant training required
to keep up to date
• Does not reliably tell us the final opinion of
the user
Deducing Final Opinion
[Figure: sample tweets, each labelled neutral, positive or negative]
If the last tweet was positive, does that imply positive opinion?
Our Methodology
1. Collect all tweets from users on a given topic
for a fixed length of time
2. A human examines tweets in sequence, on a
per-user basis
3. Based on the examination, determine the
final opinion of the user
4. Common reasons for supporting or opposing an
issue are noted
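The bookkeeping around the human review can be sketched as below. The code only prepares each user's tweets in sequence and tallies the opinions a researcher assigns afterwards; it makes no judgement itself. The record layout, the opinion labels and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def tweets_by_user(records):
    """Steps 1-2: group (user, timestamp, text) records per user, in time order."""
    grouped = defaultdict(list)
    for user, _timestamp, text in sorted(records, key=lambda r: (r[0], r[1])):
        grouped[user].append(text)
    return dict(grouped)

def tally_opinions(final_opinions):
    """Steps 3-4: count one opinion per user, regardless of tweet volume."""
    return Counter(final_opinions.values())

# Hypothetical records; timestamps are simple integers here.
records = [
    ("user_a", 2, "second tweet"),
    ("user_b", 1, "only tweet"),
    ("user_a", 1, "first tweet"),
    ("user_a", 3, "third tweet"),
]
reading_order = tweets_by_user(records)

# Assigned by a human researcher after reading each sequence (hypothetical).
final_opinions = {"user_a": "support", "user_b": "oppose"}
print(reading_order)
print(tally_opinions(final_opinions))
```

Because the tally is keyed by user, user_a's three tweets carry the same weight as user_b's single tweet.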
Testing Our Method
[Figure: a user's tweets in sequence, each labelled neutral, positive or negative]
Researcher determines this user supports the proposal to abolish PTPTN
The opposition to the methods of student activists is noted.
This user is not opposed to a reduction in interest rate, instead of abolishing outright
Opinion-based Sentiment Analysis
• Pro
– More accurate measurement of sentiment than
standard approaches
– Offers details on why users oppose or support an issue
– Not influenced by large volume of tweets
• Con
– Time-consuming to prepare
– Requires researchers familiar with the language and
the issue
Geo-located Sentiment Analysis
• Same methodology, but only on geo-located
tweets
• Results in sentiment based on location, and
how many in the area tweeted about the topic
• Demonstration: Himpunan Kebangkitan
Rakyat (People’s Uprising Rally) on January
12th
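Once each geo-located user has a final opinion, the per-location results described above reduce to a grouped count. The sketch below assumes the location has already been resolved from the tweet coordinates. The rows, the opinion labels and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (user, location, final_opinion), one row per user.
rows = [
    ("user_a", "Kuala Lumpur", "support"),
    ("user_b", "Kuala Lumpur", "oppose"),
    ("user_c", "Selangor", "support"),
    ("user_d", "Kuala Lumpur", "support"),
]

def sentiment_by_location(user_rows):
    """Return {location: {opinion: number_of_users}}."""
    tally = defaultdict(Counter)
    for _user, location, opinion in user_rows:
        tally[location][opinion] += 1
    return {location: dict(counts) for location, counts in tally.items()}

def users_per_location(user_rows):
    """Return how many users in each location tweeted about the topic."""
    return dict(Counter(location for _user, location, _opinion in user_rows))

print(sentiment_by_location(rows))
print(users_per_location(rows))
```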
Plans for the Future
• Build a Malay-language SVM to determine
sentiment on tweets
• Use sampling to estimate the opinion of the
Twitter user population
POLITICAL PARTISANSHIP
Measuring Political Partisanship
• Who we follow
• Who we mention
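One way to turn "who we follow" into a partisanship label is sketched below. The slides do not describe the actual scoring rule, so the 75% threshold and the function name are purely assumptions. The labels PR, BN and Bipartisan follow the categories used in the speaker notes.

```python
from collections import Counter

def partisanship(followed_coalitions, threshold=0.75):
    """Label a user from the coalitions of the politicians they follow.

    followed_coalitions: one "PR" or "BN" entry per followed politician.
    threshold: hypothetical share needed to count as partisan.
    """
    counts = Counter(followed_coalitions)
    total = counts["PR"] + counts["BN"]
    if total == 0:
        return None   # follows no tracked politicians
    if counts["PR"] / total >= threshold:
        return "PR"
    if counts["BN"] / total >= threshold:
        return "BN"
    return "Bipartisan"

print(partisanship(["PR", "PR", "PR", "BN"]))
print(partisanship(["PR", "BN"]))
```

The same function works for "who we mention" if the input list holds the coalition of each mentioned politician instead.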
Who we follow
Who we mention
Facebook
Voter migration
Contact details
• Facebook : Fb.com/politweet
• Twitter : @politweetorg
• Email : admin@politweet.org


Editor's Notes

  1. Started in 2009. Initial goal was to build an archive of tweets by Malaysian politicians to serve an audience of bloggers, journalists and the politicians themselves. The aim was to get people more interested in what the government is doing and in expressing their opinions through social media.
  2. What’s retweeted might be false. On the morning of the rally there were false reports of tear gas launch at KLCC
  3. The crowd is merging at Puduraya
  4. Scenes from Puduraya
  5. A mix of anti-establishment protests – bersih, anti-lynas
  6. 63976 – 78193 people
  7. The PTPTN issue was a proposal by PKR to abolish PTPTN and provide free education. It was a hot topic on Twitter in April 2012.
  8. This person supports the proposal
  9. Makes it worse. Problem with tweet-based classifiers is a tweet is part of a conversation
  10. Conflicting opinions cancel each other out, though certain phrases may have different weightage resulting in a net positive / negative
  11. Positive/negative in relation to the proposal
  12. Focus on individual opinion, not volume of mentions
  13. 138,784 tweets from 22,916 users shown on this map. 791 users tweeted about the event.
  14. We collected 138,784 tweets from 22,916 users located in Malaysia. 791 users tweeted about the rally. This is the sample used for sentiment analysis. 575 (72.69%) within Kuala Lumpur & Selangor; 216 (27.31%) outside Kuala Lumpur & Selangor.
  15. Feb 2013 census
  16. Jan-Aug 2012 mentions
  17. 801K in Dec to 1.67 million 2 days ago: PR (676,220), Bipartisan (449,200), BN (547,800)
  18. Size of node based on migration in. Color of node based on migration out, yellow->orange->red->purple. Thickness of arrow based on number of voters. Subang received 15K, outgoing 5.7K