Analyzing Twitter data
Issues, Challenges and Opportunities



RC33 Conference, Sydney Australia,
9-13 July 2012



Maurice Vergeer
m.vergeer@maw.ru.nl / www.mauricevergeer.nl / blog.mauricevergeer.nl
Radboud University Nijmegen, the Netherlands
   Many platforms
    -   Facebook
    -   Twitter
    -   Linkedin
    -   Hyves
    -   RenRen
    -   Cyworld
    -   Orkut
    -   Youtube
    -   Flickr
    -   Plurk
    -   Sina Weibo
    -   Etc.

   Empty platform / infrastructure
    -   Facility

   User generated content
    -   Text
    -   Audio
    -   Video
    -   Pictures



Social media
[Figure: Number of articles on politics, Internet and social media, per year, 1995-2012. Vertical axis: number of articles (0-180). Series: Internet and politics (query 1); Social media and politics (query 2); Internet, social media and politics (query 3).]

Source: Vergeer (in press / 2012) in New Media & Society
Focus on Twitter
The Netherlands



  A special case?
Social media presentation held at RC33 conference, Sydney, Australia
   Opportunities
    ◦ Methodological/technical
       Timeseries analysis
       Network analysis
        ◦ Actors
        ◦ Content
        ◦ Diffusion of information through online social networks
        ◦ Social media activities

   Limitations
    ◦ Twitter
       Reliability of Twitter API




Outline
•   Within Twitter (using the API)
    • Username
    • Account creation date
    • # of followers
      • And the actual usernames of these followers
    • # of accounts followed
      • And the actual usernames of those being followed
    • Tweet text

    • And many more (see dev.twitter.com)




Data sources
   Tweet
    ◦ Tweet text

    ◦ Whether or not it was a reply to another tweet
       To whom it was a reply (username/screenname and numerical
        userid)

    ◦ Whether or not it was a retweet (according to Twitter)
       Which tweet was retweeted (numerical tweetid)
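
The fields listed above can be pulled out of a tweet object roughly as follows. This is a minimal sketch, assuming the JSON layout of the Twitter REST API v1.1 tweet object (`user.screen_name`, `friends_count`, `in_reply_to_user_id`, `retweeted_status`, and so on); the function name and the sample payload are made up for illustration and are not taken from the presentation.

```python
from typing import Any


def extract_tweet_fields(tweet: dict[str, Any]) -> dict[str, Any]:
    """Flatten one tweet object into the user and tweet fields listed above."""
    user = tweet.get("user", {})
    # Assumption: 'retweeted_status' is present only when Twitter itself
    # registered the tweet as a retweet.
    retweeted = tweet.get("retweeted_status")
    return {
        "username": user.get("screen_name"),
        "account_created": user.get("created_at"),
        "n_followers": user.get("followers_count"),
        "n_followed": user.get("friends_count"),
        "text": tweet.get("text"),
        "is_reply": tweet.get("in_reply_to_user_id") is not None,
        "reply_to_username": tweet.get("in_reply_to_screen_name"),
        "reply_to_userid": tweet.get("in_reply_to_user_id"),
        "is_retweet": retweeted is not None,
        "retweeted_tweetid": retweeted.get("id") if retweeted else None,
    }


# Made-up payload, for illustration only.
sample = {
    "id": 111,
    "text": "@another_user thanks for the question",
    "in_reply_to_screen_name": "another_user",
    "in_reply_to_user_id": 42,
    "user": {
        "screen_name": "example_user",
        "created_at": "Wed May 12 06:30:00 +0000 2010",
        "followers_count": 10,
        "friends_count": 20,
    },
}

row = extract_tweet_fields(sample)
print(row)
```

The lists of follower and followed usernames are not part of the tweet object and would have to be collected through separate API requests.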
   Message of tweet

   Whether or not it was a directed tweet
    (sent to someone in particular)
    ◦ Identified by an @-sign


   Whether or not it was a retweet
    ◦ Identified by RT




Type of content
   Undirected tweet
    ◦ RCMP Commissioner appearing before Public Safety Cmte now.
      What a popular guy - he has his own paparazzi!

   Directed tweet
    ◦ Fantastic blog by my good friend @GlenPearson -
      http://bit.ly/hlAKXp #lpc

   Directed tweet to two usernames
    ◦ @miken32 @CBCEdmonton probably because that is NOT what I
      said--more commercially viable is different than not needed.

   Retweet
    ◦ RT @liberal_party: Think Durham deserves better than Bev Oda?
      Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham
      #cdnpoli #lpc




Tweet examples
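
The three tweet types above can be told apart from the tweet text alone. The sketch below is one possible rule set, not the procedure used in the study: a tweet counts as a retweet when "RT" is followed by an @-username, as directed when it contains any other @-username, and as undirected otherwise. The function names and both regular expressions are assumptions.

```python
import re

# An @-username that is not preceded by a word character
# (this skips e-mail addresses such as name@example.org).
MENTION = re.compile(r"(?<!\w)@(\w+)")
# "RT" (any case) at the start or after whitespace, followed by an @-username.
RETWEET = re.compile(r"(?:^|\s)RT\s*@\w+", re.IGNORECASE)


def extract_mentions(text: str) -> list[str]:
    """Return all @-usernames in the tweet, in order of appearance."""
    return MENTION.findall(text)


def classify_tweet(text: str) -> str:
    """Classify a tweet as 'retweet', 'directed' or 'undirected'."""
    if RETWEET.search(text):
        return "retweet"
    if MENTION.search(text):
        return "directed"
    return "undirected"


examples = [
    "RCMP Commissioner appearing before Public Safety Cmte now. "
    "What a popular guy - he has his own paparazzi!",
    "Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc",
    "@miken32 @CBCEdmonton probably because that is NOT what I said",
    "RT @liberal_party: Think Durham deserves better than Bev Oda? "
    "Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham #cdnpoli #lpc",
]

for tweet in examples:
    print(classify_tweet(tweet), extract_mentions(tweet))
```

Checking for a retweet first matters: a retweet always contains an @-username and would otherwise be counted as a directed tweet.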
   Traditional material
    ◦ Produced by professional actors
    ◦ Newspapers
    ◦ Public administration documents

   Social media
    ◦ Produced by
       professional actors
       general public




Content analysis of tweets
   Large quantities of data

   Word frequencies
    ◦ Identifying the most important words in the corpus
    ◦ Code these words into more general categories

   Switch to SPSS (or other type of data management tool)
    ◦ Search for the words in the actual tweets
    ◦ Assign tweet to a specific code

   Improvements in SPSS
    ◦ Compute command facilitates many new text operators
    ◦ Char.index, Char.substr, etc

   Alternative
    ◦ Regular expressions (complex)




Data extraction
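
The word-frequency and coding steps above can also be done outside SPSS. The following is a minimal Python sketch of the same idea: count words across the corpus, then assign each tweet the categories whose keywords it contains. The tokenizer, the `codebook` categories and their keywords are hypothetical choices for illustration.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"#?\w+")  # words and hashtags


def tokenize(text: str) -> list[str]:
    """Lower-case the tweet, drop URLs, and split it into words and hashtags."""
    return TOKEN.findall(URL.sub(" ", text).lower())


tweets = [
    "RCMP Commissioner appearing before Public Safety Cmte now. "
    "What a popular guy - he has his own paparazzi!",
    "Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc",
    "RT @liberal_party: Think Durham deserves better than Bev Oda? "
    "Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham #cdnpoli #lpc",
]

# Step 1: word frequencies over the whole corpus.
word_counts = Counter(token for tweet in tweets for token in tokenize(tweet))

# Step 2: a hypothetical codebook mapping frequent words to general categories.
codebook = {
    "party": {"#lpc", "#cdnpoli"},
    "event": {"rally"},
}


def code_tweet(text: str) -> list[str]:
    """Step 3: return the categories whose keywords occur in the tweet."""
    tokens = set(tokenize(text))
    return sorted(cat for cat, words in codebook.items() if tokens & words)


print(word_counts.most_common(5))
print([code_tweet(tweet) for tweet in tweets])
```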
   Publicly available data sources on
    parliament, election council

   Time series
    ◦ Identifying societal/political events
      relevant for the study at hand
      Ex. 1: Temporary shutdown of the election campaign
       due to the crash of a passenger plane with many
       Dutch passengers in Libya, May 2010
      Ex. 2: Deregistration of the People's Political Power
       Party of Canada




External data sources
[Figure: bar chart, vertical axis 0-900. Categories: newspaper, broadcasting, radio, news agency, magazine, online only, local. Series: institutional Twitter account; personal Twitter account.]
Source: Vergeer & Hermans (forthcoming / 2013)
in Journal of Computer-Mediated Communication
[Figure: daily time series, 1 May 2010 to 9 June 2010, vertical axis 0-1000. One series per party: CDA, PvdA, SP, VVD, PVV, GL, CU, D66, PvdD, SGP, NN, TON, MenS, HNL, Partij1, Piraten.]

   Date and time

   For longitudinal analysis and cross-national comparisons
    ◦ Take note of the time differences and correct if necessary.
        Time zones
        Daylight saving time

   What to do with countries having multiple time zones?
    ◦ Depends on RQs
       Communication patterns: keep a single time zone
       Focus on individual daily patterns: adjust for time zones
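
As an illustration of the correction, the sketch below converts a tweet timestamp from UTC to local time, with daylight saving time handled by the time zone database. It assumes the `created_at` string format of the Twitter REST API v1.1 ("Tue Jun 08 22:30:00 +0000 2010") and Python 3.9 or later for `zoneinfo`; the function name is made up.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumed layout of the API's created_at field (always reported in UTC).
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_local(created_at: str, tz_name: str) -> datetime:
    """Parse a created_at string and convert it to the given time zone."""
    utc_time = datetime.strptime(created_at, TWITTER_FORMAT)
    return utc_time.astimezone(ZoneInfo(tz_name))


# Summer: Amsterdam is at UTC+2, so a late-evening UTC tweet
# falls on the next calendar day in local time.
summer = to_local("Tue Jun 08 22:30:00 +0000 2010", "Europe/Amsterdam")
# Winter: Amsterdam is at UTC+1, same calendar day.
winter = to_local("Fri Jan 15 22:30:00 +0000 2010", "Europe/Amsterdam")

print(summer.isoformat())  # 2010-06-09T00:30:00+02:00
print(winter.isoformat())  # 2010-01-15T23:30:00+01:00
```

The first example shows why the correction matters for daily counts: without it, the tweet would be assigned to 8 June instead of 9 June.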
   Total tweets by candidates, followers and followed:
    ◦ 4,536,854 tweets

   Breakdown
    ◦ Tweets among candidates: appr. 2%
    ◦ Tweets to inner circles (followers or being followed): appr. 18%
    ◦ Tweets to outer circle: appr. 33%
    ◦ Tweets not directed to anyone in particular: appr. 49%

    ◦ Extracting users from tweets (@-addresses)




Communication network analysis
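
A breakdown like the one above could be computed by extracting the @-addresses from each tweet and checking them against known sets of usernames. The sketch below is an assumption about how such a classification might work, not the study's actual rules: the priority order (candidate before inner circle before outer circle), the function name and the example username sets are all hypothetical.

```python
import re
from collections import Counter

MENTION = re.compile(r"(?<!\w)@(\w+)")


def addressee_circle(text: str, candidates: set[str], inner_circle: set[str]) -> str:
    """Classify a tweet by the closest circle among the users it addresses."""
    mentions = {name.lower() for name in MENTION.findall(text)}
    if not mentions:
        return "undirected"
    if mentions & candidates:
        return "candidate"
    if mentions & inner_circle:
        return "inner circle"
    return "outer circle"


# Hypothetical username sets (lower-case), for illustration only.
candidates = {"bobraemp"}
inner_circle = {"glenpearson"}  # followers of, or followed by, a candidate

tweets = [
    "RCMP Commissioner appearing before Public Safety Cmte now.",
    "Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc",
    "@miken32 @CBCEdmonton probably because that is NOT what I said",
    "RT @liberal_party: Join @BobRaeMP for a rally tomorrow at 1pm",
]

counts = Counter(addressee_circle(t, candidates, inner_circle) for t in tweets)
shares = {circle: n / len(tweets) for circle, n in counts.items()}
print(shares)
```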
 Communication network based on
  candidates identified in tweets
 Excluding the general public




Communication network analysis
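
A candidates-only communication network can be represented as a weighted, directed edge list: one edge from sender to each addressed candidate, weighted by the number of tweets. The sketch below is a minimal version under stated assumptions; the function name, the (sender, text) input format and the usernames are hypothetical, and self-addressed tweets are dropped.

```python
import re
from collections import Counter

MENTION = re.compile(r"(?<!\w)@(\w+)")


def candidate_edges(tweets: list[tuple[str, str]], candidates: set[str]) -> Counter:
    """Count directed (sender, addressee) pairs, keeping candidates only."""
    edges: Counter = Counter()
    for sender, text in tweets:
        sender = sender.lower()
        if sender not in candidates:
            continue  # exclude tweets sent by the general public
        # A set, so that a user addressed twice in one tweet counts once.
        targets = {name.lower() for name in MENTION.findall(text)}
        for target in targets:
            if target in candidates and target != sender:
                edges[(sender, target)] += 1
    return edges


# Hypothetical data: (sender, tweet text) pairs.
candidates = {"alice", "bob", "carol"}
tweets = [
    ("alice", "@bob thanks for the debate"),
    ("bob", "@alice @carol see you at the rally"),
    ("alice", "@bob @dave agreed"),
    ("dave", "@alice hello"),
]

edges = candidate_edges(tweets, candidates)
print(edges)
```

The resulting edge list can be loaded into most network analysis packages as a weighted directed graph.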
   See http://tinyurl.com/blzajsl for
    animated version.
   Retrospective
    ◦ 3200 tweets back in time

   Cost and technical
    ◦ Access to firehose for real time data




Limitations in data collection
   Date of tweet
    ◦ A small fraction of tweets is time-stamped with the wrong date
   Solution
    ◦ Estimate date and time using the tweetid

   Status of tweet as retweet
    ◦ Manual retweets (RT) are not flagged as retweets by the API
   Solution:
       Use text search operators to identify real retweets (“RT ”, “rt ”)
        Also see http://tinyurl.com/bohhjzn

   Reply to tweets
    ◦ Only the first address is identified
   Solution
    ◦ Search for multiple @-addresses using text extraction methods



Reliability of data as provided by
the API
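
The slides do not say how date and time are estimated from the tweetid. One known route, sketched below, works for IDs issued after Twitter switched to its Snowflake ID scheme in November 2010: the bits above the lowest 22 hold the number of milliseconds since a fixed epoch. Tweets from before that switch, including those from the May-June 2010 campaign, have sequential IDs without an embedded timestamp, so for them the time would have to be estimated differently, for example by interpolating between tweets with trusted timestamps. The function names are made up.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch: 2010-11-04 01:42:54.657 UTC
TIMESTAMP_SHIFT = 22              # the low 22 bits hold worker and sequence numbers


def tweet_id_to_unix_ms(tweet_id: int) -> int:
    """Return the creation time embedded in a Snowflake tweet ID, in Unix ms."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Return the embedded creation time as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(tweet_id_to_unix_ms(tweet_id) / 1000, tz=timezone.utc)


# Synthetic ID for illustration: exactly one second after the Snowflake epoch.
example_id = 1000 << TIMESTAMP_SHIFT
print(tweet_id_to_datetime(example_id).isoformat())
```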
BIG DATA

The buzz word of these days
 Not gigabytes or terabytes,
 But petabytes and exabytes of data
 Only for the few
 Specific hardware requirements
    ◦ Computing power
    ◦ Data storage
   The data presented in this presentation
    ◦ Appr. 4.5 million records equal appr. 1
      gigabyte: not that big
There is still so much to be done
with…
•   Focus on specific cases
     - political communication:
         politicians – candidates in elections
     - fan studies
         celebrities
         cast of popular soap operas
     - journalism studies
         journalists and newspapers





Focus on specific cases
 actor information
 information on societal events
 accumulate data over time using the
  same data structure
    ◦ Prolonged analysis
    ◦ Multiple case studies, cross-national
      comparative analysis




Enrich existing Twitter data with
external data
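
Enrichment of this kind amounts to a join: actor information is matched on username and event information on date. The sketch below shows the idea with plain dictionaries; all names, the party labels, the event label and its date are hypothetical placeholders, not data from the study.

```python
from datetime import date

# Hypothetical external data: actor attributes keyed by username,
# and societal events keyed by date.
actors = {
    "candidate_a": {"party": "Party X", "list_position": 1},
    "candidate_b": {"party": "Party Y", "list_position": 3},
}
events = {date(2010, 5, 12): "campaign temporarily suspended"}


def enrich(tweet: dict, actors: dict, events: dict) -> dict:
    """Return a copy of the tweet record with actor and event fields added."""
    enriched = dict(tweet)
    enriched.update(actors.get(tweet["username"], {}))
    enriched["event"] = events.get(tweet["date"])  # None when no event that day
    return enriched


tweet = {"username": "candidate_a", "date": date(2010, 5, 12), "text": "..."}
enriched = enrich(tweet, actors, events)
print(enriched)
```

Keeping the record structure identical across collection rounds is what makes it possible to append new data and rerun the same join.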
   Traditional process (textbook approach)
    ◦ RQ -> research design

   Practice, particularly with secondary (i.e. third-party) data
    ◦ Data -> RQ -> research design
    ◦ Data -> research design -> RQ

Twitter
    Content analysis
    Longitudinal analysis
    Network analysis

   Different research designs require different techniques
   Collaborate



Look at the data from different
angles, i.e. research designs
Thank you for your attention
