Czech Twitter as a data mining source

Josef Šlerka, WebExpo 2009
Twitter.com
Twitter is a free social networking and microblogging
service that enables its users to send and read
messages known as tweets.

Tweets are text-based posts of up to 140 characters
displayed on the author's profile page and delivered
to the author's subscribers, who are known as
followers.
                                         (Wikipedia)
What is data
mining and how is
it connected with
Twitter?
Data mining is the process of extracting
patterns from data. As more data are gathered,
data mining is becoming an increasingly
important tool to transform these data into
information.
                                   (Wikipedia)

Variations include text mining and web mining,
including semantic analysis
Twitter Data mining


- makes it easy to use all data mining methods

- adds "time" & "space"

- provides a real-time picture

- connects easily with other social media (about 30% of
users use the same nickname on all platforms)
Data mining - different methods

- various measures of semantic distance and
similarity (e.g. the Jaccard index)

- frequency analysis based on time (are people
happier in the morning or in the evening?)

- frequency analysis based on location

- one of the results -> identification of opinion
makers in the social networks
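The Jaccard index mentioned above can be sketched in a few lines. This is only an illustration, assuming tweets are compared as sets of lowercase words; the function names and sample tweets are made up, and the slides do not say how the similarity was actually computed.

```python
def tokenize(text):
    """Lowercase a tweet and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard_index(a, b):
    """Return |A & B| / |A | B| for two sets; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Distance is the complement of the similarity."""
    return 1.0 - jaccard_index(a, b)


# Made-up sample tweets for illustration only.
tweet_a = "madonna plays in prague tonight"
tweet_b = "madonna in prague was great"

similarity = jaccard_index(tokenize(tweet_a), tokenize(tweet_b))
distance = jaccard_distance(tokenize(tweet_a), tokenize(tweet_b))
print(round(similarity, 3), round(distance, 3))
```

The two sample tweets share three of seven distinct words, so the index is 3/7.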
Transmission News
using different APIs to
get more information
Transmission News = 5 APIs in one
                  www.transnews.tw

   5x Twitter News Service accounts
   1x Yahoo Geo
   1x Google Search AJAX
   1x Google Maps
   1x Open Calais
   and a little bit of Wikipedia
www.transnews.tw
This brings us to the
downside of Twitter API
API searches are limited
in the number of
queries
Even worse, the search data
doesn't go back farther than
1.5 weeks into the past
Hence the development
of Sparrow 1.0
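A short search window means the history has to be collected as it happens. The following is a minimal sketch of such an archive, assuming each tweet arrives as a dictionary with a unique `id`. The table layout, function names, and sample polls are hypothetical, no real Twitter API call is made, and this is not a description of how Sparrow itself is built.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a local tweet archive; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, author TEXT, created_at TEXT, text TEXT)"
    )
    return conn


def count_tweets(conn):
    """Return the number of archived tweets."""
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]


def archive_batch(conn, tweets):
    """Store one polling result and return how many tweets were new."""
    before = count_tweets(conn)
    # INSERT OR IGNORE skips ids already stored by an earlier, overlapping poll.
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, author, created_at, text) "
        "VALUES (:id, :author, :created_at, :text)",
        tweets,
    )
    conn.commit()
    return count_tweets(conn) - before


# Two made-up hourly polls that overlap on tweet 2.
poll_1 = [
    {"id": 1, "author": "user_a", "created_at": "2009-06-01 10:05", "text": "Dobré ráno"},
    {"id": 2, "author": "user_b", "created_at": "2009-06-01 10:40", "text": "Hello Prague"},
]
poll_2 = [
    {"id": 2, "author": "user_b", "created_at": "2009-06-01 10:40", "text": "Hello Prague"},
    {"id": 3, "author": "user_c", "created_at": "2009-06-01 11:10", "text": "Ahoj"},
]

archive = open_archive()
new_in_poll_1 = archive_batch(archive, poll_1)
new_in_poll_2 = archive_batch(archive, poll_2)
print(new_in_poll_1, new_in_poll_2, count_tweets(archive))
```

Keying the table on the tweet id lets consecutive polls overlap without creating duplicates.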
Czech Twitter by the
numbers
Sparrow 1.0 - application methodology

- archives all tweets geo-located in the Czech Republic at
  hourly intervals via the Twitter API (since June 2009)

- automatically detects language

- identifies Czech tweets using a word-count dictionary

- compares Czech Twitter statistics with foreign
  countries' statistics
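Identifying a language with a word-count dictionary can be sketched roughly as follows. The word lists here are tiny illustrative samples and the threshold is an arbitrary assumption; the actual dictionary and rules used by Sparrow are not given in the slides.

```python
import re

# Tiny illustrative lists of common words; a real dictionary would be far larger.
COMMON_WORDS = {
    "cs": {"a", "je", "se", "na", "to", "že", "jsem", "ale", "jak", "už"},
    "sk": {"a", "je", "sa", "na", "to", "že", "som", "ale", "ako", "už"},
    "en": {"the", "is", "and", "to", "of", "in", "it", "that", "for", "you"},
}


def count_hits(text):
    """Count, per language, how many words of the text are in its word list."""
    words = re.findall(r"\w+", text.lower())
    return {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }


def guess_language(text, min_hits=2):
    """Return the best-scoring language, or None if the evidence is weak or tied."""
    ranked = sorted(count_hits(text).items(), key=lambda item: item[1], reverse=True)
    (best_lang, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if best_hits < min_hits or best_hits == second_hits:
        return None
    return best_lang


czech = guess_language("Dnes jsem se byl podívat na koncert, ale už je konec")
english = guess_language("I think that the concert is in Prague and it was great")
unknown = guess_language("lol")
print(czech, english, unknown)
```

Closely related languages such as Czech and Slovak share many short words, which is why the sketch returns no answer on a tie instead of guessing.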
Sparrow 1.0 - June 2009 stats
- about 700,000 tweets

- created by 10,628 unique users who enabled their
  geo-location (CZ) or tweeted in Czech
- 5,880 users tweeted at least once in Czech

- 2,424 Czech-writing users revealed their geo-location
  (usually about 30% of users do that)
How many Twitter users are in the Czech Republic?

    Between 6,000 and 8,000 users write in Czech

      1,000 to 2,000 users prefer English

                  There are about
         10,000 active Twitter users in the CR
What are the dynamics of Czech Twitter?

 Every four weeks the number of users with at
      least one tweet rises by about 25%


The number of active users rises by 3-5% each week


The absolute number of tweets also rises by about 25%
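Weekly and four-week growth rates can be put on the same scale by compounding. The sketch below only shows that conversion. The slide quotes the two rates for different measures (active users versus users with at least one tweet), so the results need not match, and the function names are illustrative.

```python
def compound(rate_per_period, periods):
    """Total growth after compounding a per-period rate over several periods."""
    return (1 + rate_per_period) ** periods - 1


def equivalent_rate(total_growth, periods):
    """Constant per-period rate that compounds to the given total growth."""
    return (1 + total_growth) ** (1 / periods) - 1


# 3-5% per week, compounded over four weeks.
four_week_low = compound(0.03, 4)
four_week_high = compound(0.05, 4)

# 25% per four weeks, expressed as a constant weekly rate.
weekly_from_25 = equivalent_rate(0.25, 4)

print(f"{four_week_low:.1%} to {four_week_high:.1%} per four weeks")
print(f"{weekly_from_25:.1%} per week")
```

Weekly growth of 3-5% compounds to roughly 12.6-21.6% over four weeks, while 25% per four weeks corresponds to about 5.7% per week.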
What characteristics do Czech tweets have?



2% are retweets (RT)
4% use a "#"
21.5% are replies or conversation
34.6% include a link
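Shares like these can be estimated with simple text rules. The sketch below is one plausible set of heuristics and not the method behind the figures above: it assumes a retweet starts with "RT @", a reply starts with "@", and a link starts with http:// or https://. The sample tweets are made up.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Require whitespace or start of text before '#', so URL fragments are not counted.
HASHTAG_RE = re.compile(r"(?<!\S)#\w+")


def tweet_features(text):
    """Flag the four characteristics for a single tweet using simple heuristics."""
    text = text.strip()
    return {
        "retweet": text.startswith("RT @"),
        "hashtag": HASHTAG_RE.search(text) is not None,
        "reply": text.startswith("@"),
        "link": URL_RE.search(text) is not None,
    }


def feature_shares(tweets):
    """Return the percentage of tweets showing each characteristic."""
    totals = {"retweet": 0, "hashtag": 0, "reply": 0, "link": 0}
    for text in tweets:
        for name, present in tweet_features(text).items():
            totals[name] += present
    if not tweets:
        return {name: 0.0 for name in totals}
    return {name: 100 * count / len(tweets) for name, count in totals.items()}


# Made-up sample tweets for illustration only.
sample = [
    "RT @novak: Madonna v Praze! #madonna",
    "@jana díky za tip",
    "Nový článek http://example.com/clanek",
    "Dobré ráno",
]
shares = feature_shares(sample)
print(shares)
```

A tweet can show several characteristics at once, so the shares do not have to add up to 100%.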
What languages do
people in the CR use for
tweeting?
Let's see that graph

[Pie chart. Legend: English, Czech, Slovak, German, others. Slice labels: 44%, 33%, 13%, 7%, 4%.]
Geo-location breakdown of tweets among big cities in the CR
                  (July-August 2009)

[Pie chart. Legend: 9 cities, Prague, others. Slice labels: 25%, 30%, 45%.]

1. Praha 247685x
en - 116580x ~ 47.07%
cs - 79957x ~ 32.28%
sk - 16449x ~ 6.64%

2. Brno 37021x
en - 16104x ~ 43.50%
cs - 14753x ~ 39.85%
sk - 3360x ~ 9.08%

3. Ostrava 23836x
en - 13885x ~ 58.25%
cs - 5306x ~ 22.26%
pl - 1638x ~ 6.87%

4. Plzeň 13681x
en - 9160x ~ 66.95%
cs - 2206x ~ 16.12%
fr - 417x ~ 3.05%

5. Olomouc 10754x
en - 4619x ~ 42.95%
cs - 3062x ~ 28.47%
pt - 999x ~ 9.29%

6. Liberec 14178x
en - 9561x ~ 67.44%
cs - 2864x ~ 20.20%
sk - 462x ~ 3.26%

7. České Budějovice 6219x
cs - 2589x ~ 41.63%
en - 1386x ~ 22.29%
es - 551x ~ 8.86%

8. Hradec Králové 11888x
cs - 4696x ~ 39.50%
en - 4400x ~ 37.01%
de - 1113x ~ 9.36%

9. Ústí nad Labem 12016x
en - 4266x ~ 35.50%
de - 2882x ~ 23.98%
cs - 2570x ~ 21.39%

10. Pardubice 5576x
cs - 2718x ~ 48.74%
en - 1831x ~ 32.84%
sk - 414x ~ 7.42%
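The percentages in the city breakdown are each language's tweet count divided by the city's total. A short sketch that recomputes them for the first two cities, using the counts from the slide (the variable and function names are illustrative):

```python
# (total tweets, tweets per language) as listed on the slide.
CITY_COUNTS = {
    "Praha": (247685, {"en": 116580, "cs": 79957, "sk": 16449}),
    "Brno": (37021, {"en": 16104, "cs": 14753, "sk": 3360}),
}


def language_shares(total, by_language):
    """Return each language's share of the city's tweets, in percent."""
    return {lang: round(100 * count / total, 2) for lang, count in by_language.items()}


city_shares = {
    city: language_shares(total, by_language)
    for city, (total, by_language) in CITY_COUNTS.items()
}
for city, result in city_shares.items():
    print(city, result)
```

The listed languages do not cover all of a city's tweets, so the shares for one city add up to less than 100%.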
And what about
"when?"
And why does it
matter?
This is what we've learned in a few months:

- Czechs tweet most often on Tuesday or Thursday, and
least often on Saturday
 Around the world the most popular day is Tuesday, and the
least popular is Sunday

- The number of tweets rises steadily from the beginning to
the end of the month, then falls and begins rising again.
That means people tweet more at the end of the month
than at the beginning.
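Findings like these come from counting tweets per time bucket. Below is a minimal sketch of a day-of-week count, assuming timestamps are stored as "YYYY-MM-DD HH:MM" strings. The sample timestamps are made up and do not come from the Sparrow data.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def tweets_per_weekday(timestamps):
    """Count tweets per day of the week from 'YYYY-MM-DD HH:MM' strings."""
    counts = Counter()
    for stamp in timestamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # weekday() returns 0 for Monday through 6 for Sunday.
        counts[WEEKDAYS[moment.weekday()]] += 1
    return counts


# Made-up sample timestamps for illustration only.
sample_times = [
    "2009-08-11 09:15",
    "2009-08-11 18:40",
    "2009-08-13 12:05",
    "2009-08-15 20:30",
]
per_day = tweets_per_weekday(sample_times)
busiest_day = per_day.most_common(1)[0][0]
print(dict(per_day), busiest_day)
```

The same counting works for hour of day or day of month by swapping `moment.weekday()` for `moment.hour` or `moment.day`.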
Predicting the present
Google vs. Twitter
MADONNA
IN PRAGUE
 13. 8. 2009
Madonna - August 2009 - Google search
Madonna - August 2009 - Czech Twitter
Sometimes Twitter is quicker & can predict future
                   searches
September 17th,
    Ostrava
Rammstein - August 2009 - Google search
Rammstein - August 2009 - Czech Twitter




                         17.9.2009
Thanks for your attention.
   Questions? Ideas?
   slerka@ataxo.com
