
Twitter Sentiment Analysis
Pagliaro Alessandro

Sentiment Analysis Workflow
Twitter Crawler

To download tweets "legally", three kinds of API are available.
They are accessed with keys generated when registering with
Twitter, following the OAuth protocol.
∗ REST API: retrieves information about one's own profile
(tweets, followers, account info) with no time limit.
∗ Search API: searches for any tweet, but the search index
only reaches back about 7 days.
∗ Streaming API: collects all the tweets that are posted,
in real time.
About Twitter

Tweepy is a Python library built specifically for the Twitter APIs
Source 1: Search API

∗ The official Twitter APIs do not give access to tweets older
than about one week. Some third-party tools (Gnip) give access
to the full Twitter index, but at a cost proportional to the
number of tweets downloaded.
∗ To avoid all these limitations, a crawler was used.
The program relies on the ordinary search available through
Twitter Search in the browser: given a query, scrolling the page
makes a JSON provider return the tweets in order of publication,
with no time limit.
Source 2: Twitter Crawler
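
The scroll-driven pagination used by the crawler can be sketched as a loop that keeps requesting the next batch until the provider reports that nothing is left. This is a minimal sketch, not the crawler actually used: `fetch_page`, the `items` and `next_cursor` keys, and the cursor mechanism itself are hypothetical stand-ins for whatever the search page's JSON provider returns.

```python
from typing import Callable, Iterator, Optional


def crawl(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 1000) -> Iterator[dict]:
    """Yield tweets batch after batch, following the cursor returned by each page.

    fetch_page(cursor) is a hypothetical callable returning a dict shaped like
    {"items": [...], "next_cursor": "..." or None}; one call stands in for one scroll.
    """
    cursor = None
    for _ in range(max_pages):  # hard cap so a misbehaving provider cannot loop forever
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:  # no further cursor: the result list is exhausted
            break


# Stub provider used only to demonstrate the loop (no network access).
FAKE_PAGES = {
    None: {"items": [{"id": 3}, {"id": 2}], "next_cursor": "c1"},
    "c1": {"items": [{"id": 1}], "next_cursor": None},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


tweets = list(crawl(fake_fetch))
```

Passing the fetch function in as a parameter keeps the loop testable without touching the network.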

∗ Queries by hashtag and mention (and also by emoticon) about
the candidates
∗ Only selected fields (tweetid, username, text, date, etc.)
are extracted from the tweets' JSON objects
∗ The results are stored in a CSV file, which defines the Dataset
Tweets
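
The field selection and CSV storage steps might look like the following sketch. The column list and the `retweet_count` key are assumptions made for illustration; the slides only name tweetid, username, text and date.

```python
import csv
import io
import json

# Columns kept for the dataset; the exact list is an assumption.
FIELDS = ["tweet_id", "username", "text", "date", "retweets"]


def tweet_to_row(tweet: dict) -> dict:
    """Project a tweet JSON object onto the few fields stored in the dataset."""
    return {
        "tweet_id": tweet["id"],
        "username": tweet["user"]["screen_name"],
        "text": tweet["text"].replace("\n", " "),  # keep one tweet per CSV line
        "date": tweet["created_at"],
        "retweets": tweet.get("retweet_count", 0),  # assumed key, defaults to 0
    }


def write_dataset(tweets, out) -> None:
    """Write a header plus one CSV row per tweet to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))


raw = (
    '{"id": 12296272736, "text": "An early look at Annotations", '
    '"created_at": "Fri Apr 16 17:55:46 +0000 2010", '
    '"user": {"screen_name": "twitterapi"}}'
)
buffer = io.StringIO()
write_dataset([json.loads(raw)], buffer)
csv_text = buffer.getvalue()
```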

JSON Tweets
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil,
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
∗ The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc).
∗ Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he).
∗ Tweet's creation date.
∗ DEPRECATED
∗ The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned.
∗ The screen name & user ID of replied to tweet author.
∗ Truncated to 140 characters. Only possible from SMS.
∗ The author of the tweet. This embedded object can get out of sync.
∗ The author's user ID.
∗ The author's user name.
∗ The author's screen name.
∗ The author's biography.
∗ The author's URL.
∗ The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded.
∗ Rendering information for the author. Colors are encoded in hex values (RGB).
∗ The creation date for this account.
∗ Whether this account has contributors enabled (http://bit.ly/50npuu).
∗ Number of favorites this user has.
∗ Number of tweets this user has.
∗ Number of users this user is following.
∗ The timezone and offset (in seconds) for this user.
∗ The user's selected language.
∗ Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends".
∗ Number of followers for this user.
∗ Whether this user has geo enabled (http://bit.ly/4pFY77).
∗ DEPRECATED in this context
∗ Whether this user has a verified badge.
∗ The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp).
∗ The contributors' (if any) user IDs (http://bit.ly/50npuu).
∗ DEPRECATED
∗ The place associated with this Tweet (http://bit.ly/b8L1Cp).
∗ The place ID
∗ The URL to fetch a detailed polygon for this place
∗ The printable names of this place
∗ The type of this place - can be a "neighborhood" or "city"
∗ The country this place is in
∗ The bounding box for this place
∗ The application that sent this tweet
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>
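
Since the analysis groups tweets by day, the `created_at` string of a status object has to be parsed into a date at some point. A minimal sketch of that step follows; the helper names are hypothetical, and `%a`/`%b` assume an English (C) locale.

```python
from datetime import datetime, timezone

# Format of the "created_at" field, e.g. "Fri Apr 16 17:55:46 +0000 2010".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


def day_key(value: str) -> str:
    """Return the UTC calendar day (YYYY-MM-DD) used to group tweets per day."""
    return parse_created_at(value).astimezone(timezone.utc).strftime("%Y-%m-%d")


created = parse_created_at("Fri Apr 16 17:55:46 +0000 2010")
```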

∗ HBase is a distributed, column-oriented database that runs on
top of HDFS.
∗ Its columnar data model is similar to Google Bigtable, and it is
designed to provide fast read/write access to a huge amount of
data stored in HDFS (Hadoop Distributed File System).
∗ HBase provides linear horizontal scalability.
∗ It has automatic failover mechanisms (reliability, availability).
∗ It guarantees consistent reads and writes (timestamp).
∗ It is written in Java and can be used through Java client APIs.
∗ It provides data replication mechanisms (HDFS, replication factor 3).
∗ It was created because HDFS only allows sequential access to data.
HBase

HBase on HDFS
Pseudo-Distributed Mode
∗ The HBase-HDFS architecture of a distributed system is simulated
on a single node.

∗ Master-Slave: it is built on top of HDFS, adapts to its
architecture and exploits its scale-out and reliability benefits
(HMaster-NameNode and HRegionServer-DataNode).
∗ Random access: it uses in-memory (MemStore) and logging
mechanisms to guarantee fast and reliable reads and writes.
∗ Column-oriented data model: tables are stored by column
family in each HStore, so a data model can be designed to
exploit the advantages of this storage technique.
∗ Auto-sharding: within each HRegionServer, tables are stored
by column family in sorted row_id ranges inside HRegions;
as these fill up, new ones are created and the data is
redistributed across them.
HBase: Pros
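
The auto-sharding idea (sorted row-key ranges that split when they fill up) can be illustrated with a toy in-memory model. This is a simplification for illustration only, not HBase code: `Region`, `ToyTable` and the row-count threshold are hypothetical, and real regions split on size in bytes rather than on a row count.

```python
import bisect


class Region:
    """A contiguous, sorted range of row keys (toy stand-in for an HRegion)."""

    def __init__(self, rows=None):
        self.rows = rows or []  # sorted list of (row_key, value) pairs

    @property
    def start_key(self):
        return self.rows[0][0] if self.rows else ""


class ToyTable:
    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.regions = [Region()]  # kept ordered by start key

    def _region_index(self, row_key):
        """Find the region whose key range should contain row_key."""
        starts = [region.start_key for region in self.regions]
        return max(bisect.bisect_right(starts, row_key) - 1, 0)

    def put(self, row_key, value):
        index = self._region_index(row_key)
        region = self.regions[index]
        keys = [key for key, _ in region.rows]
        pos = bisect.bisect_left(keys, row_key)
        if pos < len(keys) and keys[pos] == row_key:
            region.rows[pos] = (row_key, value)  # overwrite an existing row
        else:
            region.rows.insert(pos, (row_key, value))  # keep rows sorted
        if len(region.rows) > self.max_rows:
            # Region is full: split it in the middle into two new regions.
            mid = len(region.rows) // 2
            self.regions[index:index + 1] = [
                Region(region.rows[:mid]),
                Region(region.rows[mid:]),
            ]


table = ToyTable(max_rows_per_region=4)
for i in range(10):
    table.put(f"row{i:02d}", f"tweet {i}")
all_keys = [key for region in table.regions for key, _ in region.rows]
```

Because every region holds a sorted, non-overlapping key range, concatenating the regions in order yields all rows in row-key order, which is what makes range scans cheap.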

∗ HBase provides 5 basic data operations: Put, Get,
Delete, Scan and Increment.
∗ HBase provides Java APIs as well as Thrift, Avro and
REST APIs and the Shell (not SQL-like).
∗ Java API: data was inserted with the Put class after
creating the HTable, and read with Scan (with a filter
set on the date, to obtain the daily tweets to iterate
over).
HBase client interface
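
The put-then-scan-with-a-date-filter access pattern can be mimicked with a small in-memory stand-in. This is not the HBase Java client API: `ToyHTable`, `on_day`, the `tweet:` column family and the sample rows are all hypothetical and only illustrate the pattern.

```python
from datetime import date


class ToyHTable:
    """In-memory stand-in for a table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, columns):
        """Insert or update the given columns of one row."""
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        """Return a copy of one row, or an empty dict if it does not exist."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, row_filter=None):
        """Yield rows in row-key order, keeping only those accepted by the filter."""
        for row_key in sorted(self._rows):
            columns = self._rows[row_key]
            if row_filter is None or row_filter(row_key, columns):
                yield row_key, dict(columns)


def on_day(day):
    """Build a filter that keeps rows whose tweet:date column equals the given day."""
    wanted = day.isoformat()
    return lambda row_key, columns: columns.get("tweet:date") == wanted


table = ToyHTable()
table.put("1001", {"tweet:text": "first sample tweet", "tweet:date": "2016-05-27"})
table.put("1002", {"tweet:text": "second sample tweet", "tweet:date": "2016-05-28"})
table.put("1003", {"tweet:text": "third sample tweet", "tweet:date": "2016-05-27"})

daily_keys = [row_key for row_key, _ in table.scan(on_day(date(2016, 5, 27)))]
```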

∗ The vocabulary was taken from the GitHub repository
of the OpeNER project (Open Polarity Enhanced
Named Entity Recognition), which aims to support
natural language processing tools.
∗ https://github.com/opener-project/public-sentiment-lexicons/t
∗ Using a MapReduce WordCount on the Dataset
containing the tweets, the most used hashtags and
emoticons were extracted and added to the Vocabulary.
Sentiment: VocabolarioIta
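
The WordCount step can be sketched in plain Python as a map function that emits (token, 1) pairs and a reduce function that sums them. This is only an illustration of the idea, not the Hadoop job: the regular expression, the small set of recognized emoticons and the sample tweets are assumptions.

```python
import re
from collections import Counter
from itertools import chain

# Hashtags, plus a small assumed set of emoticons such as :) ;-) :( :D
TOKEN_PATTERN = re.compile(r"#\w+|[:;]-?[()DP]")


def map_tweet(text):
    """Map step: emit (token, 1) for every hashtag or emoticon in one tweet."""
    pairs = []
    for token in TOKEN_PATTERN.findall(text):
        if token.startswith("#"):
            token = token.lower()  # count #Napoli and #napoli together
        pairs.append((token, 1))
    return pairs


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each token."""
    totals = Counter()
    for token, count in pairs:
        totals[token] += count
    return totals


sample_tweets = ["Forza #Napoli :)", "#napoli al voto :)", "che delusione :( #Napoli"]
counts = reduce_counts(chain.from_iterable(map_tweet(t) for t in sample_tweets))
top_tokens = counts.most_common(2)
```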

∗ Apache OpenNLP is an open-source machine learning
library for processing natural-language text.
∗ Some of the features it supports are:
∗ Tokenization
∗ Sentence segmentation
∗ Part-of-speech tagging
∗ Named entity extraction
∗ Document categorizer
Sentiment: OpenNLP

∗ This feature classifies text into predefined categories by means
of a maximum entropy (MaxEnt) algorithm.
1. In information theory, entropy measures the uncertainty of the
information content: the more uncertainty, the more information.
2. The algorithm therefore requires training the model on a
labeled training set (VocabolarioIta) with a good level of entropy.
3. During training, some parameters can be tuned
(cutoff, iterations).
4. Once trained, the model can classify a single tweet
(positive, negative).
openNLP – Document Categorizer
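
To make the roles of cutoff and iterations concrete, here is a minimal two-class maximum-entropy sketch. With two classes a MaxEnt model is equivalent to logistic regression, and this version trains it with plain gradient ascent, which is a swapped-in technique and not necessarily the optimizer OpenNLP uses. All function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter


def _probability(weights, bias, active):
    """Probability of the 'positive' class given the active features."""
    score = bias + sum(weights[token] for token in active)
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, iterations=100, cutoff=1, learning_rate=0.5):
    """Train on (label, tokens) pairs.

    Features seen in fewer than `cutoff` samples are dropped; `iterations`
    is the number of gradient-ascent passes over the training set.
    """
    frequency = Counter(token for _, tokens in samples for token in set(tokens))
    weights = {token: 0.0 for token, n in frequency.items() if n >= cutoff}
    bias = 0.0
    for _ in range(iterations):
        for label, tokens in samples:
            target = 1.0 if label == "positive" else 0.0
            active = [t for t in set(tokens) if t in weights]
            error = target - _probability(weights, bias, active)
            bias += learning_rate * error  # gradient of the log-likelihood
            for token in active:
                weights[token] += learning_rate * error
    return weights, bias


def classify(model, tokens):
    """Return (label, probability of 'positive') for a tokenized tweet."""
    weights, bias = model
    active = [t for t in set(tokens) if t in weights]
    p = _probability(weights, bias, active)
    return ("positive" if p >= 0.5 else "negative"), p


# Made-up training entries, standing in for a labeled lexicon.
training = [
    ("positive", ["ottimo"]),
    ("positive", ["bravo"]),
    ("positive", ["grande", "sindaco"]),
    ("negative", ["vergogna"]),
    ("negative", ["pessimo"]),
    ("negative", ["pessimo", "sindaco"]),
]
model = train_maxent(training, iterations=100, cutoff=1)
label, _ = classify(model, ["ottimo", "sindaco"])
```

Raising the cutoff shrinks the feature set to tokens seen often enough to be trusted, while more iterations let the weights move further toward fitting the training set.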

Sentiment: Algorithm
∗ Selecting the tweets for each candidate and each day, the following
were computed:
∗ A non-normalized sentiment weighted by retweets
∗ A normalized sentiment weighted by retweets
∗ If there are no retweets, the formula reduces to
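
The formulas themselves are not reproduced in this transcript. As a purely illustrative sketch, assume each classified tweet carries a polarity of +1 or -1 and counts once plus once per retweet. Both are assumptions, not the author's stated definition. Under them the two scores could be computed as below, and with no retweets the normalized score reduces to (positives - negatives) / (positives + negatives).

```python
def daily_sentiment(tweets):
    """Compute (non-normalized, normalized) sentiment for one candidate and day.

    tweets: iterable of (polarity, retweets) with polarity +1 or -1.
    The weight 1 + retweets is an assumption made for this sketch.
    """
    weighted_sum = 0
    total_weight = 0
    for polarity, retweets in tweets:
        weight = 1 + retweets
        weighted_sum += polarity * weight
        total_weight += weight
    normalized = weighted_sum / total_weight if total_weight else 0.0
    return weighted_sum, normalized


with_retweets = [(+1, 3), (-1, 0), (+1, 0)]
raw_score, normalized_score = daily_sentiment(with_retweets)

no_retweets = [(+1, 0), (+1, 0), (-1, 0)]
raw_plain, normalized_plain = daily_sentiment(no_retweets)
```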

MicroStrategy
∗ MicroStrategy is a provider of BI solutions, mobile software and
cloud-based services.
∗ In Gartner's Magic Quadrant it is placed among the Visionaries.
∗ It provides only "pure" BI solutions.
∗ It is based on a relational OLAP (ROLAP) system that lets
users analyze the entire relational database at all [...]
∗ Its BI platform delivers interactive dashboards, scorecards,
highly formatted reports, ad hoc queries, thresholds and
alerts.

Results: Influential Events
∗ To analyze the results, one can refer to a series of events
that may have swayed the sentiment one way or the other
∗ May 9 - De Magistris: "They tore off my robe, but they cannot
tear out my soul. Renzi, go home! You should be afraid! You should be
shitting yourself."
∗ May 25/30 - http://www.liberiamonapoli.it/pure-de-magistris-tiene-famiglia/
∗ May 27 - Berlusconi in Naples for Lettieri
∗ June 3 - Renzi in Naples for Valeria Valente

