This document discusses applying data mining and visualization techniques to several types of data: social, text, spatial, and event data. Key techniques mentioned include word frequency analysis, word ranking, identifying the most frequent users of a word, automatically categorizing words into tags or categories using hidden Markov models, and predicting the probability of events occurring over time.
7. Word frequency
[Dashboard screenshot: a rank slider, a word search box, and panels for word rank, word use frequency, and most frequent users; the user list includes thierry, kristof, xi, lionel, christina, priya, philippe, monique, laurent, gilles, tanguy, stéphane, manon, and suzanne.]
8. Rank slider
[Dashboard screenshot: searching for "bi" suggests words such as bienvenue, bien, bisous, bienvenu, bientot, and bitume, alongside the word rank, word frequency, and word use frequency panels and the most frequent users of the selected word (maman, stéphanie, philippe, marielle, sophie, and others).]
9. Rank slider
[Dashboard screenshot: in people-search mode, typing "ch" suggests christina, charlene, charlotte, and chantal, with panels for person frequency, person rank, and the most used words with that person.]
10. Automatically categorize words
[Diagram: a hidden Markov chain that maps words to tags or categories. Hidden states y1, y2, y3 are connected by transition probabilities, and each state emits a word of the sentence "the dog barks" (followed by STOP) with an emission probability.]
23. Predicting whether and when something happens
[Plot: predicted probability of an event (y-axis, 0 to 1) against time (x-axis), with observed events and a cutoff line marked.]
Editor's Notes
#2: Student at the Institute for Advanced Analytics at NC State. I present my work by emphasizing what I believe is my strength, i.e. the combination of data mining and data visualization skills. I've tried to select projects that I believe relate to the work being done at the lab. The presentation is structured by data type: social data, text data, geo data, and finally event data.
#4: This project is a dynamic visualization of a network of relationships. In this case, a relationship is defined as joint citation of a person in any New York Times article. A seed is chosen (here, Qaddafi) and the program searches through the NYT's public APIs for all the people mentioned with Qaddafi in the same article. The NYT API enables you to specify the types of words to look for (key categories like people or organizations are automatically tagged in the articles).
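For illustration, here is a minimal Python sketch of this kind of co-citation query against the NYT Article Search API; the endpoint and the "persons" keyword facet exist in that API, but the seed, the API-key placeholder, and the helper function are assumptions for the example, not the project's actual code.

import requests

def co_mentioned_people(seed, api_key):
    """People tagged in NYT articles that also mention `seed`."""
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    params = {"fq": 'persons:("{}")'.format(seed), "api-key": api_key}
    docs = requests.get(url, params=params).json()["response"]["docs"]
    people = set()
    for doc in docs:
        for kw in doc.get("keywords", []):
            # Every other tagged person in the article becomes a graph neighbour.
            if kw["name"] == "persons" and kw["value"].lower() != seed.lower():
                people.add(kw["value"])
    return people

# Each returned name can in turn serve as a new seed to grow the network.
print(co_mentioned_people("Qaddafi, Muammar", "YOUR_API_KEY"))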
#5: Clicking on a name spawns, in turn, all of that name's connections, so you can interactively explore the network of relationships.
#6: This is another type of social data visualization, one that emphasizes social influence. It is a treemap of my Twitter followers where the size of someone's profile picture is proportional to the number of followers that person has. It's visually appealing because it uses profile pictures, and its meaning is straightforward: size is a measure of social influence. The only issue with this visualization is that some pictures have to be deformed so that the whole mosaic fits a rectangle, as the treemap requires. Such a treemap could be made interactive, whereby clicking on a profile picture updates the mosaic with that person's followers. This could be a really nice way of navigating the Twitter graph.
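A treemap layout like this can be sketched in a few lines of Python with the squarify library; the names and follower counts below are made up, and in the actual visualization each tile would be filled with a profile picture rather than a flat colour.

import matplotlib.pyplot as plt
import squarify  # pip install squarify

# Made-up follower counts: tile area encodes social influence.
followers = {"alice": 12000, "bob": 3400, "carol": 800, "dan": 150}

squarify.plot(sizes=list(followers.values()),
              label=list(followers.keys()), pad=True)
plt.axis("off")
plt.show()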
#8: Text messages over a year. This is a dashboard where you can search either by word or by person. The slider enables you to select a word rank (i.e. the first, second, third, etc. most used word), and for a selected word you can see who its most frequent user is. The graph plotting word frequency by word rank illustrates the famous Zipf's law, according to which the frequency of a word as a function of its rank follows a power law, meaning that just a few words make up the bulk of our vocabulary use.
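For instance, a quick Python check of Zipf's law on a message corpus looks like this (the corpus here is a made-up placeholder for the actual text messages):

from collections import Counter
import matplotlib.pyplot as plt

corpus = ["salut ca va", "salut bisous", "ca va bien"]  # placeholder messages
counts = Counter(word for msg in corpus for word in msg.split())

# Under Zipf's law, log(frequency) falls roughly linearly with log(rank).
freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs, marker="o")
plt.xlabel("Word rank")
plt.ylabel("Word frequency")
plt.show()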
#9: You can also search for a word by typing it, with real-time word suggestions.
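The suggestions can be as simple as a prefix lookup over the sorted vocabulary; a sketch with a made-up word list:

import bisect

vocab = sorted(["bien", "bienvenu", "bienvenue", "bientot", "bisous", "bitume"])

def suggest(prefix, k=5):
    """Return up to k vocabulary words starting with `prefix`."""
    i = bisect.bisect_left(vocab, prefix)  # index of first word >= prefix
    out = []
    while i < len(vocab) and vocab[i].startswith(prefix) and len(out) < k:
        out.append(vocab[i])
        i += 1
    return out

print(suggest("bi"))  # ['bien', 'bienvenu', 'bienvenue', 'bientot', 'bisous']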
#10: And you can also do a search by person. In that case, you see a ranking of the most exchanged words with that person. A fun application for such data is, for instance, to calculate the vocabulary size of your friends and see who is the most learned. That's what I did, but when I posted the results on the Facebook wall of my friend with the smallest vocabulary size, she stopped talking to me for two days, so I don't recommend that course of action.
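The vocabulary-size computation itself reduces to counting distinct words per sender; a minimal sketch, assuming the messages come as (sender, text) pairs:

from collections import defaultdict

messages = [("maman", "bienvenue bisous"), ("xi", "salut salut")]  # placeholders

vocab_by_person = defaultdict(set)
for sender, text in messages:
    vocab_by_person[sender].update(text.lower().split())

# Vocabulary size = number of distinct words each person has used.
for sender, vocab in sorted(vocab_by_person.items(), key=lambda kv: -len(kv[1])):
    print(sender, len(vocab))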
#11: This slide illustrates an algorithm I've been implementing in Python to automatically categorize the words in a sentence. Basically, the model assumes that the categories of the previous words in a sequence are predictive of the category of the next word, so you choose the category or tag that is both most likely to follow a given sequence of tags and most likely to be associated with the word. Such algorithms can be really useful in sentiment analysis: you could adapt the model to tag the sentiment associated with a word. This can be applied to assess the mood of citizens based on online, real-time text data (typically Twitter).
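A minimal sketch of the underlying idea, Viterbi decoding over a first-order hidden Markov model; the tag set and the transition and emission probabilities below are toy values invented for the "the dog barks" example on the slide, not the trained model.

# Toy HMM for tagging "the dog barks"; all probabilities are illustrative.
states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {"DET": {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.4, "NOUN": 0.4, "VERB": 0.2}}
emit = {"DET": {"the": 0.9},
        "NOUN": {"dog": 0.5, "barks": 0.1},
        "VERB": {"dog": 0.1, "barks": 0.6}}

def viterbi(words):
    """Most likely tag sequence under the transition and emission model."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start[t] * emit[t].get(words[0], 1e-9), [t]) for t in states}
    for w in words[1:]:
        best = {t: max(((p * trans[s][t] * emit[t].get(w, 1e-9), path + [t])
                        for s, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in states}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']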
#12: Spatial data is key in analyzing the life of cities. Here I present a series of experimental visualizations of spatial data.
#14: Population density. The idea here is that of deformed maps, i.e. maps that you deform to represent a more abstract reality. I wanted to visualize population density, so my first thought was: when something is dense, it is heavier; if blown by the wind, the light will fly off further than the dense. Hence this 3D visualization, except that, here, denser countries fly off higher than less dense countries.
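A minimal sketch of the "fly-off" effect with matplotlib; the coordinates and densities below are placeholders, not real country data.

import matplotlib.pyplot as plt

# Placeholder (longitude, latitude, population density) per country.
countries = {"A": (2.3, 48.8, 120.0), "B": (13.4, 52.5, 230.0), "C": (-3.7, 40.4, 90.0)}

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for name, (lon, lat, density) in countries.items():
    z = density / 10.0  # lift height proportional to density
    ax.plot([lon, lon], [lat, lat], [0.0, z])  # line rising from the map plane
    ax.text(lon, lat, z, name)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_zlabel("Density (scaled)")
plt.show()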
#16: This project illustrates the application of geostatistics using the ArcGIS software. It represents the spread of dengue fever in the village of Pennathur, India, in 2001. More particularly, the goal was to determine if there was a clustering phenomenon. The color of the dots reflects the significance of the clustering: the darker, the more clustered. The idea is simply to compare the actual distribution of the points to a random distribution: if the number of events found within a certain distance is greater than what we would expect under a random distribution, then the distribution is clustered, and a statistical test checks whether the departure from randomness is significant. Being able to determine the significance of a spatial clustering phenomenon can have many applications for analyzing the life of cities: do certain types of people or certain events (like crimes) happen in a clustered manner? Why? In practice, thousands of random distributions are generated and, for each, a measure of spatial distribution is calculated. From all these simulations one takes the highest and lowest values; these form the envelope, and an observed value beyond that envelope is significant.
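A minimal Python sketch of such a Monte Carlo envelope test, using the average nearest-neighbour distance as the clustering statistic (the point data is randomly generated for illustration; the ArcGIS tooling uses its own statistics):

import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbour."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def envelope_test(points, n_sim=999):
    """Compare the observed statistic to an envelope of random simulations."""
    observed = mean_nn_distance(points)
    sims = [mean_nn_distance([(random.random(), random.random())
                              for _ in points]) for _ in range(n_sim)]
    # Observed below the envelope => points closer than chance => clustered.
    return observed, (min(sims), max(sims)), observed < min(sims)

pts = [(random.gauss(0.5, 0.05), random.gauss(0.5, 0.05)) for _ in range(50)]
print(envelope_test(pts))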
#17: In this project, I used the app OpenPaths to track my location every 3 minutes through my phone. After a few months, I downloaded the data and used the Processing library UnfoldingMaps (by one of your former researchers) to visualize my spatial patterns over time using tile-based maps.
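The exported data boils down to timestamped coordinates. A sketch of loading and plotting such a track in Python, assuming a CSV export with 'lat' and 'lon' columns (the file name and column names are assumptions):

import csv
import matplotlib.pyplot as plt

lats, lons = [], []
with open("openpaths_export.csv") as f:
    for row in csv.DictReader(f):
        lats.append(float(row["lat"]))
        lons.append(float(row["lon"]))

# Raw track on a blank canvas; a tile-based library such as UnfoldingMaps
# would draw the same points over map tiles instead.
plt.plot(lons, lats, linewidth=0.3, alpha=0.7)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()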
#18: Such projects can be useful to create more personalized maps: one could think of deforming maps, i.e. changing the distances between points to reflect a new spatial relationship between those points, like the time to get there or the frequency of travelling from one point to the other. Such projects can also be useful to build predictive models of where someone will be in the future. Another idea is defining new, more natural borders by clustering observations according to real-world data instead of arbitrary political boundaries: if we notice that a lot of people tend to overuse certain areas of the city while other people focus on other areas, we could create boundaries that really separate areas based on people's use of them. We could also use the spatial patterns to measure the degree of clustering in a person's movements: highly spatially clustered individuals vs. highly spatially dispersed individuals.
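One simple measure of that last distinction is the radius of gyration: the root-mean-square distance of a person's position fixes from their centre of mass. A sketch with made-up coordinates:

import math

def radius_of_gyration(points):
    """RMS distance of position fixes from their centre of mass.

    Small values = spatially clustered individual; large = dispersed.
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2
                         for x, y in points) / len(points))

home_body = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
traveller = [(0.0, 0.0), (5.0, 2.0), (9.0, 7.0), (3.0, 8.0)]
print(radius_of_gyration(home_body))  # small: clustered
print(radius_of_gyration(traveller))  # large: dispersed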
#19: Users can switch between various visualization modes to get different perspectives on the same data set: for example, altitude vs. (lat, lon).