This document discusses applying data mining and visualization techniques to several types of data: social, text, spatial, and event data. Key techniques mentioned include word frequency analysis, word ranking, identifying the most frequent users of a word, automatically categorizing words into tags or categories using hidden Markov models, and predicting the probability of events occurring over time.
7. Word frequency
[Dashboard screenshot: a rank slider, a word search box, and panels for word rank, word use frequency, and most frequent users; the user list includes thierry, kristof, xi, lionel, christina, priya, philippe, monique, laurent, gilles, tanguy, stéphane, manon, and suzanne.]
8. Rank slider
[Dashboard screenshot: searching for "bi" suggests words such as bienvenue, bien, bisous, bienvenu, bientot, and bitume, alongside the word rank, word frequency, and word use frequency panels and the most frequent users of the selected word (maman, stéphanie, philippe, marielle, sophie, and others).]
9. Rank slider
[Dashboard screenshot: in people-search mode, typing "ch" suggests christina, charlene, charlotte, and chantal, with panels for person frequency, person rank, and the most used words with that person.]
10. Automatically categorize words
[Diagram: a hidden Markov chain that maps words to tags or categories. Hidden states y1, y2, y3 are connected by transition probabilities, and each state emits a word of the sentence "the dog barks" (followed by STOP) with an emission probability.]
23. Predicting whether and when something happens
[Plot: predicted probability of an event (y-axis, 0 to 1) against time (x-axis), with observed events and a cutoff line marked.]
Editor's Notes
#2: Student at the Institute for Advanced Analytics at NC State. I present my work by emphasizing what I believe is my strength, i.e. the combination of data mining and data visualization skills. I've tried to select projects that I believe relate to the work being done at the lab. The presentation is structured by data type: social data, text data, geo data, and finally event data.
#4: This project is a dynamic visualization of a network of relationships. In this case, a relationship is defined as joint citation of a person in any New York Times article. A seed is chosen (here, Qaddafi) and the program searches through the NYT's public APIs for all the people mentioned with Qaddafi in the same article. The NYT API enables you to specify the types of words to look for (key categories like people or organizations are automatically tagged in the articles).
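For illustration, here is a minimal Python sketch of this kind of co-citation query against the NYT Article Search API; the endpoint and the "persons" keyword facet exist in that API, but the seed, the API-key placeholder, and the helper function are assumptions for the example, not the project's actual code.

import requests

def co_mentioned_people(seed, api_key):
    """People tagged in NYT articles that also mention `seed`."""
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    params = {"fq": 'persons:("{}")'.format(seed), "api-key": api_key}
    docs = requests.get(url, params=params).json()["response"]["docs"]
    people = set()
    for doc in docs:
        for kw in doc.get("keywords", []):
            # Every other tagged person in the article becomes a graph neighbour.
            if kw["name"] == "persons" and kw["value"].lower() != seed.lower():
                people.add(kw["value"])
    return people

# Each returned name can in turn serve as a new seed to grow the network.
print(co_mentioned_people("Qaddafi, Muammar", "YOUR_API_KEY"))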
#5: Clicking on a name spawns, in turn, all of that name's connections, so you can interactively explore the network of relationships.
#6: This is another type of social data visualization, one that emphasizes social influence. It is a treemap of my Twitter followers where the size of someone's profile picture is proportional to the number of followers that person has. It's visually appealing because it uses profile pictures, and its meaning is straightforward: size is a measure of social influence. The only issue with this visualization is that some pictures have to be deformed so that the whole mosaic fits a rectangle, as the treemap requires. Such a treemap could be made interactive, whereby clicking on a profile picture updates the mosaic with that person's followers. This could be a really nice way of navigating the Twitter graph.
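A treemap layout like this can be sketched in a few lines of Python with the squarify library; the names and follower counts below are made up, and in the actual visualization each tile would be filled with a profile picture rather than a flat colour.

import matplotlib.pyplot as plt
import squarify  # pip install squarify

# Made-up follower counts: tile area encodes social influence.
followers = {"alice": 12000, "bob": 3400, "carol": 800, "dan": 150}

squarify.plot(sizes=list(followers.values()),
              label=list(followers.keys()), pad=True)
plt.axis("off")
plt.show()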
#8: Text messages over a year. This is a dashboard where you can search either by word or by person. The slider enables you to select a word rank (i.e. the first, second, third, etc. most used word), and for a selected word you can see who its most frequent user is. The graph plotting word frequency by word rank illustrates the famous Zipf's law, according to which the frequency of a word as a function of its rank follows a power law, meaning that just a few words make up the bulk of our vocabulary use.
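For instance, a quick Python check of Zipf's law on a message corpus looks like this (the corpus here is a made-up placeholder for the actual text messages):

from collections import Counter
import matplotlib.pyplot as plt

corpus = ["salut ca va", "salut bisous", "ca va bien"]  # placeholder messages
counts = Counter(word for msg in corpus for word in msg.split())

# Under Zipf's law, log(frequency) falls roughly linearly with log(rank).
freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs, marker="o")
plt.xlabel("Word rank")
plt.ylabel("Word frequency")
plt.show()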
#9: You can also search for a word by typing it, with real-time word suggestions.
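The suggestions can be as simple as a prefix lookup over the sorted vocabulary; a sketch with a made-up word list:

import bisect

vocab = sorted(["bien", "bienvenu", "bienvenue", "bientot", "bisous", "bitume"])

def suggest(prefix, k=5):
    """Return up to k vocabulary words starting with `prefix`."""
    i = bisect.bisect_left(vocab, prefix)  # index of first word >= prefix
    out = []
    while i < len(vocab) and vocab[i].startswith(prefix) and len(out) < k:
        out.append(vocab[i])
        i += 1
    return out

print(suggest("bi"))  # ['bien', 'bienvenu', 'bienvenue', 'bientot', 'bisous']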
#10: And you can also do a search by person. In that case, you see a ranking of the most exchanged words with that person. A fun application for such data is, for instance, to calculate the vocabulary size of your friends and see who is the most learned. That's what I did, but when I posted the results on the Facebook wall of my friend with the smallest vocabulary size, she stopped talking to me for two days, so I don't recommend that course of action.
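The vocabulary-size computation itself reduces to counting distinct words per sender; a minimal sketch, assuming the messages come as (sender, text) pairs:

from collections import defaultdict

messages = [("maman", "bienvenue bisous"), ("xi", "salut salut")]  # placeholders

vocab_by_person = defaultdict(set)
for sender, text in messages:
    vocab_by_person[sender].update(text.lower().split())

# Vocabulary size = number of distinct words each person has used.
for sender, vocab in sorted(vocab_by_person.items(), key=lambda kv: -len(kv[1])):
    print(sender, len(vocab))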
#11: This slide illustrates an algorithm I've been implementing in Python to automatically categorize the words in a sentence. Basically, the model assumes that the categories of the previous words in a sequence are predictive of the category of the next word, so you choose the category or tag that is both most likely to follow a given sequence of tags and most likely to be associated with the word. Such algorithms can be really useful in sentiment analysis: you could adapt the model to tag the sentiment associated with a word. This can be applied to assess the mood of citizens based on online, real-time text data (typically Twitter).
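A minimal sketch of the underlying idea, Viterbi decoding over a first-order hidden Markov model; the tag set and the transition and emission probabilities below are toy values invented for the "the dog barks" example on the slide, not the trained model.

# Toy HMM for tagging "the dog barks"; all probabilities are illustrative.
states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {"DET": {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.4, "NOUN": 0.4, "VERB": 0.2}}
emit = {"DET": {"the": 0.9},
        "NOUN": {"dog": 0.5, "barks": 0.1},
        "VERB": {"dog": 0.1, "barks": 0.6}}

def viterbi(words):
    """Most likely tag sequence under the transition and emission model."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start[t] * emit[t].get(words[0], 1e-9), [t]) for t in states}
    for w in words[1:]:
        best = {t: max(((p * trans[s][t] * emit[t].get(w, 1e-9), path + [t])
                        for s, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in states}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']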
#12: Spatial data is key in analyzing the life of cities. Here I present a series of experimental visualizations of spatial data.
#14: Population density. The idea here is that of deformed maps, i.e. maps that you deform to represent a more abstract reality. I wanted to visualize population density, so my first thought was: when something is dense, it is heavier; if blown by the wind, the light will fly off further than the dense. Hence this 3D visualization, except that, here, denser countries fly off higher than less dense countries.
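A minimal sketch of the "fly-off" effect with matplotlib; the coordinates and densities below are placeholders, not real country data.

import matplotlib.pyplot as plt

# Placeholder (longitude, latitude, population density) per country.
countries = {"A": (2.3, 48.8, 120.0), "B": (13.4, 52.5, 230.0), "C": (-3.7, 40.4, 90.0)}

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for name, (lon, lat, density) in countries.items():
    z = density / 10.0  # lift height proportional to density
    ax.plot([lon, lon], [lat, lat], [0.0, z])  # line rising from the map plane
    ax.text(lon, lat, z, name)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_zlabel("Density (scaled)")
plt.show()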
#16: This project illustrates the application of geostatistics using the ArcGIS software. It represents the spread of dengue fever in the village of Pennathur, India, in 2001. More particularly, the goal was to determine if there was a clustering phenomenon. The color of the dots reflects the significance of the clustering: the darker, the more clustered. The idea is simply to compare the actual distribution of the points to a random distribution: if the number of events found within a certain distance is greater than what we would expect under a random distribution, then the distribution is clustered, and a statistical test checks whether the departure from randomness is significant. Being able to determine the significance of a spatial clustering phenomenon can have many applications for analyzing the life of cities: do certain types of people or certain events (like crimes) happen in a clustered manner? Why? In practice, thousands of random distributions are generated and, for each, a measure of spatial distribution is calculated. From all these simulations one takes the highest and lowest values; these form the envelope, and an observed value beyond that envelope is significant.
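A minimal Python sketch of such a Monte Carlo envelope test, using the average nearest-neighbour distance as the clustering statistic (the point data is randomly generated for illustration; the ArcGIS tooling uses its own statistics):

import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbour."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def envelope_test(points, n_sim=999):
    """Compare the observed statistic to an envelope of random simulations."""
    observed = mean_nn_distance(points)
    sims = [mean_nn_distance([(random.random(), random.random())
                              for _ in points]) for _ in range(n_sim)]
    # Observed below the envelope => points closer than chance => clustered.
    return observed, (min(sims), max(sims)), observed < min(sims)

pts = [(random.gauss(0.5, 0.05), random.gauss(0.5, 0.05)) for _ in range(50)]
print(envelope_test(pts))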
#17: In this project, I used the app OpenPaths to track my location every 3 minutes through my phone. After a few months, I downloaded the data and used the Processing library UnfoldingMaps (by one of your former researchers) to visualize my spatial patterns over time using tile-based maps.
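The exported data boils down to timestamped coordinates. A sketch of loading and plotting such a track in Python, assuming a CSV export with 'lat' and 'lon' columns (the file name and column names are assumptions):

import csv
import matplotlib.pyplot as plt

lats, lons = [], []
with open("openpaths_export.csv") as f:
    for row in csv.DictReader(f):
        lats.append(float(row["lat"]))
        lons.append(float(row["lon"]))

# Raw track on a blank canvas; a tile-based library such as UnfoldingMaps
# would draw the same points over map tiles instead.
plt.plot(lons, lats, linewidth=0.3, alpha=0.7)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()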
#18: Such projects can be useful to create more personalized maps: one could think of deforming maps, i.e. changing the distances between points to reflect a new spatial relationship between those points, like the time to get there or the frequency of travelling from one point to the other. Such projects can also be useful to build predictive models of where someone will be in the future. Another idea is defining new, more natural borders by clustering observations according to real-world data instead of arbitrary political boundaries: if we notice that a lot of people tend to overuse certain areas of the city while other people focus on other areas, we could create boundaries that really separate areas based on people's use of them. We could also use the spatial patterns to measure the degree of clustering in a person's movements: highly spatially clustered individuals vs. highly spatially dispersed individuals.
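One simple measure of that last distinction is the radius of gyration: the root-mean-square distance of a person's position fixes from their centre of mass. A sketch with made-up coordinates:

import math

def radius_of_gyration(points):
    """RMS distance of position fixes from their centre of mass.

    Small values = spatially clustered individual; large = dispersed.
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2
                         for x, y in points) / len(points))

home_body = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
traveller = [(0.0, 0.0), (5.0, 2.0), (9.0, 7.0), (3.0, 8.0)]
print(radius_of_gyration(home_body))  # small: clustered
print(radius_of_gyration(traveller))  # large: dispersed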
#19: Users can switch between various visualization modes to get different perspectives on the same data set: for example, altitude vs. (lat, lon).