DATA MINING + DATA VISUALIZATION

Cédric Warny
Social data
Text data
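[Figure: text-message dashboard. Top panel: word frequency plotted against word rank, with a rank slider and a word search box. Bottom panel: word use frequency by most frequent users (thierry, kristof, xi, lionel, christina, priya, philippe, monique, laurent, gilles, tanguy, stéphane, manon, suzanne).]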
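
The word frequency against word rank panel can be approximated from raw messages in a few lines of Python. What follows is only a minimal sketch, not the code behind the dashboard: the `messages` list is made-up sample data, and `rank_frequencies` and `zipf_slope` are hypothetical helper names. It counts words, ranks them, and estimates the power-law exponent that the editor's notes attribute to Zipf's law with a least-squares fit in log-log space.

```python
import math
import re
from collections import Counter


def rank_frequencies(messages):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = Counter()
    for text in messages:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common()


def zipf_slope(ranked):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for _, count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Made-up sample data standing in for a year of text messages.
messages = [
    "bienvenue a tous",
    "a bientot",
    "bisous a tous",
    "a demain",
]
ranked = rank_frequencies(messages)
print(ranked[0])           # ('a', 4): the rank-1 word and its count
print(zipf_slope(ranked))  # negative slope: frequency falls as rank grows
```

A slope near -1 is what Zipf's law would suggest on a large corpus; a sample this small only shows the mechanics.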
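[Figure: the same dashboard with a word search in progress. Suggestions listed under the search box: bienvenue, bien, bisous, bienvenu, bientot, bitume, bibli, biltiau, bienbizz, bill4friends. Top panel: word frequency against word rank, with a rank slider. Bottom panel: word use frequency by most frequent users (maman, stéphanie, philippe, marielle, sophie, stéphane, suzanne, manon, christina, priya, xi, tanguy, amaury, gilles).]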
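[Figure: people search view. Typing "ch" in the people search box suggests christina, charlene, charlotte, chantal. Top panel: person frequency against person rank, with a rank slider. Bottom panel: most used words.]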
Automatically categorize words

[Diagram: hidden Markov chain over tags or categories, * → y1 → y2 → y3 → STOP, with transitions y1 → y2 and y2 → y3 and emissions y1 → the, y2 → dog, y3 → barks. Sentence: the dog barks.]
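
The transition and emission structure in this diagram is what the Viterbi algorithm exploits to pick the most likely tag sequence. The sketch below is an illustration under stated assumptions rather than the implementation mentioned in the notes: it assumes a first-order model (each tag depends only on the previous one), and the tag set and all probabilities are made up for the example.

```python
import math


def viterbi(words, tags, transition, emission):
    """Most probable tag sequence under a first-order hidden Markov model.

    transition[(prev_tag, tag)] and emission[(tag, word)] are probabilities;
    missing entries count as zero. "*" marks the start and "STOP" the end.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = log(transition.get(("*", tag), 0)) + log(emission.get((tag, words[0]), 0))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log(emission.get((tag, words[i]), 0))
            best[i][tag] = max(
                (best[i - 1][prev][0] + log(transition.get((prev, tag), 0)) + emit, prev)
                for prev in tags
            )

    # Close the chain with the transition into STOP, then follow the back-pointers.
    _, last = max((best[-1][t][0] + log(transition.get((t, "STOP"), 0)), t) for t in tags)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Hypothetical tags and probabilities, chosen only to make the example run.
tags = ["DET", "NOUN", "VERB"]
transition = {
    ("*", "DET"): 0.8, ("*", "NOUN"): 0.2,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.1, ("NOUN", "STOP"): 0.2,
    ("VERB", "NOUN"): 0.2, ("VERB", "STOP"): 0.8,
}
emission = {
    ("DET", "the"): 0.9,
    ("NOUN", "dog"): 0.6, ("NOUN", "barks"): 0.1,
    ("VERB", "dog"): 0.05, ("VERB", "barks"): 0.5,
}
print(viterbi(["the", "dog", "barks"], tags, transition, emission))
# ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numerical underflow on long sentences; in practice the probabilities would be estimated from a tagged corpus instead of being written by hand.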
Spatial data
Event data
Predicting whether and when something happens

[Figure: probability (0 to 1) plotted against time, with a horizontal cutoff line and a region labeled EVENTS.]
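
One possible reading of this plot, offered here as an assumption since the slide does not spell out the model, is that a model outputs an event probability at each time step and an event is predicted at the first step where that probability reaches the cutoff. That answers both "whether" (does it ever cross?) and "when" (at which step?). The function name `first_crossing` and the sample values are hypothetical.

```python
def first_crossing(probabilities, cutoff):
    """Return the index of the first time step whose probability reaches
    the cutoff, or None if the cutoff is never reached."""
    for t, p in enumerate(probabilities):
        if p >= cutoff:
            return t
    return None


# Made-up probability curves over five time steps.
rising = [0.10, 0.30, 0.55, 0.85, 0.90]
flat = [0.10, 0.20, 0.15, 0.20, 0.10]

print(first_crossing(rising, cutoff=0.7))  # 3: an event is predicted at step 3
print(first_crossing(flat, cutoff=0.7))    # None: no event is predicted
```

Raising the cutoff trades missed events for fewer false alarms, so its value would normally be tuned on held-out data.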


Editor's Notes

  • #2: Student at the Institute for Advanced Analytics at NC State. I present my work by emphasizing what I believe is my strength, i.e. the combination of data mining and data visualization skills. I've tried to select projects that I believe relate to the work being done at the lab. The presentation is structured by data type: social data, text data, geo data, and finally event data.
  • #4: This project is a dynamic visualization of a network of relationships. In this case, a relationship is defined as the joint mention of two people in the same New York Times article. A seed is chosen (here, Qaddafi) and the program searches through the NYT's public APIs for all the people mentioned with Qaddafi in the same article. The NYT API enables you to specify types of words to look for (key categories like people or organizations are automatically tagged in the articles).
  • #5: Clicking on a name spawns, in turn, all the connections to that name, so you can interactively explore the network of relationships.
  • #6: This is another type of social data visualization, one that emphasizes social influence. It is a treemap of my Twitter followers where the size of someone's profile picture is proportional to the number of followers that person has. It's visually appealing because it uses profile pictures, and its meaning is straightforward: size = measure of social influence. The only issue with this visualization is that some pictures have to be deformed so that the overall layout fits a rectangle, as a treemap requires. Such a treemap could be made interactive, whereby clicking on a profile picture updates the mosaic with that person's followers. This could be a really nice way of navigating the Twitter graph.
  • #8: Text messages over a year. This is a dashboard where you can search either by word or by person. The slider enables you to select a word rank (i.e. the first, second, third, etc. most used word). For a selected word, you can see who's the most frequent user of that word. The graph plotting word frequency by word rank illustrates the famous Zipf's law, according to which the frequency of a word as a function of its rank follows a power law, meaning that just a few words make up the bulk of our vocabulary use.
  • #9: You can also search for a word by typing it, with real-time word suggestions.
  • #10: And you can also do a search by person. In that case, you see a ranking of the most exchanged words with that person. A fun application for such data is, for instance, to calculate the vocabulary size of your friends and see who's the most learned. That's what I did, but when I posted the results on the Facebook wall of my friend with the smallest vocabulary size, she stopped talking to me for 2 days. So I don't recommend that course of action.
  • #11: This slide just illustrates an algorithm I've been implementing in Python to automatically categorize words in a sentence. Basically, the model assumes that the category of previous words in a sequence is predictive of the category of the next word. And so you would choose the category or tag that is both most likely to follow a given sequence of tags and most likely to be associated with the word. Such algorithms can be really useful in sentiment analysis. You could adapt the model to tag the sentiment associated with a word. This can be applied to assess the mood of citizens based on online, real-time text data (typically Twitter).
  • #12: Spatial data is key in analyzing the life of cities. Here I present a series of experimental visualizations of spatial data.
  • #14: Population density. The idea here is that of deformed maps, i.e. maps that you deform to represent a more abstract reality. I wanted to visualize population density, so my first thought was: when something is dense, it is heavier; if blown by the wind, light things will fly off further than dense ones. Hence this 3D visualization, except that, here, denser countries fly off higher than less dense countries.
  • #16: This project illustrates the application of geostatistics using the ArcGIS software. It represents the spread of dengue fever in the village of Pennathur, India (in 2001). More particularly, the goal was to determine whether there was a clustering phenomenon. The color of the dots reflects the significance of the clustering: the darker, the more clustered. The idea is simply to compare the actual distribution of the points to a random distribution: if the number of events found within a certain distance is greater than what we would expect under a random distribution, then the distribution is clustered. A statistical test checks whether the departure from randomness is significant: thousands of random distributions are generated and, for each, a measure of spatial distribution is calculated. The highest and lowest values across all these simulations form an envelope; if the observed value falls beyond that envelope, the clustering is significant. Being able to determine the significance of a spatial clustering phenomenon can have many applications for analyzing the life of cities: do certain types of people or certain events (like crimes) occur in a clustered manner? Why?
  • #17: In this project, I used the app OpenPaths to track my location every 3 minutes through my phone. After a few months, I downloaded the data and used the Processing library UnfoldingMaps (by one of your former researchers) to visualize my spatial patterns over time using tile-based maps.
  • #18: Such projects can be useful to create more personalized maps: one could think of deforming maps, i.e. changing distances between points to reflect a new spatial relationship between these points, like the time to get there or the frequency of travelling from one point to the other. Such projects can also be useful to build predictive models of where someone will be in the future. Another use is defining new, more natural borders: clustering observations according to real-world data instead of arbitrary political boundaries. If we notice that a lot of people tend to overuse certain areas of a city, while other people focus on other areas, we could create boundaries that really separate areas based on people's use of them. We could also use the spatial patterns to measure the degree of clustering: highly spatially clustered individuals vs. highly spatially dispersed individuals.
  • #19: Users can switch between various visualization modes to get different perspectives on the same data set: altitude vs (lat, lon)
  • #23: Spatial data is key in analyzing the life of cities. Here I present a series of experimental visualizations of spatial data.
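
The simulation envelope described in note #16 can be sketched in plain Python. This is a simplified stand-in, not the ArcGIS procedure used in the project: it assumes a rectangular study area, uses the number of point pairs within a fixed distance as the measure of spatial distribution, and ignores edge effects. The function names and the sample points are hypothetical.

```python
import math
import random


def pairs_within(points, distance):
    """Count point pairs that lie no further apart than `distance`."""
    count = 0
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            if math.hypot(x1 - x2, y1 - y2) <= distance:
                count += 1
    return count


def envelope_test(points, distance, width, height, simulations=999, seed=0):
    """Compare the observed pair count with an envelope built from
    uniformly random patterns of the same size in a width x height area."""
    rng = random.Random(seed)
    observed = pairs_within(points, distance)
    simulated = []
    for _ in range(simulations):
        pattern = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in points]
        simulated.append(pairs_within(pattern, distance))
    low, high = min(simulated), max(simulated)
    if observed > high:
        verdict = "clustered"
    elif observed < low:
        verdict = "dispersed"
    else:
        verdict = "consistent with random"
    return observed, (low, high), verdict


# Made-up data: 20 cases packed into one corner of a 100 x 100 area.
cases = [(i % 5 * 0.2, i // 5 * 0.2) for i in range(20)]
observed, envelope, verdict = envelope_test(cases, distance=2, width=100, height=100)
print(observed, envelope, verdict)
```

Repeating the test over a range of distances shows at which spatial scale, if any, the pattern departs from randomness.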