This document provides an overview of an "Exploratorium", which is described as a guided tour of open source data analysis tools. It discusses exploring patent and health care data using tools like Graphviz and NetworkX to analyze networks, and Redis to store and reshape data. Examples are given of analyzing Reddit comment and user networks by counting words and mapping relationships between comments and users. The document encourages sharing favorite open source tools using #exploratorium.
The Big Data Exploratorium
1. The Big Data
Exploratorium
A guided tour of open source
data analysis tools
Noah Pepper (@noahmp)
Devin Chalmers (@qwzybug)
#exploratorium @osb11
Thursday, June 23, 2011 1
2. Hi,
We're here because...
We are...
Data Exploration Is...
Example 1: Patents
(Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)
Example 2: Health Care
(Pepper et al. Visweek 2010)
3. Hi,
Exploratorium #1
Patent citation networks
Graphviz
NetworkX
Exploratorium #2
Reddit comment word usages
4. Hi,
Get the code & data samples:
git clone git@github.com:peppern/exploratorium.git
5. We're here because...
There is a really amazing OSS community in the data space.
This is fantastic news for academics, hobbyists, and professionals alike.
We want to show what you can do with open source tools, and show you the ones we like.
We'd love to hear about what YOUR favorites are: tag them #exploratorium to tell us.
Data exploration is fun...
6. We are...
Noah Pepper - @noahmp
Devin Chalmers - @qwzybug
Academic Data Junkies
Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining.
We're Sorta Lucky
Our startup, where we build data exploration platforms.
7. We Build Data Exploration Tools!
map.clearhealthcosts.com
8. What is data exploration, and what is an exploratorium?
Narrow Definition
Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.
Why do I say visualization instead of the more general representation?
exploratorium
noun [usu. in names]
a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.
Yes! That means there's code and data
9. Data Exploration Example
study evolution of technology in patent records
technology is a window on culture
patents are a window on technology
15. PMI distributions
- see clusters
- different kinds of clusters
16. PMI Comparison: Plotting a different way
[Plot; surviving labels: PMI integral, halfway rank; labeled terms: the, optical, cultivar]
optical - generality of content?
17. btw, these are older graphs, now we use ggplot2
18. Previous Work in Health Care...
[Chart: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO). Legend: Placement in distribution of billed (Upper 5%, Bottom 5%).]
... with @homerstrong at Qmedtrix Systems Inc.
19. Previous Work in Health Care...
[Chart, two panels over a log-scale x-axis, Amount ($), from 10^1 to 10^7. Top: bill volume (0 to 120,000). Bottom: dollar density (0 to 1.4e+09). Legend: Billed, First Audit, Second Audit.]
... @hadleywickham is a #ballR
http://had.co.nz
20. Health Care Data & Code Samples...
...Hahaha Just Kidding
21. But actually:
Qmedtrix R&D team members made source contributions, see:
Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)
Kevin Lynagh https://github.com/lynaghk (Keming Labs)
22. Exploratorium #1 Patent Networks
citations amongst the top 10k most cited patents
23. Grab the graph data:
~/exploratorium/patents/toplinks.dot
Graphviz Art is Pretty!
24. Graphviz can graph really big graphs... but they get hard to use
Psychedelic Patents
25. Graphviz - Play with Graphs
(http://www.graphviz.org)
sudo port install graphviz or sudo apt-get install graphviz
graphing commands: dot, neato, twopi, circo, fdp
dot -Tpdf file.dot -o file.pdf
More options here:
http://www.graphviz.org/content/command-line-invocation
Fun options are in the .dot file:
http://www.graphviz.org/content/dot-language
26. Styling dots
node [shape=point, width="0.15",color="#0000001c"];
edge [arrowsize="0.50", color="#0000001c"];
There are tons of options; read the docs and have fun
You can also try more complex things, like constraints (time, for example)
Sometimes too many constraints make Graphviz unhappy...
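The node and edge styling above can also be written out programmatically. Here is a minimal sketch in Python, assuming a hypothetical list of (citing, cited) patent-number pairs; the function name `to_dot` and the sample numbers are illustrative and are not taken from the exploratorium repository.

```python
def to_dot(edges, name="citations"):
    """Render (citing, cited) pairs as DOT text with the point-node styling."""
    lines = [f"digraph {name} {{"]
    # Same defaults as on the slide: small, mostly transparent points and arrows.
    lines.append('  node [shape=point, width="0.15", color="#0000001c"];')
    lines.append('  edge [arrowsize="0.50", color="#0000001c"];')
    for citing, cited in edges:
        # Quote every ID so any patent identifier is a valid DOT ID.
        lines.append(f'  "{citing}" -> "{cited}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical sample edges, for illustration only.
sample_edges = [(5000001, 4000001), (5000002, 4000001), (5000002, 5000001)]
dot_text = to_dot(sample_edges)
print(dot_text)
```

Saving the printed text to a .dot file and passing it to one of the layout commands (dot, neato, and so on) should render it, assuming Graphviz is installed.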
28. UbiGraph
We loved UbiGraph, but don't know an OSS alternative
Renders many nodes (50k+) in 3D with a real-time force-directed layout.
On a Mac Pro with 16 GB of RAM
Shout out to Apple: thank you for supporting our research!
It's free, but development has stalled, and since it's closed source we can't build on it!
Alternatives?
29. Exploratorium #2
Making graphs of language using Python, Redis, R, and a bunch of awesome libraries
Thanks
@hadleywickham
@homerstrong
@antirez
Bryan Lewis (http://illposed.net/)
30. ...how?
Mine Munge Visualize
39. Store the data
Postgres is not too shabby
40. Store the data
SELECT cite AS patent_num, count FROM (SELECT cite,
count(*) AS count FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC LIMIT 10
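To try the top-cited query without a Postgres server, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. The two-column `citations` layout (`patent_num`, `cite`) is an assumption inferred from the queries on these slides, and the rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the Postgres instance from the talk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num INTEGER, cite INTEGER)")

# Hypothetical rows: (citing patent, cited patent).
rows = [(10, 1), (11, 1), (12, 1), (11, 2), (12, 2), (12, 3)]
conn.executemany("INSERT INTO citations VALUES (?, ?)", rows)

# The query from the slide: count citations per cited patent, most cited first.
query = """
SELECT cite AS patent_num, count FROM (
    SELECT cite, count(*) AS count FROM citations GROUP BY cite
) AS t1
ORDER BY t1.count DESC LIMIT 10
"""
top_cited = conn.execute(query).fetchall()
for patent_num, count in top_cited:
    print(patent_num, count)
```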
41. Store the data
SELECT "cite", count(*), "year" FROM "citations"
INNER JOIN (SELECT date_part('year', "grantdate") AS
"year", "patent_num" AS "patent_num" FROM "patents")
AS "t1" USING ("patent_num") WHERE (cite IN (12345))
GROUP BY "year", "cite"
42. Store the data
SELECT term, count
FROM (
  SELECT term, count(*)
  FROM (SELECT patent_num, term FROM tfidfs WHERE (tfidf > 0.05)) AS "t1"
  INNER JOIN (
    SELECT *
    FROM (SELECT patent_num FROM patent_lengths WHERE (wordcount > 10)) AS "t1"
    INNER JOIN (
      SELECT * FROM patents
      WHERE (grantdate > '1990-01-01' AND grantdate < '2000-01-01')
    ) AS "t2" USING ("patent_num")
  ) AS "t2" USING ("patent_num")
  GROUP BY "term"
) AS "t3"
ORDER BY count DESC
LIMIT 50;
72. Reddit
Count words by hour: ZSET subreddit:2011-06-21:12
    word [count]
Comment network: SET thread_id:comments
    parent_id:child_id
User network: SET thread_id:users
    parent_id:child_id
SET subreddit:threads
    thread_id
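The key layout above can be sketched without a running Redis server. This is a minimal stand-in, assuming plain Python containers in place of Redis types (a Counter per ZSET key, a set per SET key). The comment dictionaries and their field names are hypothetical, as is the choice to store user names as the user-network members; none of this is the format the talk's own code used. Against a real server, the corresponding writes would be ZINCRBY and SADD commands.

```python
from collections import Counter, defaultdict

zsets = defaultdict(Counter)  # stand-in for Redis ZSETs: key -> {member: score}
sets = defaultdict(set)       # stand-in for Redis SETs: key -> {members}


def record_comment(c):
    """Store one comment under the key layout from the slide."""
    # Count words by hour: ZSET subreddit:YYYY-MM-DD:HH -> word [count]
    hour_key = f"{c['subreddit']}:{c['date']}:{c['hour']:02d}"
    for word in c["body"].lower().split():
        zsets[hour_key][word] += 1  # Redis equivalent: ZINCRBY hour_key 1 word
    # Comment network: SET thread_id:comments -> parent_id:child_id
    sets[f"{c['thread_id']}:comments"].add(f"{c['parent_id']}:{c['id']}")
    # User network: SET thread_id:users -> parent user:child user (assumed)
    sets[f"{c['thread_id']}:users"].add(f"{c['parent_author']}:{c['author']}")
    # Thread index: SET subreddit:threads -> thread_id
    sets[f"{c['subreddit']}:threads"].add(c["thread_id"])


# Hypothetical comments, for illustration only.
comments = [
    {"subreddit": "programming", "date": "2011-06-21", "hour": 12,
     "thread_id": "t1", "id": "c2", "parent_id": "c1",
     "author": "bob", "parent_author": "alice", "body": "Graphs are fun"},
    {"subreddit": "programming", "date": "2011-06-21", "hour": 12,
     "thread_id": "t1", "id": "c3", "parent_id": "c2",
     "author": "alice", "parent_author": "bob", "body": "fun fun graphs"},
]
for comment in comments:
    record_comment(comment)

# Most frequent words for that subreddit and hour.
top_words = zsets["programming:2011-06-21:12"].most_common(2)
print(top_words)
```

Keeping one sorted set per subreddit and hour means the top words for any hour can be read back directly, while the edge sets hold exactly the parent:child pairs a graph tool needs.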
77. Reddit
Go forth and graph!
#exploratorium #osb11
We will hire you.
For reals.
78. You Are Now Leaving
the Big Data
Exploratorium
Please ensure you have your
valuables.
Noah Pepper @noahmp
Devin Chalmers @qwzybug
#exploratorium #osb11