
Bridging the Gap Between Data and Insight Using Open-Source Tools (ODSC Boston 2015)


Editor's Notes

  1. My name is Nick Arcolano and I'm a data scientist at RunKeeper. I'd like to talk today about how I've been using open-source tools in my day-to-day work at RunKeeper to help shorten the time it takes to get from data to insight. I should note that I'm usually a PowerPoint guy, but in the spirit of this open-source conference, I generated this talk directly from a Jupyter notebook. Ironically, when I saw how spotty the wifi was yesterday, I became skeptical that any of my embedded content would load, so I tried to convert the talk to a PDF, but it wouldn't render properly, so in the end I gave up and last night I just put a bunch of screenshots back into a PowerPoint presentation.
  2. For context, let me give you some brief background on RunKeeper and what we do.
  3. More than 40 million people have downloaded RunKeeper, and as you'd expect, these users have generated a lot of data. One of the user interactions we care most about is when someone records a fitness activity. More than 500 million of these have been recorded by our users, and the majority of them have an associated GPS track. So, we have a huge collection of user-generated GPS data, consisting of hundreds of billions of points.
  4. So, then, what do we do with all of this data? Since we're a small data team at a start-up with very limited resources, data science at RunKeeper covers a lot of ground, including (but not limited to) the areas I've listed here. One common thread, though, is that these areas all involve turning raw data into actionable insights as quickly as possible.
  5. This chart describes a typical workflow for pretty much any data science task I do, and I'll bet it looks pretty familiar to many of you. Much of the user, event, and production data eventually ends up in a data warehouse in formats and architectures more suitable for analytics. I often work in SQL to do some basic data wrangling and analysis, but if I want to do anything complicated, I try to get an aggregated sample of the data as quickly as possible into an environment that excels at exploratory data analysis. Typically this is Python, which I'm usually using interactively through IPython (or now Jupyter). It's in this interactive, exploratory environment that I can rapidly iterate to figure out what will be the most valuable outputs of my analysis, whether they're slide decks or reports, prototypes for dashboards, data visualizations, etc. Also, if you're keeping track, this is in fact a PowerPoint slide of a screenshot of a reveal.js slideshow generated from a Jupyter notebook which originally contained a screenshot of a PowerPoint slide.
  6. For me, the true measure of a data analysis tool is how rapidly it lets me traverse this flow, getting from raw data to insight as quickly as possible. We're all here because open-source tools have made connecting these elements easier than ever before. In particular, geospatial data analysis is one of the best examples of how all the pieces are coming together in new and exciting ways, so that's what we're going to talk about for the rest of my time here.
  7. Working with geospatial data is an area that I'm fairly new to, but one that I'm very excited about. For a long time geo data analysis seemed like way too big of an investment, like something that you had to be a "geo expert" to do, as opposed to it being another complementary part of your data science toolset.
  8. Here's a selection of some of the open-source projects and products I've encountered while learning to work with geo data. This is just a small sample; there are a lot of great libraries for geo data now, especially for Javascript and Python, and more are being developed all the time.
  9. With the time we have left, I'll go through an example analyzing some RunKeeper data in Python. Our data will be in the GeoJSON format, and we'll use the geopandas library to get some quick-and-dirty map visualizations. Full disclosure, I am by no means a geo expert, and that's kind of the point here. What's more interesting is how someone like me with a general background in data science and analytics tools can get to answers quickly by using the right combination of open-source tools.
  10. So, for any particular trip (such as trip number 328,635,286), I have a set of GeoJSON files representing each 200-meter segment in that trip.
  11. Now, let's load a whole array of segment files. You can see each one has a geometry member, which is of the LineString type and contains an array of longitude-latitude-altitude coordinates for that 200-meter part of the run. The GeoJSON also contains some other properties, such as the original trip ID, distance, and pace, which were derived when the file was originally created. One property I have in here is "speed ratio", which I computed as the ratio between the user's running speed for that segment and their average speed over the entire run. We'll be using this later.
  12. Geopandas extends the pandas library, one of the primary tools for working with data in Python. The primary object we'll work with is the GeoDataFrame, which is an extension of the DataFrame object in pandas, but with the ability to store and work with geospatial data. The geometry part of the GeoJSON is automatically converted into a special geometry column, while the properties are automatically converted into additional columns. With this GeoDataFrame we can do all the things we can do with traditional data frames, like slicing, pivoting, and aggregation, but now we can do some of them with geospatial operations. (There's a rough sketch of this loading step after these notes.)
  13. So now, with our GeoDataFrame in hand, there are a lot of things we can do. For starters, it's easy to get a sense of what our data looks like. If you just call "plot" here, you can see all of the segments in this trip (represented by the different colors). (There's a one-line plotting sketch after these notes, too.)
  14. If you want more context, though, Jake Wasserman of Mapkin wrote this handy little utility to view geo data on the Mapbox geojson.io site embedded right in IPython. So, you can see here that these segments are from a run around the Charles River. (Actually, it's one of my own runs, which it looks like I started from the RunKeeper office on Canal Street in downtown Boston.)
  15. A DataFrame with just one run isn't all that interesting, so now we'll look at segments from 10,000 random running trips, which ends up being around 300,000 segments. (I've hidden the details of the loading here, but it's similar to what I just showed; there's also a rough sketch of it after these notes.) Because we've loaded everything into a DataFrame, we can do any quick-and-dirty analysis that we want, like cleaning or filtering the data or computing basic statistics.
  16. Also, we can do a quick sanity check and plot some of these segments to make sure they are what we think they are. You can see that they are indeed from the trips in the greater Boston area.
  17. Now, here's where things start to get interesting. Let's say we decide that what we really care about right now are only the segments in Cambridge. Fortunately, the City of Cambridge has some lovely geo data that we can load up. (You'll note here that the data I happened to download is in a local coordinate system, so I needed to transform it into a different coordinate reference system.) Geopandas lets us compute logical indices based on spatial operations, such as whether each segment is within the boundary of Cambridge, and then we can look at just those segments. (See the boundary-filter sketch after these notes.)
  18. We can do another sanity check and see that our segments seem to be entirely in Cambridge now.
  19. Now let's start to ask some interesting questions about the data. Maybe from the running data we can see which major intersections slow runners down the most? Recall the "speed ratio" for each segment I mentioned earlier, which is the ratio of running speed for that segment to the average speed over the total duration of the trip. (Note that if I hadn't done this ahead of time, it still would be possible to do now while I have the data in hand; such is the power of the DataFrame.) First, we'll load up some point data for intersections in Cambridge. Then, we add a small buffer around each intersection to turn each geo point into a little bubble, and then iterate over them and perform a geo operation to find which segments intersect each of them. We also compute the median speed ratio of the segments at each intersection. Finally, we can sort the intersections based on which ones slow runners down the most. (A sketch of this intersection analysis appears after these notes.)
  20. Let's look at some of the worst major intersections, by filtering on segment count and looking at the 10 with the lowest speed ratios. You see some expected results here (like Harvard Square), but you also see a lot of points along the river. Maybe these are real problem areas for runners, or maybe it's just people walking at the beginning or end of their run, or maybe there are some boundary effects from the way we selected the segments. Regardless, we're already learning a lot about what patterns and edge cases might arise if we tried to extend this analysis at scale.
  21. Let's say, instead, we want to know on which roads the fastest runs are happening. To take a crack at this, we load up some more of this beautiful City of Cambridge GIS data, and use another geo operation to assign each segment a road ID based on the closest road geographically. We then group by road ID and compute median running speed and segment counts, similar to what we did with the intersections. (There's a sketch of this step after these notes as well.)
  22. Now it's straightforward to see where some of the fastest running is happening in the city. Here's the fastest stretch of road, out of the roads with at least 100 segments. It's near MIT, and in fact, if you were to look at the whole map, you'd see that a lot of fast running happens around the MIT campus.
  23. Because we've been working with open-source tools and open standards, we're pretty well covered if we want to export this data to work with another tool. For example, I can dump the road network (now augmented with running speeds) to a GeoJSON file, and then import it into a map styling tool like Mapbox Studio. (The last sketch after these notes shows this export.)
  24. Here's a screenshot of the Cambridge road data live on Mapbox, all styled and overlaid on their OSM layers. As you can see, I'm hardly a master cartographer, but it was pretty straightforward to get this data loaded up and styled so that I can start to make some sense of it. You can already see some patterns, like faster running around the MIT campus and along the river, while slower running happens around some of the messier intersections (like Harvard Square) and on a lot of the side streets.
  25. I always hate when people give the "cooking show" version of a talk and show you how easy everything is, but when you try to do it yourself you realize there are all these messy, difficult bits that they don't show on TV. So, some caveats: first, I did all this with data on disk instead of using a geo database, so it's somewhat slow and certainly not a scalable solution. However, it works fine for prototyping, and the things we're learning will save us a lot of time later when we finally start to design something that works at scale. Also, even with these great tools, geo is still challenging and can require a huge amount of expertise. For example, I glossed over some issues about map projections, and honestly that's something I needed some help with when I tried to do this on my own. You might also guess that I didn't exactly pick Cambridge by accident; their data is super nice to work with, and it also kept me from having to figure out how to get and work with OpenStreetMap data directly. Finally, like many open-source tools, geopandas is still a work in progress, so the usual "use at your own risk" warnings apply there as well.
  26. Even with all these disclaimers, at the end of the day I was able to do something in a new domain, and I was able to do it fairly quickly and painlessly, and I have the open-source community to thank for that. In the process I learned a lot about our own data, and about the potential opportunities and challenges if we were to continue pursuing approaches like these to support things like user research, product features, and brand or partner insights. You've probably figured out by now that I'm a big fan of rapid prototyping. Benjamin Franklin said that an ounce of prevention is worth a pound of cure; I believe that an ounce of prototyping is worth a pound of development. That to me is the real value of these tools: being able to get from a pile of data to something that helps us make decisions and move forward as rapidly as possible.
  27. Special thanks to Jake Wasserman of Mapkin, my team at RunKeeper, and to the Open Data Science Conference for inviting me to speak.
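The sketches below reconstruct the main code steps described in the notes; they are illustrations under stated assumptions, not the exact code shown in the talk. First, a minimal sketch of the loading step from notes 10-12: the directory layout and the property names (distance, pace, speed_ratio) are hypothetical, but the pattern of reading each per-segment GeoJSON file with geopandas and concatenating is the idea described.

```python
# Rough sketch of loading per-segment GeoJSON files into one GeoDataFrame
# (notes 10-12). Paths and property names are assumptions for illustration.
import glob

import pandas as pd
import geopandas as gpd

paths = sorted(glob.glob("segments/trip_328635286/*.geojson"))   # hypothetical layout
frames = [gpd.read_file(p) for p in paths]                        # one frame per 200 m segment
segments = gpd.GeoDataFrame(pd.concat(frames, ignore_index=True), crs=frames[0].crs)

# The GeoJSON "geometry" member becomes the geometry column; the "properties"
# (trip ID, distance, pace, speed ratio, ...) become ordinary columns.
print(segments.geometry.geom_type.unique())         # expect ['LineString']
print(segments[["distance", "pace", "speed_ratio"]].describe())
```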
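The quick look from note 13 is essentially one call to plot; this continues from the segments frame in the previous sketch and assumes matplotlib is available.

```python
# Quick-and-dirty plot of one trip's segments (note 13), continuing from the
# previous sketch; coloring by speed_ratio is just one option.
import matplotlib.pyplot as plt

ax = segments.plot(column="speed_ratio", figsize=(8, 8))
ax.set_title("200 m segments for one trip")
plt.show()
```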
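Note 15 hides the bulk loading; a sketch of the same idea at larger scale might look like the following, with an invented glob pattern and made-up cleaning thresholds.

```python
# Loading segments from many trips (note 15) works the same way, just with
# more files; the glob pattern and cleaning thresholds are invented.
import glob

import pandas as pd
import geopandas as gpd

files = glob.glob("segments/*/*.geojson")
all_segments = gpd.GeoDataFrame(
    pd.concat([gpd.read_file(f) for f in files], ignore_index=True))

# Quick-and-dirty cleaning and summary statistics, just like a plain DataFrame
clean = all_segments[(all_segments["pace"] > 0) & (all_segments["speed_ratio"] < 5)]
print(len(clean), "segments from", clean["trip_id"].nunique(), "trips")
print(clean[["distance", "pace", "speed_ratio"]].describe())
```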
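For the Cambridge filter in notes 17-18, the key calls are to_crs for the reprojection and within for the logical index. The file name is hypothetical, and I'm assuming the segments are plain longitude/latitude (EPSG:4326); the boundary's own source CRS is read from the file.

```python
# Spatial-filter sketch (notes 17-18): reproject the Cambridge boundary to
# match the segments, then keep only segments that fall within it.
import geopandas as gpd

boundary = gpd.read_file("cambridge/city_boundary.geojson")   # hypothetical path
boundary = boundary.to_crs(epsg=4326)           # assume segments are lon/lat

cambridge_shape = boundary.unary_union          # collapse to a single (multi)polygon
in_cambridge = clean[clean.within(cambridge_shape)]
print(len(in_cambridge), "segments inside Cambridge")
```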
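A sketch of the intersection analysis in notes 19-20, continuing from the previous sketches. The buffer radius, attribute names, and the minimum-segment threshold are all made up, and buffering in raw degrees is crude (one of the projection issues note 25 alludes to), but it's enough for a prototype.

```python
# Intersection analysis sketch (notes 19-20). Buffer size, attribute names,
# and thresholds are invented for illustration.
import pandas as pd
import geopandas as gpd

intersections = gpd.read_file("cambridge/intersections.geojson").to_crs(epsg=4326)

rows = []
for idx, point in intersections.geometry.items():
    bubble = point.buffer(0.0005)                   # crude ~50 m bubble in degrees
    hits = in_cambridge[in_cambridge.intersects(bubble)]
    rows.append({"intersection": idx,
                 "segments": len(hits),
                 "median_speed_ratio": hits["speed_ratio"].median()})

ranked = (pd.DataFrame(rows)
            .query("segments >= 100")               # drop sparse intersections
            .sort_values("median_speed_ratio")
            .head(10))                              # the ten "slowest" intersections
print(ranked)
```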
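For notes 21-22, one simple (if slow) way to get a nearest-road assignment with plain geopandas is a brute-force distance lookup over the road centerlines. The road file, the "ID" attribute, and the "speed" column are assumptions; a real run would use whatever names the Cambridge GIS data and the segment properties actually carry.

```python
# Nearest-road assignment and per-road stats (notes 21-22), continuing from
# the previous sketches. "ID" and "speed" are assumed column names.
import geopandas as gpd

roads = gpd.read_file("cambridge/roads.geojson").to_crs(epsg=4326)

def closest_road(segment_geom):
    # distance from this segment to every road centerline; pick the smallest
    return roads.loc[roads.distance(segment_geom).idxmin(), "ID"]

per_segment = in_cambridge.copy()
per_segment["road_id"] = per_segment.geometry.apply(closest_road)   # slow, prototype-grade

by_road = (per_segment.groupby("road_id")["speed"]
           .agg(["median", "count"])
           .query("count >= 100")
           .sort_values("median", ascending=False))
print(by_road.head(10))     # fastest stretches with at least 100 segments
```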
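Finally, the export from note 23 is a single to_file call once the per-road stats are merged back onto the road geometries; this continues from the previous sketch, and the output path is arbitrary.

```python
# Export the augmented road network as GeoJSON (note 23) for styling in a
# tool like Mapbox Studio; continues from the previous sketch.
roads_with_speed = roads.merge(by_road, left_on="ID", right_index=True)
roads_with_speed.to_file("cambridge_roads_with_speed.geojson", driver="GeoJSON")
```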