Disparate Data, Technology Fiefdoms (and 65 pictures of your cat)
Up in the frozen wastes of northern British Columbia, we organized a hackathon. We based it on the ideas of open data and civic applications.
Our hardy hackathoners pulled together a number of excellent ideas but met with a constant and obtrusive barrier: open data may be open, but without some level of standardization it's not actually very useful.
Now, no one said that data had to be useful, and perhaps if we want the technology utopia of real open data interoperability we will need to build it ourselves, but it is worth noting that talking the same language as our neighbours is generally awesome. Indeed, perhaps rather than swearing fealty to our technology overlords and just pressing the "publish document to open data platform" button, we could think about the commonwealth of data. The value of any data increases wildly with density, and open data should be more valuable!
The cats? Well, you'll have to tune in for that bit.
11. Do we really care about the software?
12. (I stole this picture from the internet and I can't remember from whom, sorry!)
13. There are estimated to be 14,308,667,560 pictures of cats on the internet. There are only thought to be 220,000,000 domestic cats in the world. So, every cat has had 65 pictures of it posted on the web? Right?
17. And here is a picture of a cat I found on the internet. It's grumpy.
Editor's Notes
#3: Data comes in many shapes and forms.
As geographers we use data every day, but we should note that data comes in many forms; it sits on an infinite spectrum of possibility, constrained only by our common understanding of the universe we live in.
In this example, we understand temperature, we probably get the idea of check-ins, I know that 2% is 2 in every hundred, and I have a strong grasp of ice cream. This factoid is also contextually geographic, which is nice. The point here is that I can understand what this data point means without much further explanation.
Beyond that, this is a solid, general statistic.
One which has been derived from a vast array of crowd-sourced Foursquare data. But it is also relevant to our greater understanding of, perhaps, ice cream marketing, or check-in behavior, or even summertime habits.
It sits on the spectrum.
#4: I love data, I love understanding data, I love the complexity of data and I love joining data together to develop insights.
This is detective work
It's deductive.
I started Sparkgeo 4 years ago, but even before that I was deeply embedded in data.
I started my professional life with geostatistical analysis, studying the optimization of soil sampling strategies using normalized difference vegetation index imagery derived from a combination of low-cost aerial remote sensing and near-infrared videography.
We then studied the spatial distribution of the chemotypes of Scots pine saplings in the royal forest at Balmoral, back in the soon-to-be-independent country of Scotland.
After that I helped to clean up a corporate address database through the automated matching of addresses, places and people.
Since crossing the pond I have been analyzing forestry and resources data. But most recently at Sparkgeo we have been helping social networks understand location.
Data has been a theme for me. I imagine that my story is somewhat similar to many of you.
Data data data
#5: But what drives technology?
There was a relatively recent time when clock speed or pixel depth or some specification would drive technology
That's changing; now, instead of specs, we look at features, and when we look at features we are looking at data.
Data is now driving our experiences
#6:
Data drives your experiences of the internet, and in fact many other parts of your world.
It is a measure by which we at Sparkgeo are graded.
In the end it doesn't matter how good your map technology is: if the data is wrong, the technology fails.
I would argue that in reality many of our companies and organizations have moved from being something, to effectively being data organizations.
#7: So as a technology company, we are also a data company.
We live at the intersection of technology and data within the context of geography.
#8: I think it's worth noting that in BC we are very lucky.
Our access to data resources is excellent in comparison to many other jurisdictions.
So, first up, I want to congratulate those who helped make this happen (maybe they are here?).
I think this story of progressive openness is being witnessed widely across numerous countries, states, provinces, regions and cities.
Great job guys.
Now, being a Scotsman, I'm never actually happy or satisfied with anything, so I will tell you a story.
#9: It starts with a hackathon up in Prince George.
We were looking specifically at open data from our City and Regional District
We had various teams, but one team in particular had a problem. They took the simple idea of open data a little further: they wanted to compare the financials of different municipalities to see which would give the biggest bang for the buck in terms of tax dollars. The idea being that they would be able to give consumers and citizens an idea of the best-value municipality in which to live.
Seems reasonable, and pretty interesting
This turned out to be a very difficult exercise.
Mainly because, as it turns out, no one is talking the same language.
#10: And by language I'm not talking about spoken, written or programmatic languages, or even data transfer formats, of course.
I'm talking about raw data. The data products being published by different municipalities didn't actually support any kind of comparative analysis.
That hackathon team was left comparing apples with oranges, because in the vast spectrum of data, the municipalities of BC had found themselves seeing and measuring their worlds in slightly different ways.
Slightly different ways leading to totally different data products.
The point here is not to beat on the municipalities too hard. In reality these organizations are travelling in unknown waters
and the struggle they face to publish anything is not insignificant.
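(A hedged aside, not something the hackathon team actually shipped: the apples-and-oranges problem usually comes down to a small, hand-written mapping from each municipality's own column names and units onto one shared schema. A minimal Python sketch, with the file names, column names and the unit conversion all invented for illustration:)

import pandas as pd

# Hypothetical budget exports from two municipalities, each using its own
# column names and units. Neither the file names nor the schemas are real.
SCHEMAS = {
    "city_a_budget.csv": {"Dept": "department", "Spend ($000s)": "spend_cad"},
    "city_b_budget.csv": {"Service Area": "department", "Expenditure": "spend_cad"},
}

frames = []
for path, mapping in SCHEMAS.items():
    # Rename each file's columns onto the shared schema, keep only shared fields.
    df = pd.read_csv(path).rename(columns=mapping)[["department", "spend_cad"]]
    if path == "city_a_budget.csv":
        df["spend_cad"] *= 1000  # assumed: City A reports spend in thousands of dollars
    df["municipality"] = path
    frames.append(df)

# One table, one schema: the spend figures can now actually be compared.
combined = pd.concat(frames, ignore_index=True)

Every one of those mappings is a guess somebody has to make by hand; publish to a shared schema in the first place and the guesswork disappears.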
#11: So in the review of appropriate data for a comparative analysis of budgetary processes, we found a plethora of different technologies at play, each technology providing data in different ways.
In the geo space we also see many many tools and technologies.
The expectation here was not to have found exactly the same data available for each location, but for them to at least be different dialects of the same language.
Instead what we found was enormous complexity.
As a geospatial guy, I kinda knew this would be the outcome. In fact, I must admit to rather enjoying watching this hackathon team struggle with a problem I often face.
Secretly I hoped they would find a way I had not identified, but I was personally validated by their final conclusion, which was that although open data is readily available to all, it is not, by design, all able to be used, and certainly not at the same time.
The barriers here are many and complex. They are human barriers, they are technology barriers.
We have environments, security, FOI, licensing and vendors to consider.
#12: This got me thinking.
Do we really care about any of these considerations?
How long will it be before you move on to the next piece of software for serving or disseminating your data?
When will we be adopting the next high-speed data transfer format?
With this in mind, it's worth questioning the process of publishing an open data website simply because you have a new tool to do so.
The real value of open data is OF COURSE the data, not the technology housing it, or the software supporting its distribution, but the *actual* data.
AND MORE SO, the values in those tables: the value of each data point increases every day with its temporal depth.
This happens entirely independently of technology; the value is inherent in the data itself.
Every day more data is added to that repository, and it becomes more valuable.
So, we should make sure we are capturing and publishing the right data. If we're not, then we are facing an opportunity cost to our investment in *data*.
OK, then, back to that hackathon: let's think about context.
#13: Well, without context, we can get a skewed impression of what our world looks like.
Unless we have a good idea of what is happening elsewhere, we might miss the bigger picture
Because, another way to add enormous value to our data is to publish it in commonly understood ways
Take cats, for instance
#14: University of Abster did a wonderful study.
They estimated there to be 14 billion images of cats on the internet (2.7% of which are pictures of cats with bread around their faces).
Indeed there are estimated to be only 220 million domestic cats in the world
But what's the point here? Well, this massively popular phenomenon derives from a combination of cuteness, convenience and compatibility.
Think about it this way:
Each cat picture data point is commonly understood by both the computer and the person. There are only a few popular image formats and, for the most part, they are well documented. The ability to take a cat picture is also somewhat ubiquitous. These data points are perhaps slightly different dialects of the same language of cat pictures.
So these are easy to share, they are easy to manipulate and they are easy to reuse.
Oh wait, that's what we want from open data too, yeah?
In fact, in terms of the multiplication factor of temporality we talked about earlier, consider the network effect of commonly publishing comparable datasets.
#15: A quick and easy example,
If you have any interest in seeing your data be easily pulled into larger analysis products or processes, then you might want to publish it in a commonly used reference system.
#16: We can argue the various merits of our favorite reference systems, but in reality these two systems are ubiquitous global mapping systems. I am absolutely sure that distance and area are far more accurate in your local conic conformal projection, and that your interrupted Goode homolosine looks wonderful up on your wall, but the world is speaking Web Mercator, so let's move on. In reality, this is usually a single line of code or a simple button press.
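(For the curious, a minimal sketch of that single line, assuming Python with the pyproj library, and using BC Albers, EPSG:3005, purely as an example source system:)

from pyproj import Transformer

# Reproject a coordinate from BC Albers (EPSG:3005) into Web Mercator (EPSG:3857)
# so it lines up with everybody else's web maps.
to_web_mercator = Transformer.from_crs("EPSG:3005", "EPSG:3857", always_xy=True)
x, y = to_web_mercator.transform(1223000.0, 1010000.0)  # an example easting/northing in metres
print(x, y)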
Committing to the commonwealth of data is the key.
#17: So, what's my point here?
Well, the key thing is that, for instance:
Every individual municipality's data becomes more valuable the more it can be commonly understood within the context of other municipalities.
Every Province, Territory, State or Entity's data becomes more useful the more it can be placed within a bigger context.
In short, I propose that we congratulate ourselves on making a huge leap forward in publishing data. But now we start to think about what to publish.
And ideally we try and publish the same thing.