A DecodedConf presentation on processing geospatial data at scale, centered on the eight patterns we've found while working on customer and partner projects at Microsoft.
14. This is a challenge with a large dataset:
A traditional relational database typically requires hand sharding to scale to PBs of data (e.g. Postgres).
Highly indexed non-relational solutions can be very expensive (e.g. MongoDB).
Lightly indexed solutions are a good fit because we really only have one query we need to execute against the data.
19. Total number of location samples in a geographical area.
Whole dataset operation.
20. Based on Apache Spark.
Offered as a service in Azure as HDInsight Spark.
Think of it as Hadoop: The Next Generation.
Better performance (10-100x).
Cleaner programming model.
21. Divides the world up into tiles.
Each tile has four children at the next higher zoom level.
Maps 2-dimensional space to 1 dimension.
22. For each location, map to tiles at every zoom level:
(36.9741, -122.0308) →
[(10_398_164, 1), (11_797_329, 1),
(12_1594_659, 1), (13_3189_1319, 1),
(14_6378_2638, 1), (15_12757_5276, 1),
(16_25514_10552, 1), (17_51028_21105, 1),
(18_102057_42211, 1)]
24. Reduce all these mappings with the same key into an aggregate value:
[(10_398_164, 15), (10_398_164, 28),
(10_398_164, 29), (10_398_164, 17),
(10_398_164, 31), (10_398_164, 2),
(10_398_164, 16), (10_398_164, 2),
(10_398_164, 11)] →
(10_398_164, 151)
25. Building the heatmap then boils down to this in Spark:
lines = sc.textFile('wasb://locations@loc.blob.core.windows.net/')
locations = lines.flatMap(json_loader)
heatmap = (locations
    .flatMap(tile_id_mapper)
    .reduceByKey(lambda agg1, agg2: agg1 + agg2))
heatmap.saveAsTextFile('wasb://heatmap@loc.blob.core.windows.net/')
#2: I'm Tim Park
So today I'm going to talk about processing geospatial data at scale
#3: We are living in a world where everything and everybody has location data associated with it either as part of the vehicle they are in, the app on the phone that they are using, or the package that they are sending.
Increasingly it means that this geospatial data is woven into the applications that we are building as developers.
My talk today is around processing that geospatial data in Azure.
#4: My team at Microsoft has done a number of projects with customers and partners with geospatial data.
We've used geotagged social data to help the UN monitor humanitarian zones.
#5: We've worked with Guide Dogs for the Blind to build a device that uses data from Open Street Maps.
The app helps blind people:
* Discover where they are
* Learn what's around them
* Navigate to locations
Sara Spalding is going to talk more about this overall project here at DecodedConf.
And Erik Schlegel is going to talk about how Guide Dogs has used Open Street Maps from a technical perspective as well
#6: And we've worked with the German physical display advertising firm Stroeer
to process geotagged data
to build emotion maps of the overall sentiment in different parts of the country,
and of how different advertising interventions have affected these sentiments on both a micro and macro scale.
In the course of working on all of these projects, we've identified a number of common patterns.
My talk today is about those patterns.
#7: To make this discussion simpler than using one of our project datasets,
I'm going to frame the talk today around a dataset near and dear to my heart.
I love to explore in the outdoors,
so I am going to talk about how we could take a large crowdsourced set of GPS traces from the outdoors
and turn it into a trail map of the whole world.
And along the way describe all of the patterns we've discovered in the projects that we have done.
#8: This scenario also allows me to open source the dataset
and the code that I used so that you can try it at home.
#9: So what does the shape of this dataset look like?
It's basically a whole lot of CSV files with location information that includes:
a user id,
an activity id (which identifies all of the data that is part of the same run, walk, hike, bike ride, etc.),
a timestamp,
and a latitude and longitude.
There are many many users
Who each have many activities
Which all have many timestamped locations.
#10: The dataset we are going to be working with is fairly large.
It consists of over 500GB of data,
which is more than fits on any one computing node,
and therefore we have to use larger scale data techniques in order to process it.
In this 500GB, we have over 2.5 million different activities,
with 3.5 billion different GPS locations associated with those activities
#12: Now that we have an idea of the dataset,
let's look at the final application we are trying to build,
so we have better context for what we need in terms of data processing.
#13: So we have this giant pile of incoming data.
So how do we process all of this at scale?
#14: Let's start by talking about how we store the GPS locations.
As we talked about previously,
an activity has a timestamp-ordered set of locations.
To display an activity, we need to pull all of the associated locations.
#15: So what we need from the storage system we use for them is the ability to pull a range of timestamps
With this, we can pull all of the locations for a particular activity.
#16: This is pretty standard stuff for a database but becomes a challenge with a large dataset
(reasons above)
#17: Fortunately, Azure has a very good solution for this.
Azure Table Storage sits somewhere just above blob storage.
First off, it is very inexpensive, at 2 cents per GB per month.
Unlike blob storage, you can access ranges and individual rows in the data.
You can only query on a set of RowKeys within the same PartitionKey.
But we only need to query on timestamp.
It also scales with constant performance up to 500TB per storage account,
which is well beyond our dataset.
And it satisfies our need to be able to query a range of user locations by timestamp range.
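A minimal sketch of what that timestamp-range query could look like, assuming the current azure-data-tables SDK, a 'locations' table keyed with PartitionKey = activity id and RowKey = ISO-8601 timestamp, and illustrative Latitude/Longitude columns:
from azure.data.tables import TableClient

conn_str = "<storage account connection string>"
table = TableClient.from_connection_string(conn_str, table_name="locations")

# All locations for one activity within a time window; because RowKey is an
# ISO-8601 timestamp, lexical order matches chronological order.
query = ("PartitionKey eq 'activity-12345' "
         "and RowKey ge '2016-05-01T00:00:00Z' "
         "and RowKey lt '2016-05-02T00:00:00Z'")
for entity in table.query_entities(query):
    print(entity['RowKey'], entity['Latitude'], entity['Longitude'])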
#18: Turning to activity storage now, one of the key queries we want to run is for the activities within a bounding box.
We also, in the future, want to be able to filter activities on distance or duration.
#19: What this essentially means is that unlike the location data, we need the activity data to be highly indexed.
We want to be able to do queries like
Give me all of the activities under 10km in length near Dublin or
Give me all of the activities over 1 hour in duration near Dublin
And this means we need the highlighted columns of this data to be indexed.
Which is not a great fit for the Azure Table Storage that we saw in the first pattern.
The good news is that there is three orders of magnitude less data here as well.
And this makes it a great fit for a relational database like MySQL, Postgres, or Azure SQL.
Setting up schema'd tables
allows us to make rich queries against the data.
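As a hedged illustration of the kind of rich query this enables, here is a sketch against a hypothetical PostGIS-enabled 'activities' table (the table and column names are assumptions), using psycopg2:
import psycopg2

conn = psycopg2.connect("dbname=trails user=trails")
cur = conn.cursor()

# "All of the activities under 10km in length near Dublin": ST_DWithin over the
# geography type takes its radius in meters; Dublin is roughly (lon -6.2603, lat 53.3498).
cur.execute("""
    SELECT id, distance_km, duration_minutes
    FROM activities
    WHERE distance_km < 10
      AND ST_DWithin(start_point::geography,
                     ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                     %s)
""", (-6.2603, 53.3498, 25000))

for activity_id, distance_km, duration_minutes in cur.fetchall():
    print(activity_id, distance_km, duration_minutes)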
#20: Which brings us to the second pattern for dealing with high scale data like this:
using polyglot persistence, which means using multiple database types to solve a particular application's needs,
each chosen because they excel at a portion of the problem that we are trying to solve.
As we saw before, I am using Azure Table Storage for the location data.
And for the activity data, I am using PostgreSQL + PostGIS.
The way this works is that we query for activities using Postgres
and then query for the location data, when needed, from Table Storage.
My colleague David Makogon is going to do a much longer talk about this topic,
and will describe the strengths of each database type and the scenarios in which each makes the most sense.
Ok, so that is how we are storing and querying activities.
We load locations into Table Storage and activities into Postgres at creation and then can query them.
#21: When we back out to the progressive heatmap of activities we saw, however, we realize we need a whole different set of tools.
Our heatmap is generated by summing up the number of location samples in a particular geographical area.
In this case, we are using small squares as the boundaries for our summaries.
This means that it is operating over the whole dataset, and given the size of the dataset, we need to open our big data toolbox to realistically accomplish this.
#22: I'm going to use HDInsight Spark to accomplish this here.
For those of you that have used Hadoop, it operates on data in a similar paradigm
But offers much better performance and a much nicer programming model than the original Hadoop engine.
HDInsight Spark is Azures hosted version of Apache Spark.
It makes it easy to spin up a cluster of machines to process the data
This allows you to ignore the operations work of running a Spark cluster and focus on the actual problem we are trying to solve.
#23: Before we dive into the Spark code
Let's discuss how we map a particular location point to a geographic summary.
One of the common patterns we've seen is using XYZ tiles as summarization buckets.
This is not something we've invented.
Google, Apple, and Open Street Maps use the same concept for addressing the tiles in their maps.
It is a fairly simple concept.
The top level world is divided into 4 tiles.
For each zoom level below that, you take the parent tile and divide it into 4 tiles.
An individual tile is then addressed by its zoom level and its row and column within that zoom level.
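As a small sketch, the standard Web Mercator (slippy map) tile math turns a coordinate into the zoom_row_column keys shown on the slides:
import math

def tile_id(lat, lon, zoom):
    # Standard Web Mercator tile math: column from longitude, row from the
    # Mercator projection of latitude, formatted as zoom_row_column.
    lat_rad = math.radians(lat)
    n = 2 ** zoom
    column = int((lon + 180.0) / 360.0 * n)
    row = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return '%d_%d_%d' % (zoom, row, column)

# tile_id(36.9741, -122.0308, 10) -> '10_398_164', matching the earlier slide example.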
#24: In order to compute our heatmap, we use a pretty standard-issue map/reduce algorithm.
For every location in the dataset, we generate a tile id key/value pair for every zoom level that we want results for.
In this case, we are generating tile key/value pairs for the zoom levels from 10 to 18
Because we know that that is the only set of tiles that the user interface will end up using.
#25: In Python, this is what the mapper looks like.
We first compute the tileIds for all of the zoom levels that we want to collect results for.
And then use that to build tuples for each tileId.
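A minimal sketch of that mapper, assuming the tile_id helper above and that each parsed location record carries 'latitude' and 'longitude' fields (the field names are assumptions):
def tile_id_mapper(location):
    # Emit one (tile id, 1) pair per zoom level from 10 through 18 for this location.
    return [(tile_id(location['latitude'], location['longitude'], zoom), 1)
            for zoom in range(10, 19)]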
#26: We then reduce all of these mappings with the same key down into its aggregate heatmap value.
In Spark, the first element in a tuple is considered the key
And the second element is considered the value
Which is to say, we take all of the locations with the same tile ids from the previous step and count them.
#27: With this in mind, the actual implementation in Spark is fairly straightforward.
We point it at the dataset in blob storage,
then parse it as JSON,
And then use the tile_id_mapper function we defined earlier to map each location to the appropriate zoom level result.
We then reduce all of these individual results by the key to get a final total for each tile result.
We implement the reducer as an anonymous function, called a lambda in Python, that essentially sums the intermediate aggregates for tiles with the same id.
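The json_loader referenced in the Spark job isn't shown in these notes; a minimal sketch, assuming each line of the input blobs is one JSON-encoded location record and that malformed lines should simply be dropped:
import json

def json_loader(line):
    # flatMap lets us return zero or one parsed record per input line,
    # so bad lines are silently skipped.
    try:
        return [json.loads(line)]
    except ValueError:
        return []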
#28: From a programming standpoint, Spark makes this look really easy.
But at the physical level, Spark is doing a lot of work for us.
Remember, we are working on a dataset that doesn't fit on any one machine,
so there will potentially be billions of these intermediate summaries floating around across the cluster.
During the map stage, any tile id could be generated by any of the mappers in the Spark cluster since locations are uniformly distributed in the dataset.
This means that there needs to be a shuffle step in which the mappings with the same tile id are routed to the same reducer so that a correct aggregate can be calculated.
The good news is that Spark handles all of this underneath the covers for us.
#29: * We don't usually have static data but data that is continually arriving.
* In this case, we have an app pushing activities through an API.
> Spark can't run directly against a database.
> Spark works better with a small number of large files.
> Ideally we'd like to map these small incoming activities to a set of large aggregate files.
* Event Hub acts as a giant cloud buffer in front of the backend infrastructure (a minimal producer sketch follows below).
* Data in it automatically expires in the future - this is a feature.
* It is helpful in 1) bursty situations and 2) when you don't need the data until the future.
> We then use Stream Analytics as a receiver on this Event Hub.
> Stream Analytics allows you to transform or summarize streams using a SQL-like language,
> but we are just using it to land the data.
* Using these pieces of infrastructure in conjunction with each other to enable incremental ingestion of data is a key pattern for high scale data ingestion.
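A hedged sketch of the producer side, using the current azure-eventhub SDK; the hub name, connection string, and payload shape are illustrative assumptions:
import json
from azure.eventhub import EventHubProducerClient, EventData

# The API frontend batches each incoming activity's locations into Event Hub,
# which buffers them for the downstream ingestion pipeline.
activity_locations = [{'activityId': 'a-1', 'timestamp': '2016-05-01T09:00:00Z',
                       'latitude': 53.3498, 'longitude': -6.2603}]

producer = EventHubProducerClient.from_connection_string(
    conn_str='<event hub namespace connection string>',
    eventhub_name='locations')

batch = producer.create_batch()
for location in activity_locations:
    batch.add(EventData(json.dumps(location)))
producer.send_batch(batch)
producer.close()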
#30: Incremental ingestion leads us to our next pattern
Which is processing this incremental data in slices.
We process these slices individually in a manner analogous to how we processed the whole dataset in the previous slides.
But since we are only processing an individual slice, we do not need to have nearly as large of a cluster to do the processing.
We then load in the previous complete result, fold in this newly computed partial heatmap, and then write out the new heatmap.
Although this adds a second step, we are operating on a much smaller set of data, and overall it is much more efficient.
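The fold step itself is simple; a minimal sketch, modeling both the previous complete heatmap and the newly computed partial heatmap as plain dicts of tile id to count:
def merge_heatmaps(complete, partial):
    # Add the slice's counts into a copy of the previous complete result.
    merged = dict(complete)
    for tile, count in partial.items():
        merged[tile] = merged.get(tile, 0) + count
    return merged

# merge_heatmaps({'10_398_164': 151}, {'10_398_164': 7, '10_398_165': 3})
# -> {'10_398_164': 158, '10_398_165': 3}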
#31: Many of you might have suspected it when you saw this the first time, but there is a fairly large set of data that is sent with each of these heatmap queries.
If we made a traditional database query against a datastore for each of these XYZ summaries, you'd likely end up with a solution that didn't scale well or scaled pretty expensively.
#32: Instead of querying for the heatmaps, we instead precompute the heatmap elements that should be displayed within a particular view.
We use a lower zoom level block, shown here, as a container for the higher zoom level elements, and then precompute what should be displayed.
This not only turns a querying problem into a problem of sending static text from our web frontends,
but also allows us to cache each of these result sets in the browser.
We store all of these result sets in blob storage, which at 2 cents per GB a month is very inexpensive compared to the comparable number of VMs that you'd need to do this.
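A rough sketch of that precompute step, assuming the zoom_row_column key format above and treating a lower zoom tile as the container for its higher zoom descendants (the container zoom level here is an illustrative choice):
from collections import defaultdict

def parent_tile(tile, parent_zoom):
    # Each zoom level halves the row/column resolution, so the ancestor at a
    # lower zoom is just a right shift of the row and column.
    zoom, row, column = (int(part) for part in tile.split('_'))
    shift = zoom - parent_zoom
    return '%d_%d_%d' % (parent_zoom, row >> shift, column >> shift)

def group_by_container(heatmap, parent_zoom=12):
    # Bucket every heatmap entry under its container tile; each bucket becomes
    # one precomputed result set written to blob storage.
    containers = defaultdict(dict)
    for tile, count in heatmap.items():
        containers[parent_tile(tile, parent_zoom)][tile] = count
    return containers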
#33: So architecturally, this looks like this.
We use the slice architecture that I talked about in a previous slide, use the deltas to determine which heatmaps require updates, build the new heatmaps for each of these result sets, and write them out to blob storage.
#34: One other thing that I glossed over when we described displaying activities:
as you remember, the application shows an elevation graph for each activity.
#35: But, also remember, our input data set does not have elevation as part of it.
#36: This is another common pattern.
It's nearly always the case that you need to transform your raw data by removing dirty data points and/or enriching it to include related data.
In this case, we want to transform the raw data we have by adding elevation data that we are sourcing from Bing.
and then output these enriched slices into their own blobs
And then use this enriched data for our processing.
We want to run this enrichment on the blobs as they are added to blob storage.
#37: The great news is that Azure has a nice new feature called Functions that does just this.
Azure Functions allows you to setup a trigger on a wide variety of events
Including queued events, http requests, service bus, timers, etc.
It builds on top of our Azure App Service WebJobs functionality
to provide a super easy interface.
When one of these triggers fires, a function that you have written is executed.
In this case, I wrote my function in JavaScript, but C# and a host of other languages are also supported.
I also set up this Azure Function to trigger on a new blob being added to a storage account container.
So each time a blob is created by the previous incremental ingestion pattern
... It is passed into this function so that it can be enriched with elevation data.
Note that the whole blob is passed into the function, so you need to make sure that the blobs for your scenario will fit in memory.
Once the locations in the blob have been enriched, we then push these out to another blob...
... Which we use downstream.
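The talk's function was written in JavaScript; purely as a hedged sketch of the same blob-trigger shape, here is roughly what it could look like in Python (Azure Functions also supports Python), assuming the blob trigger and output blob binding are declared in the function's function.json, and with a hypothetical lookup_elevations helper standing in for the real elevation service call:
import json
import azure.functions as func

def lookup_elevations(locations):
    # Hypothetical helper: return one elevation in meters per location,
    # e.g. by calling an elevation service such as Bing's.
    return [0.0 for _ in locations]

def main(inblob: func.InputStream, outblob: func.Out[str]) -> None:
    # The whole triggering blob is handed to the function, so it must fit in memory.
    lines = inblob.read().decode('utf-8').splitlines()
    locations = [json.loads(line) for line in lines if line]
    for location, elevation in zip(locations, lookup_elevations(locations)):
        location['elevation'] = elevation
    # Write the enriched records out through the output blob binding.
    outblob.set('\n'.join(json.dumps(location) for location in locations))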
#38: JSON is a fantastic format and very developer friendly.
I used it in this project for clarity, since it results in output that is human readable.
However, if you go to do a project like this for real, choose a data serialization format like Avro instead.
By establishing a schema, and using a binary serialization format, you can achieve a 60% size reduction.
This improves performance of deserializing and serializing the data from and to blob storage
It also linearly cuts your data storage costs.
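A hedged sketch of that swap using the fastavro package; the schema and field names are assumptions that mirror the dataset described earlier:
from fastavro import parse_schema, writer

# Define the record schema once; Avro then writes a compact binary encoding
# instead of repeating field names in every JSON document.
schema = parse_schema({
    'name': 'Location',
    'type': 'record',
    'fields': [
        {'name': 'userId', 'type': 'string'},
        {'name': 'activityId', 'type': 'string'},
        {'name': 'timestamp', 'type': 'string'},
        {'name': 'latitude', 'type': 'double'},
        {'name': 'longitude', 'type': 'double'},
    ],
})

records = [{'userId': 'u-1', 'activityId': 'a-1', 'timestamp': '2016-05-01T09:00:00Z',
            'latitude': 36.9741, 'longitude': -122.0308}]

with open('locations.avro', 'wb') as out:
    writer(out, schema, records)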
#39: Ok, that's what I have for you today.
I've open sourced a couple of things as part of this presentation that you should have a look at if you are interested in more details:
Geotile
Heatmap
These include a sample dataset that you can work against.
#40: The other thing I will share out is these slides via Twitter, so follow me @timpark if you'd like to get a copy of those.
And with that, thank you for coming out today for the talk, and I'd be happy to take any questions.