This document discusses DT's core analytical competencies in data engineering, analytics, and quantitative skills. It describes capabilities in areas such as data architecture, ETL, spatial data services, data transformation, reporting, data mining, spatial data mining, and quantitative skills in statistics, machine learning, spatial statistics and other applied mathematics. It also provides examples of analytics applied to problems involving time series anomaly detection, correlation, aggregation, graphs, movement patterns, and classification. Teams have degrees from top universities and expertise in fields like computer science, engineering, mathematics and social sciences.
Capabilities Brief Analytics
1. DT Core Analytical Competencies
Data Engineering
Data Architecture Design and Development
Large Scale Enterprise Architecture and Design
Migrate, Extract, Transform, and Load Data
Spatial, Multi-Domain, and Cloud-Based Data Services
Analytics Quantitative
Data Transformation and Ingestion
Dissemination and Reporting Tools
Data Mining, Exploitation, and Correlation Tools
Spatial Data Mining and Geographic Knowledge Discovery
Data Tactics Corporation Proprietary and Confidential Material
2. DT Core Analytical Competencies
The Team:
Graduates of top-tier universities, including Stanford, Caltech and MIT, with ties to these and local universities.
Degrees include Mathematics, Computer Science, Aeronautical Engineering, Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics and Social Sciences.
Competencies include data mining, machine learning, statistics, spatial statistics, Bayesian statistics, econometrics, computational geometry, spatial econometrics, applied mathematics, theoretical robotics, dynamic systems and control theory.
Foci include unsupervised cross-modal clustering algorithms, principal component analysis, independent component analysis, regression, spatial regression, geographically weighted regression, zeroth-order processing, nonlinear optimization, autoregressive models, time-series analysis, spatial regime models and HAC models.
Technical Competencies include
3. Data Tactics Analytics Cell
4. Analytics Competencies
Time Series Analytics (i)
Applying the ARIMA model in a parallelized environment to provide anomaly detection
Correlation Analytics (ii)
Brute force pairwise Pearson correlation over vectors in a cloud-backed engine
Aggregation Analytics (iii)
Aggregate micro-pathing: repurposing data to analyze and display movement patterns
Dwell time calculations: analytic to discover areas of interest based on movement activity
Graph Analytics (iv)
Discovering social interaction models and paradigms within network data
5. Analytics Competencies
Directional Spatio-Temporal Analytics (i)
Compare distributions with a focus on changes in morphology of the distribution and mobility of individual observations within the distribution over that same period of time over space (Wy)
Local Classification (ii)
Non-self-similarities & self-similarities; within and between group correlations.
Ecological Analytics (ii)
Regression Modeling
Spatial Regression
Spatial Regime Models
HAC Models
6. Data Tactics Data Repository
7. Quantitative Data Competencies
Proxy problems definition: Different problems lead to different questions, which lead to different data sets. The definition of the proxy problems determines whether a data source is acceptable.
Key dimensions of variability: Key dimensions such as time, space and identifier were targeted for collection. However, different proxy problems require different key dimensions.
Capturing scope: The following was explicitly captured:
Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
Data timespan (if time is a dimension)
Data geospatial footprint (if geospatial is a dimension)
Data volume (both in total GB and in total number of rows)
Determining dataset overlap
Capturing opinions: Current star ratings based on:
Data consistency, volume, and persistence
Data coverage (time and space)
Data precision (time and space)
Data genuineness (synthesized data is penalized)
Data distribution (e.g., we may have extremely precise geospatial data, but if there are only 40 unique geospatial points in the data, the geospatial aspects aren't that interesting)
Data dimensionality (higher dimensionality with reasonable distributions on each dimension is preferred)
8. Quantitative Data Holdings
Name of the data source
Date that statistics were last collected
Initial reviewer on data
Location of data on FTP site
Data format
Opinion of data quality
Source where data was acquired
Collection start / end dates if known
Description and notes on data source as well as collection information
Size of data (storage space and rows)
Geospatial coverage
Data handling requirements
9. Quantitative Data Holdings
Armed Conflict Location and Events Dataset (ACLED)
AIS Ship Data
Atmospherics Reports
BrightKite Data
Classified Ads
CNN
Digital Terrain Elevation Data (DTED)
Enron Data
Epinions Data
EU Email
Facebook
Flickr Data
Flight Information Data
Four Square Data
Friend Feed Data
Geolife Data
Gowalla Data
International Conference on Weblogs and Social Media (ICWSM) Data
Identica Data
IMDB Data
Knowledge Discovery and Data (KDD) Mining Tools Competition
KDD 2003 Data
KDD 2005 Data
Kiva Data
Landscan Data
LiveJournal Data
Meme Tracker
Meme Twitter TS
NFL Plays
Night Lights Data
Open Data Airtraffic accidents
Open Street Maps
Panoramio Data
Patent Citations Data
Photobucket Data
Picasa Web Albums Data
Processed Employment Data
Scamper Data
ISVG
Twitter
UNDP
Weather Data
Webgraphs
Youtube
10. Quantitative Data Competencies
Panoramio / Flickr: Metadata on uploaded public photos provides excellent geospatial and temporal resolution, along with user information. Estimated 250 million rows of photo metadata, with over 150 million already gathered.
AIS: Ship tracking data that provides ship pings as they progress in movement. Precise time and geospatial information provided. 50 million records and counting.
OpenStreetMaps: Over 2 billion geospatial points of mapping enthusiasts' tracks across the world. Time and userid information also included.
Gowalla / Brightkite: About 11 million FourSquare-style check-ins with user, location, and time information provided.
Example Proxy Problems:
Discovering holes in the data where photos are no longer taken, to detect avoided areas
Discovering relationships and links based on co-occurrence between users in time / space
Tracking and analyzing movement patterns on a local and global scale
Analyzing image data for changes in the same locations
Detecting differences in photo activity in an area over time
Detecting events based on abnormal photo activity behavior
Mapping UserIds across data sources to create a unified analytic picture
Detecting home range for each user
Defining patterns of life by routine activities and movement
Tracking language usage in areas to determine abnormal language presence in an area
Local vs tourist movement analysis and extraction
Trending of location popularity
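One of the proxy problems above, linking users by co-occurrence in time and space, can be sketched as follows. This is a minimal illustration and not the implementation behind these slides: the function name `cooccurrence_links`, the grid size `cell_deg`, the time window `window_s`, and the `(user_id, epoch_seconds, lat, lon)` record layout are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_links(checkins, cell_deg=0.001, window_s=300):
    """Count how often two users share a grid cell within the same time bucket.

    checkins: iterable of (user_id, epoch_seconds, lat, lon) tuples (hypothetical layout).
    Returns a dict mapping (user_a, user_b) to the number of shared buckets.
    """
    buckets = defaultdict(set)
    for user, ts, lat, lon in checkins:
        # Discretize space into cell_deg x cell_deg cells and time into window_s buckets.
        key = (int(lat // cell_deg), int(lon // cell_deg), int(ts // window_s))
        buckets[key].add(user)

    links = defaultdict(int)
    for users in buckets.values():
        # Every unordered pair of users seen in the same bucket gets one co-occurrence.
        for a, b in combinations(sorted(users), 2):
            links[(a, b)] += 1
    return dict(links)


# Hypothetical records: u1 and u2 appear together twice, u3 is elsewhere.
checkins = [
    ("u1", 1000, 48.85841, 2.29451),
    ("u2", 1100, 48.85842, 2.29452),
    ("u3", 1000, 40.00000, -74.00000),
    ("u1", 5000, 48.86061, 2.33761),
    ("u2", 5050, 48.86062, 2.33762),
]
links = cooccurrence_links(checkins)
```

Fixed buckets are cheap to compute at scale but miss pairs that straddle a cell or window boundary; overlapping windows would reduce that effect at the cost of extra counting.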
UNCLASSIFIED
11. Quantitative Data Competencies
Twitter: Sampled, ongoing collection of tweets with UserId and time. Some have precise location data, but this is not the norm. Collection pulls roughly 1-2 million tweets / day.
Example Proxy Problems:
Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain neighborhood)
Discovery of correlated trends (e.g., finding that people posting about a certain topic in an area correlates to higher crime in that area)
Tracking sentiment on certain topics and issues
Tracking language usage in areas to determine abnormal language presence in an area
12. Quantitative Data Competencies
How can we infer movement patterns from vast amounts of what appears to be just point data collected in time and associated with an identifier (e.g., UserId, bank account, etc.)?
Technique is applicable to Twitter, FourSquare and MANY other sources
Volume plot of photos binned by area on log scale
Paris as seen from Flickr over all time
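A volume plot of this kind can be approximated by counting points per grid cell and log-scaling the counts. The sketch below is illustrative only: the function name `binned_log_volume`, the cell size `cell_deg`, and the use of log10(1 + count) are assumptions, since the slide does not state the binning or the exact log transform.

```python
import math
from collections import Counter


def binned_log_volume(points, cell_deg=0.001):
    """Count (lat, lon) points per grid cell and return log10(1 + count) per cell."""
    counts = Counter(
        (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for lat, lon in points
    )
    # The log scale keeps a few very busy cells from drowning out everything else.
    return {cell: math.log10(1 + n) for cell, n in counts.items()}


# Hypothetical points: nine photos in one cell, one photo in another.
photo_points = [(48.8584, 2.2945)] * 9 + [(48.8606, 2.3376)]
volume = binned_log_volume(photo_points)
```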
13. Quantitative Data Competencies
1. Goal: to catch active movement between locations a small distance apart
2. Typically two to around a dozen points chained together
3. Located in a small area, but with a definite path through the area
4. Sampled in rapid succession (less than X seconds between points)
5. Thousands or millions of micro-paths make a full path to view
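The five points above can be sketched as a segmentation of each identifier's time-ordered points. This is a hedged illustration, not the production analytic: the function names, the `(user_id, timestamp, lat, lon)` record layout and the sample coordinates are assumptions, and the defaults of 60 seconds and 80 km/hour are borrowed from the Paris view shown later in this deck rather than from a stated rule.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def micro_paths(points, max_gap_s=60.0, max_speed_kmh=80.0):
    """Chain each user's points into micro-paths of two or more points.

    A segment is ignored (and the chain is cut) when the time gap exceeds
    max_gap_s or the implied velocity exceeds max_speed_kmh.
    """
    by_user = {}
    for user, ts, lat, lon in points:
        by_user.setdefault(user, []).append((ts, lat, lon))

    paths = []
    for track in by_user.values():
        track.sort()  # order each user's points by timestamp
        current = [track[0]]
        for prev, cur in zip(track, track[1:]):
            gap_s = (cur[0] - prev[0]).total_seconds()
            dist_km = haversine_km(prev[1], prev[2], cur[1], cur[2])
            too_slow = gap_s > max_gap_s
            too_fast = gap_s <= 0 or dist_km / (gap_s / 3600.0) > max_speed_kmh
            if too_slow or too_fast:
                if len(current) >= 2:
                    paths.append(current)
                current = [cur]
            else:
                current.append(cur)
        if len(current) >= 2:
            paths.append(current)
    return paths


# Hypothetical coordinates; the 120 second gap before 12:37:25 cuts the chain.
photos = [
    ("A", datetime(2012, 8, 15, 12, 34, 59), 48.85800, 2.29450),
    ("A", datetime(2012, 8, 15, 12, 35, 11), 48.85810, 2.29450),
    ("A", datetime(2012, 8, 15, 12, 35, 25), 48.85820, 2.29450),
    ("A", datetime(2012, 8, 15, 12, 37, 25), 48.85830, 2.29450),
    ("A", datetime(2012, 8, 15, 12, 37, 35), 48.85840, 2.29450),
    ("A", datetime(2012, 8, 15, 12, 37, 46), 48.85850, 2.29450),
]
paths = micro_paths(photos)
```

Points with identical timestamps are treated as too fast here because no velocity can be computed for them; that choice is also an assumption.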
Figure: A micropath example forming a common path pattern (Person A, Person B, Person C). Photos taken on 2012-08-15 at 12:34:59, 12:35:11, 12:35:25, 12:37:25, 12:37:35 and 12:37:46; labeled gaps of 3 seconds and 10 seconds.
Segment ignored: 120 seconds between points
Segment ignored: Velocity too fast
Overlay thousands / millions of these tiny micropaths together
and you get
14. Quantitative Data Competencies
View of Paris using a 60 second segment timeout and 80 km/hour cutoff on Flickr data
Map labels: Arc de Triomphe, Place de la Concorde, Louvre, Eiffel Tower, Notre Dame
Apparent typical approach pathway to the Arc
Place de la Concorde typically approached from southern direction
Harder to see, but you can see the typical approach / exit pathways from Notre Dame.
Red strip appears to be line of sight to the Eiffel Tower
15. Quantitative Data Competencies
Aggregate micro-pathing on a world of photo metadata with no speed,
time, or distance restrictions
16. Quantitative Data Competencies
AIS ship tracking micro-path blanket with no time / space filters
Japan's south coast
China's coast with high levels of activity
Coast of Taiwan
17. Quantitative Data Competencies
Flickr Paris 2004 changes vs 2005
Hh: [HIGH, high] - an increase between X_t1 -> X_t2 relative to the respective (X_t1, X_t2) reference distribution, where t1, t2 belong to T. HIGH reflects a strong increase of one's own values (dx_i) at location i between t1 and t2 relative to the change of neighboring values (dy). high reflects a modest increase of dy relative to values of dx.
lL: [low, LOW] - a decrease between X_t1 -> X_t2 relative to the respective (X_t1, X_t2) reference distribution, where t1, t2 belong to T. low reflects a modest decrease of one's own values (dx_i) at location i between t1 and t2 relative to the change of neighboring values (dy). LOW reflects a strong decrease of neighboring values (dy) relative to values of dx.
Neighbors are defined with the spatially lagged variable Wy, as the eight nearest observations.
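A rough sketch of this two-letter classification follows. The spatial lag is taken as the mean change over the eight nearest observations, as stated above. Everything else is an assumption for illustration: the function names, the brute-force neighbor search, and in particular the rule that a change counts as strong when its magnitude reaches one standard deviation of the own-value changes, since the slide does not give a cutoff.

```python
import math
import statistics


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean, brute force)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return [j for _, j in sorted(dists)[:k]]


def label(change, cutoff):
    """'H'/'h' for an increase, 'L'/'l' for a decrease; upper case means strong."""
    strong = cutoff > 0 and abs(change) >= cutoff
    if change >= 0:
        return "H" if strong else "h"
    return "L" if strong else "l"


def directional_classes(points, x_t1, x_t2, k=8, strong=1.0):
    """Return one two-letter class per location: own change, then neighbor change."""
    dx = [b - a for a, b in zip(x_t1, x_t2)]
    # Spatial lag of the change: mean change over the k nearest observations.
    dy = [
        statistics.fmean(dx[j] for j in knn_indices(points, i, k))
        for i in range(len(points))
    ]
    # Assumed reference scale: spread of the own-value changes.
    cutoff = strong * statistics.pstdev(dx)
    return [label(a, cutoff) + label(b, cutoff) for a, b in zip(dx, dy)]


# Hypothetical 3 x 3 grid: the center cell gains far more photos than its neighbors.
grid = [(float(r), float(c)) for r in range(3) for c in range(3)]
before = [10.0] * 9
after = [11.0] * 4 + [30.0] + [11.0] * 4
classes = directional_classes(grid, before, after)
```

The brute-force neighbor search is quadratic in the number of locations; a spatial index would be the usual replacement at scale.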
Flickr Paris 2011 changes vs 2010
18. Quantitative Data Competencies
Paris
Plot labels: Day in year; Number of distinct photographers
New Year provides lots of photos
Bastille Day
Recurrent red strips show the recurring weekend
19. Quantitative Data Competencies
Caracas
Plot labels: Day in year; Number of distinct photographers
5 day Carnival celebration
Some interesting dates for low volume activity
Image from www.flickr.com/photos/globovision/6911554143
20. Quantitative Data Competencies
Airline Flight Data Anomaly Detection
During an unusual event, such as the winter storm shown below, the ARIMA model still follows the pattern but doesn't match as well. The areas where the red and black lines don't match are where unusual events have occurred.
Figure: two time series plots of ZeroFill against Index.
Plot of the count of points where the difference between the expected number of flights leaving an airport based on the model and the actual observed number of flights was statistically significant.
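The idea of flagging points where observed counts depart from the model's expectation can be sketched with the simplest member of the ARIMA family, an AR(1) model, i.e. ARIMA(1,0,0). The slides do not state the model order, the significance test, or the parallelization scheme, so the function names, the least-squares fit and the three-sigma threshold below are assumptions chosen to keep the example self-contained.

```python
import statistics


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1] + e[t]; returns (c, phi)."""
    x_prev, x_next = series[:-1], series[1:]
    mean_prev = statistics.fmean(x_prev)
    mean_next = statistics.fmean(x_next)
    sxx = sum((a - mean_prev) ** 2 for a in x_prev)
    sxy = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    phi = sxy / sxx if sxx > 0 else 0.0
    return mean_next - phi * mean_prev, phi


def anomalies(train, observed, threshold=3.0):
    """Indices in observed whose one-step forecast error exceeds threshold * sigma."""
    c, phi = fit_ar1(train)
    # Residual spread on the training data defines what counts as unusual.
    residuals = [b - (c + phi * a) for a, b in zip(train, train[1:])]
    sigma = statistics.pstdev(residuals)
    flagged = []
    for t in range(1, len(observed)):
        expected = c + phi * observed[t - 1]
        if abs(observed[t] - expected) > threshold * sigma:
            flagged.append(t)
    return flagged


# Hypothetical hourly departure counts with a repeating pattern.
train = [40 + (i % 4) for i in range(40)]
observed = [40 + (i % 4) for i in range(12)]
observed[6] = 5  # simulated storm: departures collapse
flagged = anomalies(train, observed)
```

Because each forecast is conditioned on the previous observation, the point immediately after an anomaly can be flagged as well. Fitting one such model per airport is independent work, which is what makes the approach easy to parallelize.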
21. Quantitative Data Competencies
Raw data file:
Each line is a comma separated list of values.
key1, timestamp, value
key2, timestamp, value
Cloud-backed transformation
Vector file:
Each line has a key and a comma separated list of values.
Key1 2.4,3.4,0.99,
Key2 3.4,4.3,1.0,0.6.
..
Correlation analytic
For each vector calculate the correlation to each other vector. We use a Pearson correlation.
       key1   Key2   Key3   Key4
Key1   -      0.93   0.43   0.001
Key2   -      -      -0.5   -0.03
Key3   -      -      -      .32
Key4   -      -      -      -
Implemented in: Python (RAM), Hive, Mahout, Spark, Giraph, Cascalog
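The two steps on this slide, turning raw rows into one vector per key and then correlating every pair of vectors, can be sketched in memory as follows. This is an illustration in the spirit of the Python (RAM) variant, not the actual code: the function names are hypothetical, timestamps are assumed to be numeric, and all vectors are assumed to have the same length.

```python
import math
from collections import defaultdict
from itertools import combinations


def rows_to_vectors(lines):
    """Group 'key, timestamp, value' rows into one time-ordered value vector per key."""
    series = defaultdict(list)
    for line in lines:
        key, timestamp, value = (field.strip() for field in line.split(","))
        series[key].append((float(timestamp), float(value)))
    return {key: [v for _, v in sorted(rows)] for key, rows in series.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(vectors):
    """Brute-force upper triangle of the correlation matrix, keyed by (key_a, key_b)."""
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Hypothetical raw rows; key3 arrives out of time order on purpose.
raw = [
    "key1, 1, 1.0", "key1, 2, 2.0", "key1, 3, 3.0",
    "key2, 1, 2.0", "key2, 2, 4.0", "key2, 3, 6.0",
    "key3, 3, 1.0", "key3, 1, 3.0", "key3, 2, 2.0",
]
vectors = rows_to_vectors(raw)
matrix = pairwise_correlations(vectors)
```

Only the upper triangle is computed because the matrix is symmetric, which matches the layout of the table above; the work still grows with the square of the number of keys.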
22. Quantitative Data Competencies
Approximation engine for the O(n²) correlation matrix problem
Components shown: Training Engine, Test Engine, Spark
Technique based on Google Correlate
Approximation provides orders of magnitude of speedup when compared to equivalent brute force methods. The technique works best for highly correlated items and uses a series of data projections, unsupervised learning, and vector quantization to provide dimensionality reduction for incoming complex vectors.