
DT Core Analytical Competencies
Data Engineering
⁻ Data Architecture Design and Development
⁻ Large Scale Enterprise Architecture and Design
⁻ Migrate, Extract, Transform, and Load Data
⁻ Spatial, Multi-Domain, and Cloud-Based Data Services

Analytics – Quantitative
⁻ Data Transformation and Ingestion
⁻ Dissemination and Reporting Tools
⁻ Data Mining, Exploitation, and Correlation Tools
⁻ Spatial Data Mining and Geographic Knowledge Discovery

Data Tactics Corporation Proprietary and Confidential Material

DT Core Analytical Competencies
The Team:

Graduates of top-tier universities, including Stanford, Caltech, and MIT, with ties to these and to local universities.

Degrees include Mathematics, Computer Science, Aeronautical Engineering, Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics, and Social Sciences.

Competencies include data mining, machine learning, statistics, spatial statistics, Bayesian statistics, econometrics, computational geometry, spatial econometrics, applied mathematics, theoretical robotics, dynamical systems, and control theory.

Foci include unsupervised cross-modal clustering algorithms, principal component analysis, independent component analysis, regression, spatial regression, geographically weighted regression, zeroth-order processing, nonlinear optimization, autoregressive models, time-series analysis, spatial regime models, and HAC models.

Technical Competencies include

Data Tactics Analytics Cell


Analytics Competencies

• Time Series Analytics (i)
  • Applying the ARIMA model in a parallelized environment to provide anomaly detection
• Correlation Analytics (ii)
  • Brute-force pairwise Pearson's correlation over vectors in a cloud-backed engine
• Aggregation Analytics (iii)
  • Aggregate micro-pathing
  • Repurposing data to analyze and display movement patterns
  • Dwell time calculations
  • Analytic to discover areas of interest based on movement activity
• Graph Analytics (iv)
  • Discovering social interaction models and paradigms within network data

[Figures (i)–(iv): example output for each analytic; panel (i) plots ZeroFill against Index.]

Analytics Competencies
• Directional Spatio-Temporal Analytics (i)
  • Compare distributions with a focus on changes in the morphology of the distribution and on the mobility of individual observations within the distribution over the same period of time and over space (Wy)
• Local Classification (ii)
  • Non-self-similarities and self-similarities; within-group and between-group correlations
• Ecological Analytics
  • Regression Modeling
  • Spatial Regression
  • Spatial Regime Models
  • HAC Models

[Figures (i) and (ii): example output for the labeled analytics.]

Data Tactics Data Repository


Quantitative Data Competencies
• Proxy problem definition – Different problems lead to different questions, which lead to different data sets. The acceptability of a data source is judged against the definition of the proxy problems.
• Key dimensions of variability – Key dimensions such as time, space, and identifier were targeted for collection. However, different proxy problems require different key dimensions.
• Capturing scope – The following was explicitly captured:
  • Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
  • Data timespan (if time is a dimension)
  • Data geospatial footprint (if geospatial is a dimension)
  • Data volume (both in total GB and in total number of rows)
• Determining dataset overlap
• Capturing opinions – Current star ratings are based on:
  • Data consistency, volume, and persistence
  • Data coverage (time and space)
  • Data precision (time and space)
  • Data "genuineness" (synthesized data is penalized)
  • Data distribution (e.g., we may have extremely precise geospatial data, but if there are only 40 unique geospatial points in the data, the geospatial aspects aren't that interesting)
  • Data dimensionality (higher dimensionality with reasonable distributions on each dimension is preferred)
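
As a rough illustration of the scope and distribution checks above, the sketch below profiles a small set of records. The record layout `(timestamp, lat, lon)`, the function name `profile`, and the sample values are assumptions made for this example, not part of the original brief.

```python
from datetime import datetime

def profile(records):
    """Summarize scope for records shaped as (timestamp, lat, lon) (hypothetical layout)."""
    records = list(records)
    times = [t for t, _, _ in records]
    lats = [lat for _, lat, _ in records]
    lons = [lon for _, _, lon in records]
    return {
        "rows": len(records),                                      # data volume in rows
        "timespan": (min(times), max(times)),                      # data timespan
        "footprint": (min(lats), min(lons), max(lats), max(lons)), # bounding box
        # Few unique points means the geospatial dimension is not very interesting.
        "unique_points": len({(lat, lon) for _, lat, lon in records}),
    }

# Made-up sample records for illustration only.
records = [
    (datetime(2012, 8, 15, 12, 34, 59), 48.8738, 2.2950),
    (datetime(2012, 8, 15, 12, 35, 11), 48.8739, 2.2950),
    (datetime(2012, 8, 15, 12, 37, 46), 48.8738, 2.2950),
]
summary = profile(records)
```

Here three rows collapse to two unique points, which is the kind of signal the "data distribution" rating is meant to capture.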

Quantitative Data Holdings
Fields recorded for each data holding:
• Name of the data source
• Date that statistics were last collected
• Initial reviewer on data
• Location of data on FTP site
• Data format
• Opinion of data quality
• Source where data was acquired
• Collection start / end dates – if known
• Description and notes on data source as well as collection information
• Size of data (storage space and rows)
• Geospatial coverage
• Data handling requirements


Quantitative Data Holdings
• Armed Conflict Location and Events Dataset (ACLED)
• AIS Ship Data
• Atmospherics Reports
• BrightKite Data
• Classified Ads
• CNN
• Digital Terrain Elevation Data (DTED)
• Enron Data
• Epinions Data
• EU Email
• Facebook
• Flickr Data
• Flight Information Data
• Four Square Data
• Friend Feed Data
• Geolife Data
• Gowalla Data
• International Conference on Weblogs and Social Media (ICWSM) Data
• Identica Data
• IMDB Data
• Knowledge Discovery and Data (KDD) Mining Tools Competition
• KDD 2003 Data
• KDD 2005 Data
• Kiva Data
• Landscan Data
• LiveJournal Data
• Meme Tracker
• Meme Twitter TS
• NFL Plays
• Night Lights Data
• Open Data Airtraffic accidents
• Open Street Maps
• Panoramio Data
• Patent Citations Data
• Photobucket Data
• Picasa Web Albums Data
• Processed Employment Data
• Scamper Data
• ISVG
• Twitter
• UNDP
• Weather Data
• Webgraphs
• Youtube


Panoramio / Flickr – Metadata on uploaded public photos offers excellent geospatial and temporal resolution and also includes user information. An estimated 250 million rows of photo metadata exist, of which over 150 million have already been gathered.
AIS – Ship tracking data that provides ship 'pings' as vessels move. Precise time and geospatial information is provided. 50 million records and counting.
OpenStreetMap – Over 2 billion geospatial points from mapping enthusiasts' tracks across the world. Time and user ID information is also included.
Gowalla / Brightkite – About 11 million Foursquare-style check-ins with user, location, and time information.

Example Proxy Problems:
• Discovering “Holes” in the data where photos are no longer taken to detect avoided areas
• Discovering relationships and links based on co-occurrence between users in time / space
• Tracking and analyzing movement patterns on a local and global scale
• Analyzing image data for changes in the same locations
• Detecting differences in photo activity in an area over time
• Detecting events based on abnormal photo activity behavior
• Mapping UserIds across data sources to create a unified analytic picture
• Detecting home range for each user
• Defining patterns of life by routine activities and movement
• Tracking language usage in areas to determine abnormal language presence in an area
• Local vs tourist movement analysis and extraction
• Trending of location popularity

UNCLASSIFIED 12

Twitter – Sampled, ongoing collection of tweets with UserId and time. Some tweets also carry precise location data, but this is not the norm. The collection pulls roughly 1 to 2 million tweets per day.
Example Proxy Problems:
• Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain
neighborhood)
• Discovery of correlated trends (e.g., finding that people posting about a certain topic in an
area correlates to higher crime in that area)
• Tracking sentiment on certain topics and issues
• Tracking language usage in areas to determine abnormal language presence in an area


• How can we infer movement patterns from vast amounts of what appears to be just point data collected in time and associated with an identifier (e.g., UserId, bank account, etc.)?
• The technique is applicable to Twitter, Foursquare, and many other sources

Volume plot of photos binned by area on log scale
Paris as seen from Flickr over all time
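
A volume plot of this kind can be approximated by counting points per grid cell and log-scaling the counts. The sketch below is only an illustration: the cell size `cell_deg`, the function names, and the sample coordinates are assumptions, not details from the original analysis.

```python
import math
from collections import Counter

def bin_counts(points, cell_deg=0.01):
    """Count (lat, lon) points per grid cell; cell_deg is a hypothetical cell size in degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

def log_volume(counts):
    """Map each cell count to log10(1 + count) so dense cells do not swamp sparse ones."""
    return {cell: math.log10(1 + n) for cell, n in counts.items()}

# Made-up sample: 99 photos in one cell, 9 in another, 1 in a third.
points = [(48.8584, 2.2945)] * 99 + [(48.8606, 2.3376)] * 9 + [(48.8530, 2.3499)]
counts = bin_counts(points)
volume = log_volume(counts)
```

On the log scale the 99-photo cell scores 2.0 and the 9-photo cell scores 1.0, so a hundredfold range in raw volume stays readable on one color ramp.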


1. Goal: to catch active movement between locations a small distance apart
2. Typically two to around a dozen points chained together
3. Located in a small area, but with a definite path through the area
4. Sampled in rapid succession (less than X seconds between points)
5. Thousands or millions of micro-paths make a full path to view
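
The rules above can be sketched as a filter over consecutive points per identifier. This is a minimal illustration under stated assumptions: the record layout `(user, timestamp, lat, lon)`, the function names, the default thresholds (60 seconds and 80 km/h, borrowed from the Paris view later in this brief), and the sample points are not the original implementation.

```python
import math
from collections import defaultdict
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def micro_segments(records, timeout_s=60, max_kmh=80):
    """Link consecutive points of the same user into short segments.

    A pair is dropped when the time gap exceeds timeout_s or when the
    implied speed exceeds max_kmh (both thresholds are assumptions).
    """
    by_user = defaultdict(list)
    for user, ts, lat, lon in records:
        by_user[user].append((ts, (lat, lon)))
    segments = []
    for user, pts in by_user.items():
        pts.sort()  # order each user's points by time
        for (t0, p0), (t1, p1) in zip(pts, pts[1:]):
            dt = (t1 - t0).total_seconds()
            if dt <= 0 or dt > timeout_s:
                continue  # segment ignored: too long between points
            if haversine_km(p0, p1) / (dt / 3600) > max_kmh:
                continue  # segment ignored: velocity too fast
            segments.append((user, p0, p1))
    return segments

day = (2012, 8, 15)
records = [  # made-up positions; times follow the pattern of the slide example
    ("A", datetime(*day, 12, 34, 59), 48.8738, 2.2950),
    ("A", datetime(*day, 12, 35, 11), 48.8739, 2.2950),  # kept: 12 s, a few metres
    ("A", datetime(*day, 12, 35, 25), 48.8839, 2.2950),  # dropped: about 1 km in 14 s
    ("A", datetime(*day, 12, 37, 25), 48.8840, 2.2950),  # dropped: 120 s gap
    ("A", datetime(*day, 12, 37, 35), 48.8841, 2.2950),  # kept: 10 s, a few metres
]
segments = micro_segments(records)
```

Only two of the four candidate segments survive here. Drawing every surviving segment from all users on one map gives the aggregate micro-path view.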
[Figure: a micropath example. Timestamped photos taken by Person A, Person B, and Person C (2012-08-15 12:34:59, 12:35:11, 12:35:25, 12:37:25, 12:37:35, 12:37:46) are chained into short segments, with gaps such as 3 seconds and 10 seconds between linked points, forming a common path pattern. Two segments are ignored: one with 120 seconds between points and one whose velocity is too fast.]
Overlay thousands / millions of these tiny micropaths together
and you get…

View of Paris using a 60 second segment timeout and 80 km/hour cutoff on Flickr data
[Figure: micro-path map of Paris with labeled landmarks: Arc de Triomphe, Place de la Concorde, Louvre, Eiffel Tower, Notre Dame.]
• Apparent typical approach pathway to the Arc
• Place de la Concorde typically approached from southern direction
• Harder to see, but you can see the typical approach / exit pathways from Notre Dame
• Red strip appears to be line of sight to the Eiffel Tower


Aggregate micro-pathing on a world of photo metadata with no speed,
time, or distance restrictions


AIS ship tracking micro-path blanket with no time / space filters

[Figure labels:]
• Japan's south coast
• China's coast with high levels of activity
• Coast of Taiwan


Flickr Paris 2004 changes vs 2005
Hh [HIGH, high]: an increase from X_t1 to X_t2 relative to the respective (X_t1, X_t2) reference distributions, where t1 and t2 belong to T. HIGH reflects a strong increase of a location's own value (dx_i) at location i between t1 and t2 relative to the change of neighboring values (dy). high reflects a modest increase of dy relative to the values of dx.

lL [low, LOW]: a decrease from X_t1 to X_t2 relative to the respective (X_t1, X_t2) reference distributions, where t1 and t2 belong to T. low reflects a modest decrease of a location's own value (dx_i) at location i between t1 and t2 relative to the change of neighboring values (dy). LOW reflects a strong decrease of the neighboring values (dy) relative to dx.

In both cases, neighbors are defined with the spatially lagged variable Wy as the eight nearest observations.
Flickr Paris 2011 changes vs 2010
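
One possible reading of the Hh / lL labels is sketched below: each location's own change dx_i is compared with the mean change dy_i of its eight nearest neighbors. The labeling rule (sign gives H or L, the larger magnitude gets the capital letter), the function names, and the 3×3 sample grid are assumptions made for this example; the original classification is relative to reference distributions and may differ.

```python
import math

def spatial_lag(coords, values, k=8):
    """Mean of `values` over each point's k nearest neighbors (a row-standardized lag)."""
    lags = []
    for i, ci in enumerate(coords):
        others = sorted((math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i)
        nearest = [j for _, j in others[:k]]
        lags.append(sum(values[j] for j in nearest) / len(nearest))
    return lags

def classify(dx, dy):
    """Label own change (first letter) and neighbor change (second letter).

    Assumed rule: positive change is H, otherwise L; whichever of the two
    changes is larger in magnitude is written in upper case.
    """
    own = "H" if dx > 0 else "L"
    nbr = "H" if dy > 0 else "L"
    if abs(dx) >= abs(dy):
        return own + nbr.lower()
    return own.lower() + nbr

# Made-up 3x3 grid: the centre cell grows strongly, the rest grow modestly.
coords = [(r, c) for r in range(3) for c in range(3)]
x_t1 = [10.0] * 9
x_t2 = [12.0, 12.0, 12.0, 12.0, 30.0, 12.0, 12.0, 12.0, 12.0]
dx = [b - a for a, b in zip(x_t1, x_t2)]
dy = spatial_lag(coords, dx, k=8)
labels = [classify(own, nbr) for own, nbr in zip(dx, dy)]
```

In this toy grid the centre cell is labeled `Hh` (strong own increase, modest neighbor increase) and every other cell is labeled `hH`.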


Paris
[Figure: number of distinct photographers by day in year.]
• New Year provides lots of photos
• Bastille Day
• Recurrent red strips show the recurring weekend
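
The plotted measure, the number of distinct photographers per day in year, can be computed with a simple grouping step. The record layout `(user_id, date)` and the sample records below are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

def distinct_users_by_day(records):
    """Count distinct user ids per (year, day-of-year) from (user_id, date) records."""
    users = defaultdict(set)
    for user_id, day in records:
        users[(day.year, day.timetuple().tm_yday)].add(user_id)
    return {key: len(ids) for key, ids in sorted(users.items())}

# Made-up sample: two photographers on New Year's Day, one on Bastille Day.
records = [
    ("a", date(2011, 1, 1)),
    ("b", date(2011, 1, 1)),
    ("a", date(2011, 1, 1)),   # repeat uploads by the same user count once
    ("c", date(2011, 7, 14)),
]
per_day = distinct_users_by_day(records)
```

Counting distinct photographers rather than photos keeps a single prolific uploader from looking like an event.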

Caracas
[Figure: number of distinct photographers by day in year.]
• 5 day Carnival celebration
• Some interesting dates for low volume activity
Image from www.flickr.com/photos/globovision/6911554143

Airline Flight Data Anomaly Detection
During an unusual event, such as the winter storm shown below, the ARIMA model still follows the pattern but doesn't match as well. The areas where the red and black lines don't match are where unusual events have occurred.

[Figure: two panels plotting ZeroFill against Index; axis ticks show 0, 40, and 02-13.]

Plot of the count of points where the difference between the expected number of flights leaving an airport, based on the model, and the actual observed number of flights was statistically significant.
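
The comparison of expected and observed counts can be illustrated with a deliberately small model. The sketch below fits an ARIMA(1,1,0) by conditional least squares and flags points whose one-step-ahead error exceeds three standard deviations. The model order, the threshold `z`, the function names, and the sample series are all assumptions; the brief does not state which ARIMA order or significance test was used, and a production system would more likely rely on a statistics library.

```python
import statistics

def fit_arima_110(series):
    """Least-squares fit of d_t = c + phi * d_(t-1), where d is the first difference."""
    d = [b - a for a, b in zip(series, series[1:])]
    x, y = d[:-1], d[1:]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)  # zero for a perfectly linear series
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    phi = sxy / sxx
    return my - phi * mx, phi

def anomalies(series, z=3.0):
    """Indices where |observed - expected| exceeds z standard deviations of the errors."""
    c, phi = fit_arima_110(series)
    residuals = {}
    for t in range(2, len(series)):
        expected = series[t - 1] + c + phi * (series[t - 1] - series[t - 2])
        residuals[t] = series[t] - expected
    sigma = statistics.pstdev(list(residuals.values()))
    return [t for t, r in residuals.items() if abs(r) > z * sigma]

# Made-up departures per interval: a steady 40/41 rhythm with one storm-like drop.
flights = [40 if t % 2 == 0 else 41 for t in range(30)]
flights[20] = 5
flagged = anomalies(flights)
```

For this series only index 20, the storm-like drop, is flagged. Counting flagged indices per airport or per day gives a plot like the one described above.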

Raw data file: each line is a comma-separated list of values.

key1, timestamp, value
key2, timestamp, value
…

Cloud-backed transformation

Vector file: each line has a key and a comma-separated list of values.

Key1 2.4,3.4,0.99,…
Key2 3.4,4.3,1.0,0.6….
…..

Correlation analytic: for each vector, calculate the correlation to each other vector. We use a Pearson correlation.

        Key1   Key2   Key3   Key4
Key1    -      0.93   0.43   0.001
Key2    -      -      -0.5   -0.03
Key3    -      -      -      .32
Key4    -      -      -      -

Implemented in:
• Python (RAM)
• Hive
• Mahout
• Spark
• Giraph
• Cascalog
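
A minimal in-memory sketch of this pipeline follows: raw `key, timestamp, value` lines are grouped into one vector per key, and a Pearson correlation is computed for every pair of keys. It assumes numeric timestamps, equal-length vectors aligned on the same timestamps, and non-constant vectors; the function names and sample lines are made up for illustration and do not represent any of the implementations listed above.

```python
import math
from collections import defaultdict
from itertools import combinations

def to_vectors(lines):
    """Group 'key, timestamp, value' lines into one time-ordered value vector per key."""
    rows = defaultdict(list)
    for line in lines:
        key, timestamp, value = (field.strip() for field in line.split(","))
        rows[key].append((float(timestamp), float(value)))  # numeric timestamps assumed
    return {key: [v for _, v in sorted(pairs)] for key, pairs in rows.items()}

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def pairwise(vectors):
    """Brute-force upper triangle: one correlation per unordered pair of keys."""
    return {(k1, k2): pearson(vectors[k1], vectors[k2])
            for k1, k2 in combinations(sorted(vectors), 2)}

raw = [  # made-up sample lines
    "key1, 1, 1.0", "key1, 2, 2.0", "key1, 3, 3.0",
    "key2, 1, 2.0", "key2, 2, 4.0", "key2, 3, 6.0",
    "key3, 3, 1.0", "key3, 1, 3.0", "key3, 2, 2.0",
]
vectors = to_vectors(raw)
matrix = pairwise(vectors)
```

The brute-force step computes n(n-1)/2 correlations for n keys, which is why the work is O(n²) and why it was pushed to a cloud-backed engine for large key sets.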


Approximation engine for the O(n²) correlation matrix problem
[Figure: a Training Engine and a Test Engine, implemented in Spark.]
Technique based on Google Correlate

The approximation provides orders of magnitude of speedup compared to equivalent brute-force methods. The technique works best for highly correlated items and uses a series of data projections, unsupervised learning, and vector quantization to provide dimensionality reduction for incoming complex vectors.

