predictive analytics!
the future of predicting the future

             Vedant Misra
        vedant.misra@gmail.com

         Boston BarCamp 2011
the big picture
 We are witnessing a data explosion.

   "Everywhere you look, the quantity of
    information in the world is soaring.
According to one estimate, mankind created
       150 exabytes of data in 2005.
  This year, it will create 1,200 exabytes."

              The Data Deluge. The Economist, Feb 25, 2010.


                         P.S. 1 exabyte is 1 million terabytes.
the big picture
We are witnessing a data explosion.

 "we create as much information* in two
days now as we did from the dawn of man
             through 2003"
                             -Larry Page, CEO, Google


                *This is mostly lolcats and duckface photos.
the problem
     data


  information


  knowledge
modus operandi
1. Ingest data
    • structured
    • unstructured
2. Digest data
    • NLP
    • entity extraction
3. Spit data back up
    • visualization
    • federated search
the state of the art



    Omniture, Stratify, Jedox, Bime, Kosmix, I2, SpotFire, Quid,
Scoremind, Birst, Predixion Software, PivotLink, GoodData, Endeca,
         FSI, Informatica, IBM, Kofax, SPSS, Data Applied,
 Mathematica, Matlab, Octave, R, Stata, Statistica, ROOT, Geant,
  Attensity360, Sysomos, SAS, ISS CIDNE, Centrifuge Systems,
Prediction Company, CASA, Info Mesa, FreeBase, YouCalc, Inxight
Palantir.
Digital Reasoning
IBM DeepQA
ingesting data
• structured information
    • explicitly defined format
    • relationships are clear
    • CSVs, relational databases, XLS
• unstructured information
    • no data model
    • mixed text, numbers, figures
    • emails, webpages, books, health records, call logs, phone recordings, video footage
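A rough sketch of the structured/unstructured split: a CSV (explicit format) and a free-text note (no data model) are loaded into one common record type. The `Record` class, its field names, and the sample data are hypothetical, made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical common container for ingested items."""
    source: str
    kind: str                                  # "structured" or "unstructured"
    fields: dict = field(default_factory=dict)
    text: str = ""


def ingest_csv(name, csv_text):
    """Structured input: the header row defines the format explicitly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Record(source=name, kind="structured", fields=dict(row)) for row in reader]


def ingest_text(name, raw_text):
    """Unstructured input: no data model, so keep the raw text for later digestion."""
    return [Record(source=name, kind="unstructured", text=raw_text.strip())]


# Made-up sample data.
records = ingest_csv("calls.csv", "caller,duration\nalice,42\nbob,7\n")
records += ingest_text("note.txt", "  Alice called Bob from Boston on Friday.  ")
```

Note that the CSV values arrive as strings; typing them is a separate step.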
digesting data
• Do NLP
    • tokenize
    • determine POS
    • lemmatize
• Extract entities
• Categorize entities using a dynamic ontology
• Geographical tagging
• Associative net
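A toy sketch of two of the steps above, tokenizing and extracting entities, using only regular expressions. The capitalized-word heuristic is an assumption for illustration and over-generates (it also picks up sentence-initial words); POS tagging and lemmatization need a real NLP library and are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


def extract_entities(text):
    """Toy heuristic: treat runs of capitalized words as candidate entities."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)


# Made-up sample sentence.
text = "Larry Page spoke in Boston. Boston is in Massachusetts."
tokens = tokenize(text)
entities = Counter(extract_entities(text))
```

Counting the candidates gives a first, crude signal of which entities matter in a document.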
spitting up data




• powerful visualizations
• federated search
    • geospatial, spatial, temporal
    • persistent background search (alerts)
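A minimal sketch of federated search, assuming each source is just a list of strings: the query is fanned out to every source and the hits are merged into one ranking. The source names, documents, and term-count scoring are hypothetical; a real system would query separate indexes and normalize scores across them.

```python
def search_source(name, documents, query):
    """Return (source, document, score) hits; score counts query-term occurrences."""
    terms = query.lower().split()
    hits = []
    for doc in documents:
        words = doc.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((name, doc, score))
    return hits


def federated_search(sources, query):
    """Fan the query out to every source and merge the hits into one ranking."""
    merged = []
    for name, documents in sources.items():
        merged.extend(search_source(name, documents, query))
    return sorted(merged, key=lambda hit: hit[2], reverse=True)


# Made-up sources.
sources = {
    "emails": ["meeting in boston on friday", "lunch plans"],
    "call_logs": ["boston office called boston warehouse"],
}
results = federated_search(sources, "boston")
```

A persistent background search (alert) would amount to re-running `federated_search` whenever a source changes and reporting new hits.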
complications
• high-resolution access control
• source, date, location, and other metadata for tracking pedigree and lineage
• adding insight and new data back into data layer
• air-gapped networks
• revisioning databases
• real-time hypothesis and intuition sharing
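One way to picture the first three items together, as a sketch under assumptions: every item in the data layer carries its source, date, and an access label, and an insight added back points at the item it was derived from. The `Fact` class, the integer clearance levels, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """Hypothetical data-layer item carrying provenance and an access label."""
    value: str
    source: str
    recorded: date
    clearance: int                          # minimum clearance level needed to read
    derived_from: Optional["Fact"] = None   # link used to trace lineage


def lineage(fact):
    """Walk back through derived_from links and list every contributing source."""
    chain = []
    while fact is not None:
        chain.append(fact.source)
        fact = fact.derived_from
    return chain


def visible(facts, user_clearance):
    """Per-item access control: filter individual facts, not whole datasets."""
    return [f for f in facts if f.clearance <= user_clearance]


# Made-up sample data: a raw item and an analyst insight written back on top of it.
raw = Fact("caller=alice", "call_logs", date(2011, 1, 1), clearance=2)
insight = Fact("alice is in Boston", "analyst_note", date(2011, 1, 2),
               clearance=1, derived_from=raw)
```

Here `lineage(insight)` lists the analyst note first and the call log it rests on second, and `visible` hides the raw item from a user whose clearance is too low.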
what's left?
• deep analytics: platforms that understand
• replacing IA with AI
• even fancier statistical methods
    naive Bayes classifier, support vector machine, kernel
        estimation, neural networks, k-nearest neighbor,
k-means clustering, kernel PCA, hierarchical clustering, linear
  regression, neural networks, gaussian process regression,
    principal component analysis, independent component
 analysis, hidden Markov models, maximum entropy Markov
             models, Kalman filters, particle filters,
     Bayesian networks, Markov random fields, bootstrap
              aggregating, ensemble averaging...
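To make one entry from the list above concrete, here is a minimal k-nearest neighbor classifier in plain Python. The training points and labels are made up, and the function name `knn_predict` is only illustrative; it needs Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


def knn_predict(train, point, k=3):
    """Classify `point` by majority vote among its k nearest training examples."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated clusters labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.1, 5.0))
```

There is no training step at all: the "model" is the stored data, which is why the method gets slow as the data explodes.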
what's left?
• more science of prediction:
    • modelling and validation
    • genetic algorithms for finding symbolic expressions
• when are systems unpredictable?
• describing groups with game theory
• when is individual behavior important?
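A small sketch of the idea of evolving symbolic expressions to fit data. It is a simplified, mutation-only evolutionary search with elitism, not a full genetic algorithm (there is no crossover), and the operator set, population size, and target function are all assumptions chosen for illustration.

```python
import random

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}


def random_expr(rng, depth=2):
    """Build a random expression tree over x, small constants, and OPS."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else float(rng.randint(0, 3))
    op = rng.choice(sorted(OPS))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))


def evaluate(expr, x):
    """Evaluate a tree: 'x' is the variable, floats are constants, tuples are operations."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if expr == "x" else expr


def error(expr, samples):
    """Sum of squared errors against the (x, y) samples; lower is better."""
    total = 0.0
    for x, y in samples:
        diff = evaluate(expr, x) - y
        total += diff * diff
    return total


def mutate(expr, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(expr, tuple) or rng.random() < 0.3:
        return random_expr(rng)
    op, left, right = expr
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))


def evolve(samples, generations=30, size=40, seed=0):
    """Return the best expression found and the best error seen per generation."""
    rng = random.Random(seed)
    population = [random_expr(rng) for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda e: error(e, samples))
        history.append(error(population[0], samples))
        survivors = population[: size // 4]    # elitism: the best quarter survives unchanged
        children = [mutate(rng.choice(survivors), rng) for _ in range(size - len(survivors))]
        population = survivors + children
    best = min(population, key=lambda e: error(e, samples))
    return best, history + [error(best, samples)]


# Hypothetical target: recover y = x*x + x from seven sample points.
samples = [(x, x * x + x) for x in range(-3, 4)]
best, history = evolve(samples)
```

Because the best expressions survive each generation unchanged, the best error in `history` never increases; whether the search recovers the exact formula depends on the seed and settings.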
thanks!
