Put Down That Checkbook! - Big Data without the Big Bucks
Charlie Greenbacker
Director of Data Science
Altamira Technologies Corporation
Agenda
- What is a Data Scientist?
- Why use Open Source Software (OSS)?
- Survey of OSS Tools for Data Science
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Best reason for
not finishing PhD
@ExploreAltamira
WHAT IS A DATA SCIENTIST?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
“A data scientist is someone who understands
the domains of programming, machine learning,
data mining, statistics, and hacking”
Paul Cooper, ITProPortal.com
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
Statistical Analysis
Data Mining
Machine Learning
Natural Language Processing
Social Network Analysis
Data Visualization
Domain Knowledge & Communication Skills
etc.
Altamira Technologies Corporation 2014
WHY USE OSS?
What is Open Source Software (OSS)?
The Open Source Definition:
1. Free Redistribution
2. Source Code
3. Derived Works
more: opensource.org
WHY USE OSS?
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING,
YOU DAMN WELL BETTER USE IT.
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE.
SURVEY OF OSS TOOLS
FOR DATA SCIENCE
Statistical Analysis
Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:
- Language & environment for statistical computing & viz
- Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
- 5000+ packages available in CRAN repository
Data Mining
Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:
- Data analysis workflow in Python
- DataFrame object for fast manipulation & indexing
- Tools for reading & writing data between formats
- Label-based slicing, indexing, and subsetting of data
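
The Pandas features listed above can be sketched in a few lines. This is only an illustrative example: the sample data and the column names (`city`, `year`, `sales`) are made up, and the deck itself does not show any code.

```python
import io

import pandas as pd

# Hypothetical sample data; the columns are invented for illustration.
csv_text = """city,year,sales
Austin,2013,120
Austin,2014,150
Boston,2013,90
Boston,2014,110
"""

# Read CSV text into a DataFrame and index the rows by city label.
df = pd.read_csv(io.StringIO(csv_text)).set_index("city")

# Label-based selection: every row labelled "Austin", "sales" column only.
austin_sales = df.loc["Austin", "sales"]

# Boolean subsetting: keep only the 2014 rows.
sales_2014 = df[df["year"] == 2014]

# Aggregate sales per city label.
total_by_city = df.groupby(level="city")["sales"].sum()

# Write the subset out in another format (JSON records).
json_text = sales_2014.reset_index().to_json(orient="records")
```

Here `austin_sales` is a Series because the label "Austin" appears on more than one row, and `json_text` shows the same table converted from CSV to JSON.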
Data Mining
Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:
- MPP query engine implemented on Hadoop
- Low latency, high concurrency SQL & BI queries
- Same interfaces as Apache Hive, but ~24x faster
- Written in C++; does not use MapReduce
Machine Learning
Name: Mahout
Creator: ASF
License: Apache License 2.0
Website: mahout.apache.org
Source: svn.apache.org/viewvc/mahout
Features:
- Distributed/scalable ML library for Hadoop
- Classification, Clustering, Collaborative filtering
- Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:
- ML library for Python built on NumPy, SciPy, matplotlib
- Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
- SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
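
A minimal sketch of how several of these pieces (preprocessing, PCA, SVM, cross-validation) fit together in Scikit-learn. The choice of the bundled iris dataset, two principal components, and five folds is an assumption made for illustration, not something taken from the deck.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small dataset that ships with scikit-learn (assumed here for convenience).
X, y = load_iris(return_X_y=True)

# Chain preprocessing, dimensionality reduction, and an SVM classifier.
model = Pipeline([
    ("scale", StandardScaler()),      # preprocessing
    ("pca", PCA(n_components=2)),     # dimensionality reduction
    ("svm", SVC(kernel="rbf")),       # classification
])

# 5-fold cross-validation; each fold refits the whole pipeline.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Wrapping the steps in a `Pipeline` means the scaler and PCA are fit only on each training fold, which avoids leaking information from the held-out fold.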
Machine Learning + NLP
Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:
- Java-based “Machine Learning for Language Toolkit”
- Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
- Efficient implementation of LDA for topic modeling
Natural Language Processing
Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:
- Natural Language Toolkit for Python
- Built-in support for dozens of corpora & trained models
- Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
Natural Language Processing
Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:
- Suite of high-quality, Java-based NLP tools
- Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
- Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:
- Extracts location names from text, resolves to gazetteer
- Employs context-based geospatial entity resolution
- ~75% accuracy, processes 1M documents per hour
- Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:
- Python structures for graphs, digraphs, & multigraphs
- Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
- Provides standard graph algorithms & analysis metrics
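
A short sketch of the graph structures and metrics listed above, using NetworkX. The five-person network and its node names are hypothetical and exist only to give the algorithms something to run on.

```python
import networkx as nx

# Hypothetical undirected friendship network, invented for illustration.
G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("bob", "dan"),
    ("dan", "eve"),
])

# Basic structure: number of neighbours per node.
degree = dict(G.degree())

# Standard analysis metric: how often a node lies on shortest paths.
betweenness = nx.betweenness_centrality(G)

# Standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "ann", "eve")
```

In this toy network "bob" and "dan" are the best-connected nodes, and the only shortest route from "ann" to "eve" runs through both of them.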
Social Network Analysis
Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:
- Network analysis and visualization package for Java
- Dynamic network analysis with temporal filtering
- Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
Data Visualization
Name: D3.js
Creator: Mike Bostock
License: BSD 3-Clause License
Website: d3js.org
Source: github.com/mbostock/d3
Features:
- JavaScript library based on HTML, SVG, and CSS
- Binds data to DOM & enables transformations
- ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:
- Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
- Integrates structured data, text, images, video
- Cell-level security & access controls
- Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
People
- salaries, training, etc.
Resources
- hardware, AWS, etc.
Proprietary software
- if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
FINAL
THOUGHT
Springer’s
open source software for data scientists
oss4ds.com
Charlie Greenbacker
@greenbacker | oss4ds.com
