Hiring data scientists sure is expensive. One way to afford top talent is to stop throwing your money away on costly "big data" software that over-promises and under-delivers. This talk will offer an opinionated definition of data science, argue why free & open source software is usually the right choice for data scientists, and describe some of the leading free & open source software tools for data science available today.
Put Down That Checkbook! - Big Data without the Big Bucks
2. Put Down That Checkbook!
Big Data without the Big Bucks
Charlie Greenbacker
Director of Data Science
Altamira Technologies Corporation
3. Agenda
- What is a Data Scientist?
- Why use Open Source Software (OSS)?
- Survey of OSS Tools for Data Science
12. “A data scientist is someone who understands
the domains of programming, machine learning,
data mining, statistics, and hacking”
Paul Cooper, ITProPortal.com
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
13. Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
Statistical Analysis
Data Mining
Machine Learning
Natural Language Processing
Social Network Analysis
Data Visualization
Domain Knowledge & Communication Skills
etc.
Altamira Technologies Corporation 2014
21. Statistical Analysis
Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:
- Language & environment for statistical computing & viz
- Linear and nonlinear modeling, classical statistical tests, time-series
  analysis, graphical techniques, and more…
- 5000+ packages available in CRAN repository
22. Data Mining
Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:
- Data analysis workflow in Python
- DataFrame object for fast manipulation & indexing
- Tools for reading & writing data between formats
- Label-based slicing, indexing, and subsetting of data
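
The features above can be sketched in a few lines. This is a minimal illustration, not material from the talk: the city names and population figures are invented, and the CSV round trip uses an in-memory buffer rather than a real file.

```python
import io

import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame(
    {"city": ["Austin", "Boston", "Denver"], "population_k": [950, 650, 710]}
).set_index("city")

# Label-based indexing: pick one cell by row label and column label.
boston = df.loc["Boston", "population_k"]

# Label-based subsetting: keep only the rows matching a condition.
large = df.loc[df["population_k"] > 700]

# Reading & writing between formats: write CSV to a buffer, then read it back.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf, index_col="city")
```

The same `to_*` / `read_*` pattern applies to the other formats pandas supports, which is what makes it convenient for moving data between tools.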
23. Data Mining
Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:
- MPP query engine implemented on Hadoop
- Low latency, high concurrency SQL & BI queries
- Same interfaces as Apache Hive, but ~24x faster
- Written in C++; does not use MapReduce
25. Machine Learning
Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:
- ML library for Python built on NumPy, SciPy, matplotlib
- Support for classification, clustering, dimensionality reduction,
  regression, model selection, preprocessing
- SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
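
As a rough sketch of how preprocessing, classification, and cross-validation fit together in Scikit-learn, the example below scales the bundled iris dataset and scores a k-NN classifier with 5-fold cross-validation. The dataset, the choice of k = 5, and the number of folds are assumptions made for illustration; none of them come from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset that ships with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Preprocessing + classification chained into one estimator.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Model selection: 5-fold cross-validation returns one accuracy per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Swapping `KNeighborsClassifier` for another estimator (for example an SVM) should leave the rest of the code unchanged, since estimators share the same fit/predict interface.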
26. Machine Learning + NLP
Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:
- Java-based “Machine Learning for Language Toolkit”
- Document classification, clustering, topic modeling, information
  extraction & sequence tagging, etc.
- Efficient implementation of LDA for topic modeling
27. Natural Language Processing
Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:
- Natural Language Toolkit for Python
- Built-in support for dozens of corpora & trained models
- Libraries for classification, tokenization, stemming, tagging, parsing, and
  semantic reasoning
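
Tokenization and stemming, two of the features listed above, can be tried without downloading any corpora or trained models. The sketch below assumes a regex-based tokenizer and the Porter stemmer are good enough for a first look; the sample sentence is invented for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, not taken from the talk.
text = "Data scientists love open tools."

# Regex-based tokenizer: splits into runs of word characters and punctuation.
tokens = wordpunct_tokenize(text)

# Porter stemmer: reduces each token to a crude root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
```

Tagging and parsing generally do need the extra models the slide mentions, which NLTK fetches separately via `nltk.download`.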
28. Natural Language Processing
Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:
- Suite of high-quality, Java-based NLP tools
- Includes POS tagger, named entity recognizer, parser, coreference
  resolution, sentiment analysis, SUTime, etc.
- Includes models for English, Chinese, Arabic, German
29. NLP + Geospatial Analysis
Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:
- Extracts location names from text, resolves to gazetteer
- Employs context-based geospatial entity resolution
- ~75% accuracy, processes 1M documents per hour
- Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
30. Social Network Analysis
Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:
- Python structures for graphs, digraphs, & multigraphs
- Support for creating, manipulating, & analyzing the structure, dynamics,
  & functions of complex networks
- Provides standard graph algorithms & analysis metrics
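
To make the features above concrete, here is a small sketch that builds an undirected graph and runs one standard algorithm and two analysis metrics. The four-person network is invented for illustration and does not come from the talk.

```python
import networkx as nx

# Hypothetical toy social network: nodes are people, edges are ties.
G = nx.Graph()
G.add_edges_from(
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("bob", "dan")]
)

# Standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "ann", "dan")

# Analysis metrics: betweenness centrality and clustering coefficient,
# each returned as a dict keyed by node.
betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)

# The node sitting on the most shortest paths between other nodes.
broker = max(betweenness, key=betweenness.get)
```

In this toy graph every path from "ann" to anyone else runs through "bob", so "bob" ends up as the broker. Using `nx.DiGraph` or `nx.MultiGraph` instead of `nx.Graph` gives the directed and multi-edge structures the slide mentions.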
31. Social Network Analysis
Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:
- Network analysis and visualization package for Java
- Dynamic network analysis with temporal filtering
- Metrics include: community detection, betweenness, closeness,
  clustering coefficient, PageRank, etc.
32. Data Visualization
Name: D3.js
Creator: Mike Bostock
License: BSD 3-Clause License
Website: d3js.org
Source: github.com/mbostock/d3
Features:
- JavaScript library based on HTML, SVG, and CSS
- Binds data to DOM & enables transformations
- ~200 examples, including: force-directed graphs, choropleths,
  treemaps, dendrograms, animations, etc.
33. Fusion, Analysis, and Visualization
Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:
- Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
- Integrates structured data, text, images, video
- Cell-level security & access controls
- Live, shared collaborative workspaces
35. Final Thought…
Save your $$$ for:
People
- salaries, training, etc.
Resources
- hardware, AWS, etc.
Proprietary software
- if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)