Sandish Kumar H N
sanysandish@gmail.com
+919008990742, skype: sandishhadoop
SUMMARY:
I'm a Senior Big Data Consultant with 4+ years of experience across multiple business domains, working with the latest technologies and platforms. My focus is on Cloud, Big Data, Machine Learning, Data Science, and Data Mining. I am a skilled developer and architect with strong problem-solving, debugging, and analytical capabilities, who creates the technical vision and actively engages in understanding customer requirements. I focus particularly on software performance and efficiency. I am results-oriented and hands-on, and I skillfully balance resource and time constraints while doing it right.
• Strong experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL & DataFrames, Spark Streaming, Apache Storm, and Kafka.
• Experience in building data pipelines using Big Data technologies.
• Hands-on experience in writing MapReduce programs and user-defined functions for Hive and Pig.
• Experience in NoSQL technologies like HBase and Cassandra.
• Excellent understanding and knowledge of Hadoop (Gen-1 and Gen-2) and its various components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and ResourceManager (YARN).
• Excellent understanding and knowledge of NoSQL databases like HBase and Cassandra.
• Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and from RDBMS to HDFS.
• Proficient at using Spark APIs to cleanse, explore, aggregate, transform, and store machine sensor data.
• Configured a 20-30 node (Amazon EC2 spot instance) Hadoop cluster to transfer data between Amazon S3 and HDFS and to serve as direct input and output for the Hadoop MapReduce framework.
• Hands-on experience with systems-building languages such as Scala and Java.
• Hands-on experience with message brokers such as Apache Kafka and RabbitMQ.
• Worked extensively with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
• Implemented Hadoop-based data warehouses and integrated Hadoop with Enterprise Data Warehouse systems.
• Built real-time Big Data solutions using HBase, handling billions of records.
• Involved in designing the data model in Hive for migrating the ETL process into Hadoop, and wrote Pig scripts to load data into the Hadoop environment.
• Expertise in writing Hive UDFs and Generic UDFs to incorporate complex business logic into Hive queries for high-level data analysis (a minimal sketch follows this list).
• Worked on the Spark Machine Learning library for recommendations, coupon recommendations, and a rules engine.
• Experience in working with various Cloudera distributions (CDH4/CDH5), with knowledge of the Hortonworks and Amazon EMR Hadoop distributions.
• Experience in administering large-scale Hadoop environments, including design, configuration, installation, performance tuning, and monitoring of the cluster using Cloudera Manager and Ganglia.
• Experience in Object-Oriented Analysis and Design (OOAD) and development of software using UML methodology; good knowledge of J2EE design patterns and Core Java design patterns.
• Experience in designing both time-driven and data-driven automated workflows using Oozie.
• Experience in writing UNIX shell scripts.
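As an illustration of the Hive UDF work noted above, here is a minimal sketch of a simple Hive UDF written in Scala; the class name and the masking rule are hypothetical examples, not code from the projects below.
// Minimal sketch of a Hive UDF written in Scala (hive-exec on the classpath).
// MaskUdf and its masking rule are illustrative assumptions.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text
// Masks all but the last four characters of a string column.
class MaskUdf extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) return null
    val s = input.toString
    val masked = if (s.length <= 4) s else "*" * (s.length - 4) + s.takeRight(4)
    new Text(masked)
  }
}
Packaged into a JAR, such a UDF would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION and then called like any built-in function in a query.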
CERTIFICATION:
• Databricks Certified Apache Spark Developer
• MapR Certified Apache Hadoop Developer
• DataStax Certified Apache Cassandra Developer
TECHNICAL HIGHLIGHTS
• Big Data Technologies: Hadoop (Hortonworks, Cloudera, MapR), Spark, Spark Streaming, Spark SQL, Spark ML, MapReduce, HDFS, Cassandra, Storm, Apache Kafka, Flume, Oozie, Solr, ZooKeeper, Tez, Data Modelling, Pig, Hive, Impala, Drill, Sqoop, and RabbitMQ.
• NoSQL Databases: HBase, Cassandra
• SQL / Query Engines: Hive, Pig, PrestoDB, Impala, Spark SQL
• Search: HSearch, Apache Blur, Lucene, Elasticsearch, Nutch
• Programming Languages: Java, Scala; basics of Python and Clojure
• Cloud Platforms: Amazon Web Services (EC2, Amazon Elastic MapReduce, Amazon S3), Google Cloud Platform (BigQuery, App Engine, Compute Engine, Cloud SQL), Rackspace (CDN, Servers, Storage), Linode Manager
• Monitoring and Reporting: Ganglia, Nagios, custom shell scripts, Tableau, D3.js, Google Charts
• Data: e-commerce, social media, logs and click-event data, next-generation genomic data, oil & gas, health care, travel
• Other: HTML, JavaScript, Ext JS, CSS, jQuery
WORK EXPERIENCE:
Senior Big Data Consultant
Third Eye Consulting Services & Solutions - Bangalore, India Dec 2013 to present
Project Name: Yardstick-Spark (Open Source)
Client: GridGain
Project Description: Yardstick-Spark is a set of Apache Spark and Apache Ignite comparative benchmarks written on top of the Yardstick framework.
Responsibilities:
• Wrote a Spark core RDD application that reads 1 billion auto-generated records and compares against IgniteRDD within the Yardstick framework, to measure the performance of Apache Ignite RDDs versus Apache Spark RDDs (see the sketch following this list).
• Wrote a Spark DataFrame application that reads 10 million Twitter records from HDFS and analyzes them within the Yardstick framework, to measure the performance of Apache Ignite SQL versus Apache Spark DataFrames.
• Wrote a Spark Streaming application that reads streaming Twitter data and analyzes the records in real time within the Yardstick framework, to measure the performance of Apache Ignite Streaming versus Apache Spark Streaming.
• Implemented test cases for Spark and Ignite functions using Scala.
• Hands-on experience setting up a 10-node Spark cluster on Amazon Web Services using the spark-ec2 script.
• Implemented D3.js and Tableau charts to show the performance difference between Apache Ignite and Apache Spark.
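Below is a simplified sketch of the Spark side of such a benchmark. The record generator, partition count, and timing harness are illustrative stand-ins; the real benchmarks are driven and measured by the Yardstick framework.
// Simplified Scala sketch of a Spark RDD benchmark (Spark 1.x era).
// Data generation and timing here stand in for the Yardstick-driven runs.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
object RddBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-benchmark-sketch"))
    // Auto-generate records; the real runs used ~1 billion, kept small here.
    val records = sc.parallelize(1L to 10000000L, 64).map(i => (i % 1000, i))
    // Time a representative aggregation over the RDD.
    val start = System.nanoTime()
    val groups = records.reduceByKey(_ + _).count()
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"groups=$groups elapsed=${elapsedMs}ms")
    sc.stop()
  }
}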
Environment: Spark, Spark Core, DataFrames, Spark Streaming, Scala, HDFS, Apache Ignite, Yardstick tool, D3.js, Tableau, AWS, 10 million Twitter data records and 1 billion auto-generated records.
Project Name: E-Commerce Data Pipe Line
Client: Obsessory.com
Project Description: Obsessory is a technology company that provides a web and mobile platform to assist shoppers in discovery, search, comparison, and tracking of items across the Internet. Obsessory's powerful search engine catalogs millions of products from online stores on a daily basis and uses proprietary algorithms to enhance the depth and breadth of the user's search. Obsessory employs adaptive and social learning to continuously refine the search results and present the user with the most relevant selection of items. Furthermore, Obsessory helps users keep track of desired items across the Internet and get notified of price changes and availability of tracked items, as well as sales, store events, and promotions.
Responsibilities:
Pre-Processing:
• Crawled data from 100+ sites using Nutch.
• Maintained a fashion-based ontology.
• Used Scala, Spark, and its ecosystem to enrich the given data with the fashion ontology, and to validate and normalize the data.
• Designed the schema and data model, and wrote the logic to store all validated data in Cassandra using Spring Data Cassandra REST.
• Wrote programs for validation, normalization, and enrichment, plus a REST API driving a UI for manual QA validation; used Spark SQL and Scala to run QA SQL queries.
Indexing:
MapReduce programs on top of HBase:
• To standardize the input merchant data
• To upload images to the Rackspace CDN
• To index the given data sets into HSearch
• To extract color information, including density, from images
• To persist the data into HBase tables
• The above MR jobs run on schedules and are bucketed.
Color-Obsessed:
Using image color and density data, users can select one or more colors with different densities; the result is a list of products whose images contain all of the given colors at the requested densities. This was implemented on top of HBase, with a Spring REST web service exposing the color-obsessed search API (a sketch follows).
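A hedged sketch of how such a lookup can be expressed against HBase is shown below; the table name, the row-key layout (color#density#productId), and the intersection logic are assumptions made for illustration, and the production version sat behind the Spring REST service.
// Hypothetical sketch of a color + density product lookup on HBase (1.x client API).
// Table name and row-key layout ("<color>#<density>#<productId>") are assumptions.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._
object ColorSearchSketch {
  // Returns product ids whose images contain ALL requested (color, density) pairs.
  def findProducts(requested: Seq[(String, Int)]): Set[String] = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("product_color_index"))
    try {
      val perColor = requested.map { case (color, density) =>
        val scan = new Scan()
        scan.setFilter(new PrefixFilter(Bytes.toBytes(s"$color#$density#")))
        table.getScanner(scan).asScala
          .map(r => Bytes.toString(r.getRow).split("#").last) // productId suffix
          .toSet
      }
      perColor.reduceOption(_ intersect _).getOrElse(Set.empty)
    } finally { table.close(); conn.close() }
  }
}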
Post-Processing:
• Set up the Spark Streaming and Kafka clusters.
• Developed a Spark Streaming Kafka application to process Hadoop job logs (see the sketch below).
• Kafka producer to send all slaves' logs to the Spark Streaming application.
• Spark Streaming application to process the logs against given rules and report bad images, bad records, missed records, etc.
• Spark Streaming application to collect user-action data from the front end.
• Kafka-producer-based REST API to collect user events and send them to the Spark Streaming application.
• Hive queries to generate stock alerts, price alerts, popular-product alerts, and new arrivals for each user, based on likes, favorites, and share counts.
• Worked on the Spark ML library for recommendations, coupon recommendations, and a rules engine.
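A minimal sketch of the log-processing stream described above, using the receiver-based Kafka API from the Spark 1.x era; the ZooKeeper address, consumer group, topic name, filtering rules, and sink are assumptions for illustration.
// Minimal Scala sketch of a Spark Streaming + Kafka log processor (Spark 1.x API).
// ZooKeeper quorum, group id, topic, rules, and sink are illustrative assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("log-stream-sketch"), Seconds(10))
    // Receiver-based Kafka stream of raw log lines from the worker nodes.
    val lines = KafkaUtils
      .createStream(ssc, "zk-host:2181", "log-processors", Map("hadoop-job-logs" -> 1))
      .map(_._2)
    // Simple rules flagging bad images, bad records, and missed records.
    val flagged = lines.filter(l =>
      l.contains("BadImage") || l.contains("ParseError") || l.contains("MissedRecord"))
    // Stand-in sink: print a sample; the real app wrote results to downstream stores.
    flagged.foreachRDD(rdd => rdd.take(20).foreach(println))
    ssc.start()
    ssc.awaitTermination()
  }
}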
Environment: HSearch (HBase + Lucene), Cassandra, Hive, Spark (Core, SQL, ML, Streaming), Hadoop, MapReduce, Amazon Web Services, Linode, CDN, Scala, Java, affiliate feeds (Rakuten, CJ, Affiliate Window, Webgains).
Project Name: Cimbal/MobApp Pay
Client: Intel
Project Description: Cimbal is a mobile promotion and payment network designed to increase business sales and deliver targeted deals to consumers.
Responsibilities:
• Wrote MapReduce programs to validate the data.
• Wrote more than 50 Spring Data HBase REST APIs in Java.
• Designed the HBase schema and cleaned the data.
• Wrote Hive queries for analytics on user data.
Environment: Hadoop MapReduce, HBase, Spring Data REST web services, CDH, users' payment data
Project Name: Truck Events Analysis
Client: HortonWorks
Project Description: The Trucking business is a high-risk business in which truck drivers venture into remote
areas, often in harsh weather conditions and chaotic traffic on a daily basis. Using this solution illustrating
Modern Data Architecture with Hortonworks Data Platform, we have developed a centralized management
system that can help reduce risk and lower the total cost of operations.
Responsibilities:
• Wrote a simulator to emit events based on the NYC DOT data file.
• Wrote a Kafka producer to send events to a Kafka topic consumed by a Storm Kafka spout.
• Wrote a Storm topology to accept events from the Kafka spout and process them (see the sketch below).
• Wrote Storm bolts to emit data into HBase, HDFS, and RabbitMQ Web STOMP.
• Hive queries to map truck events data to weather data and traffic data.
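A hedged sketch of how such a topology is wired up, using the Storm 0.9-era Kafka spout API; the ZooKeeper host, topic name, parallelism, and bolt body are assumptions, and the HBase, HDFS, and Web STOMP bolts are elided.
// Hypothetical wiring of a Kafka -> Storm topology (Storm 0.9-era API).
// ZooKeeper host, topic, and the bolt body are illustrative assumptions.
import backtype.storm.{Config, StormSubmitter}
import backtype.storm.spout.SchemeAsMultiScheme
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.tuple.Tuple
import storm.kafka.{KafkaSpout, SpoutConfig, StringScheme, ZkHosts}
// Stand-in bolt: parses a truck event; the HBase/HDFS/Web STOMP writers are elided.
class ParseEventBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    val event = tuple.getString(0) // raw line emitted by the simulator
    // parse/enrich here, then hand off to the downstream bolts
  }
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}
object TruckEventsTopologySketch {
  def main(args: Array[String]): Unit = {
    val spoutConf = new SpoutConfig(new ZkHosts("zk-host:2181"), "truck-events", "/kafka", "truck-spout")
    spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme())
    val builder = new TopologyBuilder
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 2)
    builder.setBolt("parse", new ParseEventBolt(), 4).shuffleGrouping("kafka-spout")
    StormSubmitter.submitTopology("truck-events", new Config, builder.createTopology())
  }
}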
Environment: Hadoop, HDFS, Hive, HBase, Kafka, Storm, RabbitMQ Web STOMP, Google Maps, New York City truck routes from NYC DOT; truck events data generated using a custom simulator; weather data collected using APIs from Forecast.io; traffic data collected using APIs from MapQuest.
Project Name: Comparative Analysis of Big Data Analytical Tools (Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, and PrestoDB running on Google Cloud and AWS)
Client: ThirdEyeCss.com
Responsibilities:
• Installation of Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, PrestoDB, Hadoop, Cloudera CDH, and Hortonworks HDP.
• Schema design for the data sets on Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, and PrestoDB.
• Query design for the given data set.
• Debugging on Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, PrestoDB, Hadoop, Cloudera CDH, and Hortonworks HDP.
• Timing comparison across Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, and PrestoDB (see the sketch below).
• Timing comparison between the different cloud platforms.
• Web-based visualization of the timing metrics using Google Charts.
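A hedged sketch of the timing-harness idea: run the same query and record wall-clock time. It is shown here only for Spark SQL (HiveContext, Spark 1.x era); the other engines were timed through their own shells and clients, and the table and query are illustrative.
// Sketch of a wall-clock timing harness around a Spark SQL (HiveContext) query.
// The tweets table and the query text are illustrative assumptions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
object QueryTimingSketch {
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("query-timing-sketch"))
    val hive = new HiveContext(sc)
    timed("sparksql: tweets per language") {
      hive.sql("SELECT lang, COUNT(*) AS c FROM tweets GROUP BY lang ORDER BY c DESC").collect()
    }
    sc.stop()
  }
}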
Environment: Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, PrestoDB, Hadoop, Cloudera CDH, Hortonworks HDP, Google Cloud Platform, Amazon Web Services, Twitter streaming data
Senior Big Data Consultant
Positive Bioscience - Mumbai, India Jan-2013 to Dec-2013
Project Name: Next Generation DNA Sequencing Analysis
Client: Positive Bioscience
Responsibilities:
• Developed a Hadoop MapReduce program to perform sequence alignment on NGS data (a compact illustration of the alignment core follows this list).
• The MapReduce program implements algorithms such as the Burrows-Wheeler Transform (BWT), the Ferragina-Manzini Index (FMI), and the Smith-Waterman dynamic programming algorithm, using the Hadoop distributed cache.
• Design and development of software for bioinformatics and Next Generation Sequencing (NGS) in the Hadoop MapReduce framework and Cassandra, using Amazon S3, Amazon EC2, and Amazon Elastic MapReduce (EMR).
• Developed a Hadoop MapReduce program to perform custom quality checks on genomic data. Novel features of the program included the capability to handle file-format/sequencing-machine errors, automatic detection of the baseline PHRED score, and being platform agnostic (Illumina, 454 Roche, Complete Genomics, ABI SOLiD input format data).
• Configured and ran all MapReduce programs on a 20-30 node cluster (Amazon EC2 spot instances) with Apache Hadoop-1.4.0 to handle 600 GB/sample of NGS genomics data.
• Configured a 20-30 node (Amazon EC2 spot instance) Hadoop cluster to transfer data between Amazon S3 and HDFS and to serve as direct input and output for the Hadoop MapReduce framework.
• Successfully ran all Hadoop MapReduce programs on the Amazon Elastic MapReduce framework, using Amazon S3 for input and output.
• Developed Java RESTful web services to upload data from local storage to Amazon S3, list S3 objects, and perform file manipulation operations.
• Developed MapReduce programs to perform quality checks, sequence alignment, SNP calling, and SV/CNV detection on single-end/paired-end NGS data.
• Designed and migrated an RDBMS (SQL) database to a NoSQL Cassandra database.
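As a compact illustration of the alignment core referenced above, here is a Smith-Waterman local-alignment scorer in Scala; the scoring parameters are common defaults, not those of the production MapReduce program, which also relies on BWT/FM-index and the distributed cache.
// Illustrative Smith-Waterman local-alignment score in Scala; scoring values are
// common defaults, not the parameters used in the production MapReduce program.
object SmithWatermanSketch {
  def score(a: String, b: String,
            matchScore: Int = 2, mismatch: Int = -1, gap: Int = -2): Int = {
    val h = Array.ofDim[Int](a.length + 1, b.length + 1)
    var best = 0
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val diag = h(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) matchScore else mismatch)
      h(i)(j) = math.max(0, math.max(diag, math.max(h(i - 1)(j) + gap, h(i)(j - 1) + gap)))
      best = math.max(best, h(i)(j))
    }
    best
  }
  def main(args: Array[String]): Unit =
    println(score("ACACACTA", "AGCACACA")) // best local alignment score for two reads
}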
Hadoop Developer
PointCross.com - Bangalore, India Nov-2011 to Jan-2012
Project Name: DDSR (Drilling Data Search and Repository)
This project provides analytics for oil and gas exploration data. The DDSR repository is built using HBase, Hadoop, and its sub-projects. We collect data from thousands of wells across the globe; this data is stored in HBase and Hive using Hadoop MapReduce jobs. On top of this data we build analytics for search and advanced search.
Project Name: Seismic Data Server & Repository (SDSR)
Our Seismic Data Server & Repository solves the problem of delivering, on demand, precisely cropped SEG-Y files for instant loading at geophysical interpretation workstations anywhere in the network. Based on Hadoop file storage, HBase, and MapReduce technology, the Seismic Data Server brings fault-tolerant, petabyte-scale storage capability to the industry. The Seismic Data Server supports post-stack traces now, with pre-stack support to be released shortly.
ADDITIONAL INFORMATION:
EDUCATION:
• Bachelor of Engineering in Computer Science and Engineering, VTU, Bangalore, Karnataka, India, 2011.
• Diploma in Computer Science and Engineering, University of KAE, Bangalore, Karnataka, India, 2008.
REFERENCE LINKS:
• LinkedIn: https://in.linkedin.com/in/sandishkumar
• Twitter: https://twitter.com/sandishsany
• GitHub: https://github.com/SandishHadoop
• Skype: sandishhadoop