Runs scored by Players Analysis
with Flume and Pig
Nitesh Ghosh
Contents
Problem Statement
Solution Architecture
Software and Tools Specification
Solution Description
Program Code
Conclusion
Problem Statement
Data Set: This data set records runs scored by players in different countries in different years. Let's assume some
external process writes the data into a directory in CSV format, with the columns shown below:
Problem Statement:
Assume data is copied periodically into the /home/cloudera/runs directory. Write a Flume configuration to copy
this data to HDFS, and then write a Pig script to process the data and find the sum of runs scored and balls
played by each player.
Solution Architecture
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and
fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple, extensible data model that allows for online analytic applications.
The diagram above shows a high-level view of how the Apache Flume agent moves data to HDFS using its three
components: source, channel, and sink. Once the data is loaded into HDFS, we analyze it with Apache Pig.
Apache Flume is a data ingestion system that is configured by defining the endpoints of a data flow, called
sources and sinks. In Flume, each individual piece of data is called an event. Sources produce events and send
them through a channel, which connects the source to the sink. The sink then writes the events out to a
predefined location.
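The source, channel, and sink roles described above can be illustrated with a small Python sketch. This is not Flume's API: the `Event`, `Channel`, `run_source`, and `run_sink` names are hypothetical, the sample CSV lines are made up, and a plain list stands in for the HDFS destination.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """One unit of data moving through the flow (here, one CSV line)."""
    body: str
    headers: dict = field(default_factory=dict)


class Channel:
    """Bounded buffer that sits between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


def run_source(lines, channel):
    """Source: turn each input line into an event and put it on the channel."""
    for line in lines:
        channel.put(Event(body=line))


def run_sink(channel, destination):
    """Sink: drain the channel and write each event body to the destination."""
    while (event := channel.take()) is not None:
        destination.append(event.body)


# Hypothetical sample lines; the list `written` stands in for a file in HDFS.
sample_lines = ["1,2010,India,Australia,50,60", "2,2011,England,India,30,45"]
channel = Channel(capacity=10)
written = []
run_source(sample_lines, channel)
run_sink(channel, written)
print(written)
```

The channel decouples the two ends: the source can keep accepting data while the sink drains at its own pace, which is the role the file channel plays in the configuration later in this document.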
Software and Tools Specification
• Oracle VirtualBox - Version 5.2.8 r121009 (Qt5.6.2)
• Ubuntu 16.04 LTS
• Apache Hadoop - Version 2.7.6 (Cluster Environment)
• Apache Hive - Version 2.3.3 (Set up on Edge Node)
• Apache Flume - Version 0.17.0
Solution Description
We need to set up HDFS from the Hadoop ecosystem so that other components such as Flume and Pig can work on it.
To do that, we download the files from the Apache website and install them on the Ubuntu machine. After a
successful installation, we can verify that Hadoop is installed correctly on the machine.
Program Code
Once HDFS is set up, we need to configure Flume and set up its configuration files.
Place the configuration file inside the flume/conf directory. We need to make two changes inside the .conf file, as follows.
• agent1.sources.source1_1.spoolDir is set to the input path on the local file system.
• agent1.sinks.hdfs-sink1_1.hdfs.path is set to the output path in HDFS.
Configuration Details
agent1.channels.fileChannel1_1.type=file
agent1.channels.fileChannel1_1.capacity=200000
agent1.channels.fileChannel1_1.transactionCapacity=1000
agent1.sources.source1_1.type=spooldir
agent1.sources.source1_1.spoolDir=/home/hadoopuser/Downloads/tmpload
agent1.sources.source1_1.fileHeader=false
agent1.sources.source1_1.fileSuffix=.COMPLETED
agent1.sinks.hdfs-sink1_1.type=hdfs
agent1.sinks.hdfs-sink1_1.hdfs.path=hdfs://localhost:9000/user/cloudera/flume_sink
agent1.sinks.hdfs-sink1_1.hdfs.batchSize=1000
agent1.sinks.hdfs-sink1_1.hdfs.rollSize=268435456
agent1.sinks.hdfs-sink1_1.hdfs.rollInterval=0
agent1.sinks.hdfs-sink1_1.hdfs.rollCount=50000000
agent1.sinks.hdfs-sink1_1.hdfs.writeFormat=Text
agent1.sinks.hdfs-sink1_1.hdfs.fileType=DataStream
agent1.sources.source1_1.channels=fileChannel1_1
agent1.sinks.hdfs-sink1_1.channel=fileChannel1_1
agent1.sinks=hdfs-sink1_1
agent1.sources=source1_1
agent1.channels=fileChannel1_1
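A common mistake in a configuration like this is a source or sink that points at a channel name the agent never declares. The Python sketch below is a hypothetical helper, not part of Flume: it parses key=value lines and checks that wiring. It relies only on two conventions visible in the configuration above, namely that a source lists its channels under `channels` and a sink names one channel under `channel`. The `CONFIG` string repeats just the wiring lines.

```python
def parse_properties(text):
    """Parse simple key=value lines, trimming whitespace around both parts."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems found in the source/channel/sink wiring."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channel")
        problems += [
            f"source {source}: unknown channel {name}"
            for name in used
            if name not in channels
        ]
    for sink in props.get(f"{agent}.sinks", "").split():
        used = props.get(f"{agent}.sinks.{sink}.channel", "")
        if used not in channels:
            problems.append(f"sink {sink}: unknown channel {used!r}")
    return problems


# Only the wiring lines from the configuration above.
CONFIG = """
agent1.sources.source1_1.channels=fileChannel1_1
agent1.sinks.hdfs-sink1_1.channel =fileChannel1_1
agent1.sinks= hdfs-sink1_1
agent1.sources=source1_1
agent1.channels=fileChannel1_1
"""

problems = check_wiring(parse_properties(CONFIG), "agent1")
print(problems or "wiring looks consistent")
```

Running a check like this before starting the agent is optional; it only catches naming mismatches and says nothing about whether the spool directory or the HDFS path exists.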
File placed in the tmpload folder
As the screen above shows, Flume stores the file inside HDFS.
Now we need to set up Pig to analyze the data stored in HDFS.
A = LOAD '/user/cloudera/flume_sink/FlumeData.1526646743902' USING PigStorage(' ') AS (Player_id:int, Year:chararray,
    Country:chararray, Opposition_Team:chararray, Runs_Scored:int, Balls_Played:int);
B = FOREACH A GENERATE Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played;
C = GROUP B BY Player_id;
-- Sum of runs scored per player only:
D = FOREACH C GENERATE group, SUM(B.Runs_Scored);
-- Sum of runs scored and balls played per player, as the problem statement asks:
D = FOREACH C GENERATE group, SUM(B.Runs_Scored), SUM(B.Balls_Played);
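To make the GROUP and SUM steps concrete, here is a minimal Python sketch of the same aggregation. It is only an illustration of what the Pig script computes, not a replacement for it. The sample rows are invented, and the sketch assumes comma-separated fields in the column order of the LOAD schema (Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played).

```python
import csv
from collections import defaultdict

# Hypothetical sample rows in the column order of the Pig schema.
SAMPLE = """\
1,2010,India,Australia,50,60
1,2011,India,England,70,80
2,2010,England,India,30,45
"""


def sum_by_player(lines):
    """Group rows by player id and sum runs scored and balls played."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.reader(lines):
        player_id, runs, balls = int(row[0]), int(row[4]), int(row[5])
        totals[player_id][0] += runs
        totals[player_id][1] += balls
    return {player: tuple(sums) for player, sums in totals.items()}


totals = sum_by_player(SAMPLE.splitlines())
for player, (runs, balls) in sorted(totals.items()):
    print(player, runs, balls)
```

Each output line corresponds to one tuple of relation D in the Pig script: the group key followed by the two sums.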
Conclusion
Folder logging/spooling is a broad area for analysis. Many applications write their feeds to a directory so that
reporting tools can analyze the data and the organization can benefit and grow from it. In this project we used
Flume to ingest CSV data that arrives periodically and analyzed it using the Pig language.
