
Table of Contents
1. Abstract
2. Architecture
3. Tools
4. Configurations
5. Code Snippets
6. Screenshots
7. References
Table of Figures
1 Log Analysis using Kafka Streaming
2 Log Analysis Web Page with Statistics
3 Top Endpoints
4 Frequent IP Addresses
5 Frequent IP Addresses Last Window
6 Spark Environment
7 Spark Jobs triggered during execution
8 RDD Storage
9 Streaming Statistics
10 Streaming Statistics after burst input

Abstract
This project analyzes logs that are streamed into Spark through Kafka. It provides an
interactive web page that shows the number of log records streamed, both over all time and in
the last time window, along with response code counts, frequent IP addresses, and top
endpoints by request frequency.
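The statistics listed above can be illustrated without Spark or Kafka. The following is a minimal sketch in plain Java, assuming the log lines follow the Apache Common Log Format (the report does not state the format). The class LogStatsSketch and its methods are hypothetical names chosen for illustration and are not part of the project code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: tallies the statistics named
// in the abstract from lines in Apache Common Log Format (an assumed format).
final class LogStatsSketch {
    // Capture groups: 1 = client IP, 2 = HTTP method, 3 = endpoint, 4 = response code.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) \\S+");

    // Counts how often each value of the given capture group occurs.
    // Lines that do not match the pattern are skipped.
    static Map<String, Long> countByGroup(List<String> lines, int group) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LOG_LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(group), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Returns up to n keys ordered by descending count.
    static List<String> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "10.0.0.1 - - [01/Dec/2015:10:00:00 -0800] \"GET /index.html HTTP/1.1\" 200 1024",
            "10.0.0.1 - - [01/Dec/2015:10:00:01 -0800] \"GET /about.html HTTP/1.1\" 404 512",
            "10.0.0.2 - - [01/Dec/2015:10:00:02 -0800] \"GET /index.html HTTP/1.1\" 200 1024");
        System.out.println("Response codes: " + countByGroup(lines, 4));
        System.out.println("Most frequent IP: " + topN(countByGroup(lines, 1), 1));
        System.out.println("Top endpoint: " + topN(countByGroup(lines, 3), 1));
    }
}
```

In the streaming application the same tallies would be computed per batch and per window rather than over an in-memory list; this sketch only shows the counting logic.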
Design and architecture:
Figure 1: Log Analysis using Kafka Streaming
Tools Used:
• Scala 2.10
• Java 8
• Apache Spark 1.5.2
• Apache Kafka 2.10-0.8.2.0
• Ubuntu Linux Server

Configurations:
Setting up a multi-broker Kafka cluster:
Start ZooKeeper
Kafka ships with a reasonable default ZooKeeper configuration for our simple use case. The
following command launches a local ZooKeeper instance.
bin/zookeeper-server-start.sh config/zookeeper.properties
Note: By default the ZooKeeper server listens on *:2181/tcp.
Configure and start the Kafka brokers
We will create two Kafka brokers whose configurations are based on the default
config/server.properties. Apart from the settings below, the configurations of the brokers are
identical.
The first broker:
Create the config file for broker 1
cp config/server.properties config/server1.properties
Edit config/server1.properties and replace the existing config values as follows:
broker.id=1
port=9092
log.dir=/tmp/kafka-logs-1

The second broker:
Create the config file for broker 2
cp config/server.properties config/server2.properties
Edit config/server2.properties and replace the existing config values as follows:
broker.id=2
port=9093
log.dir=/tmp/kafka-logs-2
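The two broker files differ only in broker.id, port, and log.dir. As a minimal sketch of that idea, the plain Java below derives a per-broker configuration from a shared base. The class BrokerConfigSketch and the method brokerConfig are hypothetical names for illustration; in practice the values are edited by hand in the copied properties files as described above.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka or the project: shows that each broker
// shares the base settings and overrides only three values.
final class BrokerConfigSketch {
    // Copies the base settings, then applies the per-broker overrides
    // used in this report (broker.id, port, log.dir).
    static Properties brokerConfig(Properties base, int brokerId, int port) {
        Properties config = new Properties();
        config.putAll(base);
        config.setProperty("broker.id", String.valueOf(brokerId));
        config.setProperty("port", String.valueOf(port));
        config.setProperty("log.dir", "/tmp/kafka-logs-" + brokerId);
        return config;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        // Stand-in for the contents of config/server.properties.
        base.setProperty("zookeeper.connect", "localhost:2181");
        System.out.println(brokerConfig(base, 1, 9092));
        System.out.println(brokerConfig(base, 2, 9093));
    }
}
```

Each broker needs a unique broker.id, its own port when several brokers share one host, and its own log directory.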
Now you can start each Kafka broker in a separate console:
Start first broker in its own terminal session:
bin/kafka-server-start.sh config/server1.properties
Start second broker in its own terminal session:
bin/kafka-server-start.sh config/server2.properties
Create a Kafka topic:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic topicOne --partitions 3 --replication-factor 2

Commands:
Start Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
Start Kafka Server (Broker):
bin/kafka-server-start.sh config/server.properties
Create topics:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topicOne
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topicTwo
Start Producer:
bin/kafka-console-producer.sh --broker-list localhost:9091,localhost:9092 --topic topicOne
bin/kafka-console-producer.sh --broker-list localhost:9091,localhost:9092 --topic topicTwo
Spark command to execute KafkaLogAnalyzerApplication:
Note: Locate the jar file according to your project hierarchy.
bin/spark-submit --class "com.cs696.bigdata.loganalyzer.KafkaLogAnalyzerApplication" --master local[20] projectFinal/app/java8/target/uber-log-analysis-1.0.jar --output_html_file /tmp/log_stats.html
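Everything that follows the jar path on the spark-submit command line is passed to the application's main method as arguments. The sketch below shows one simple way such "--name value" pairs could be read in plain Java. The class AppArgsSketch is a hypothetical illustration and is not how the project's own flag handling is necessarily implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: collects "--name value" pairs from the
// application arguments that spark-submit forwards after the jar path.
final class AppArgsSketch {
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].startsWith("--")) {
                // Strip the leading "--" and take the next token as the value.
                flags.put(args[i].substring(2), args[i + 1]);
                i++; // skip the value that was just consumed
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        String[] example = {"--output_html_file", "/tmp/log_stats.html"};
        System.out.println(parse(example).get("output_html_file"));
    }
}
```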

Code Snippets:
Integrating Apache Kafka with Log Analyzer Application:
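import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Logs are streamed in through Apache Kafka using multiple brokers, which are
// configured in the producer.properties file under the config directory.
// The topics flag is a comma-separated list; split it into a set of topic names.
HashSet<String> topicsSet = new HashSet<String>(
    Arrays.asList(LogAnalyzerFlags.getInstance().getTopics().split(",")));
// The direct stream reads the broker addresses from metadata.broker.list.
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", LogAnalyzerFlags.getInstance().getBrokers());
// Create a pair input DStream over the given brokers and topics.
// jssc is the application's JavaStreamingContext, created elsewhere.
JavaPairInputDStream<String, String> logRecords = KafkaUtils.createDirectStream(
    jssc,
    String.class,        // key type
    String.class,        // value type
    StringDecoder.class, // key decoder
    StringDecoder.class, // value decoder
    kafkaParams,
    topicsSet
);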
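The inputs to createDirectStream are built from two comma-separated flag values. The sketch below reproduces that preparation step in plain Java so it can be run without Spark or Kafka on the classpath. KafkaParamsSketch and its methods are hypothetical names; unlike the snippet above, this version also trims whitespace and drops empty entries, which is an added assumption about what is desirable.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the project: prepares the topic set and
// Kafka parameter map that a direct stream would be created from.
final class KafkaParamsSketch {
    // Splits a comma-separated topic list, trimming whitespace and
    // skipping empty entries (an assumption beyond the original snippet).
    static Set<String> topicSet(String commaSeparatedTopics) {
        Set<String> topics = new HashSet<>();
        for (String topic : commaSeparatedTopics.split(",")) {
            String trimmed = topic.trim();
            if (!trimmed.isEmpty()) {
                topics.add(trimmed);
            }
        }
        return topics;
    }

    // The broker list is passed through unchanged as host:port pairs.
    static Map<String, String> kafkaParams(String brokers) {
        Map<String, String> params = new HashMap<>();
        params.put("metadata.broker.list", brokers);
        return params;
    }

    public static void main(String[] args) {
        System.out.println(topicSet("topicOne,topicTwo"));
        System.out.println(kafkaParams("localhost:9092,localhost:9093"));
    }
}
```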

Screenshots:
Figure 2: Log Analysis Web Page with Statistics
Figure 3: Top Endpoints

Figure 4: Frequent IP Addresses
Figure 5: Frequent IP Addresses Last Window

Figure 6: Spark Environment
Figure 7: Spark Jobs triggered during execution

Figure 8: RDD Storage
Figure 9: Streaming Statistics

Figure 10: Streaming Statistics after burst input
References:
• http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
• https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/streaming.html
