This document summarizes a technical seminar on Banian, a cross-platform interactive query system for structured big data. The seminar covered the introduction of big data systems developed by companies like Google, Facebook, and Baidu. It also discussed Banian's system architecture, splitting and scheduling approach, and ability to perform cross-platform queries. The evaluation section showed that Banian's performance was 5-30 times better than Hive and that it has good scalability and compatibility.
1 of 21
Downloaded 42 times
More Related Content
banian
1. BNM INSTITUTE OF TECHNOLOGY
A Technical Seminar on
BANIAN: A CROSS-PLATFORM INTERACTIVE
QUERY SYSTEM FOR STRUCTURED BIG DATA
Under the Guidance of
Ms. Tulasi Sunitha M.
Assistant Professor
Dept. of CSE, BNMIT
4. INTRODUCTION
The GFS and MapReduce developed by Google could process 20 PB of
webpages per day in 2007.
The HDFS and HBase clusters developed by Facebook scanned 300
million images daily in 2012.
The search engine system developed by Baidu, could handle 100 PB of
data per day in 2013.
5. Continued.
At present, parallel database based on Massively Parallel Processing (MPP)
architecture can manage hundreds of TB of data.
MapReduce is a programming framework proposed by Google and a
typical technology for processing big data.
By combining HDFS with the splitting and scheduling model, Banian
effectively integrates large-scale storage management with interactive
query and analysis.
6. LITERATURE SURVEY
One line of research is incorporating MapReduce on the basis of MPP
database, such as Greenplum and Teradata.
Hive is the most typical example of SQL on Hadoop. It is used to map files
onto a database table and provide an SQL query interface.
Dremel is an interactive data analysis system proposed by Google.
7. Continued.
Impala is an MPP SQL query engine developed by Cloudera.
BlinkDB proposed by UC Berkeley is a large scale parallel processing
engine capable of running interactive SQL commands on PB level datasets.
Spark originated from the cluster computing platform at AMPLab, UC
Berkeley.
8. SYSTEM ARCHITECTURE
The architecture of Banian, which is divided into three main layers
according to logic functions: the storage layer, scheduling and execution
layer, and application layer.
9. Continued.
The storage layer contains three important interfaces as well,
I. The interface used for providing the data block distribution
information of the file to the scheduler module through NameNode;
II. The read/write interface of local data to the query engine module;
III. The read/write interface of HDFS to the ETL module.
10. Continued.
The scheduling and execution layer is the core component of Banian. It
contains three modules: Scheduler, Query Engine, and Metadata Server.
The scheduler receives SQL commands from the application layer.
The metadata server maintains a fast lookup table for caching data
block information.
The query engine is deployed on each sub-node. It is responsible for
receiving and executing the operation list allocated by the scheduler.
12. Continued.
The complete workflow of the scheduling and execution layer processing
SQL commands.
Grammatical and lexical analysis is conducted by the execution and
analysis units to generate the task tree after receiving SQL commands.
Traverse each entry on the task tree, query metadata server according to
table information, and obtain the corresponding file information.
Transform tasks into file operations, i.e., task tree into operation tree.
Query the fast lookup table, and go to Step 5 in the case of cache hit.
13. Traverse each entry on the operation tree, query HDFS NameNode
according to file information, and obtain the corresponding data block
position.
The coordinator unit sends the operation list to the query engine on the
corresponding sub-node.
The query engine initiates the workflow after receiving the operation list
and directly reads local data for further execution.
The aggregation unit collects all results from query engine and sends them
to the application layer.
14. Scheduler
The scheduler is a logical unit as opposed to a physical module. It is
composed of the scheduler daemons on each physical node.
16. Continued.
The SQL interface provides a command shell for users and forwards query
commands to the crossplatform module.
The crossplatform module queries the global table and gets the information of
Location.
The global table stores the configuration information of all platforms using a
data structure called Location.
struct Location{
char *tagname;
char *host;
int port;
int authority;
char *username;
char *password;
}
18. 2. TPC-H evaluation.
Figure 6.1: (a) Query time of Q1-Q5 on Banian and Hive using 1.2 PB dataset
D1. (b) query time of 22 SQL commands of TPC-H benchmark on banian and
Hive using 1TB dataset D2.
Load dataset D2 into Banian and Hive, and run a suite of business oriented
ad-hoc queries (22 SQL commands) from the TPC-H benchmark on our
experimental platform.
19. 3. Scalability evaluation.
Split dataset D1 and each node retains 12 TB of data. The table size
increase from 120 TB to 1.2 PB as the cluster size increases from 10
nodes to 100 nodes.
Figure 6.2: Query time of Q1-Q5 on Banian and Hive for cluster
size of 10, 20, 40, 60, 80 and 100 in sequence.
20. CONCLUSIONS
Banian combines HDFS with the splitting and scheduling engine of parallel
database.
This platform supports the storage of PB level data and interactive cross-
platform query.
The test results suggest that the performance of Banian is 530 times better
than that of Hive.
Banian employs a symmetrical structure having a loose coupling degree
and shows higher scalability and compatibility.
21. REFERENCES
[1] S. Ghemawat, H. Gobioff, and S. T. Leung, The Google file system, ACM SIGOPS
Operating Systems Review, vol. 37, no. 5, pp. 2943, 2003.
[2] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters,
Commun, of ACM, vol. 51, no. 1, pp. 107113, 2008.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop distributed file system, in
Proceedings of IEEE Conference on Mass Storage Systems and Technologies (MSST),
2010, pp. 110.
[4] HBase project, http://hbase.apache.org/, 2014.
[5] M. Li, L. Andrey, T. Sasu, and Y. Antti, MPTCP incast in data center networks, China
Communications, vol. 11, no. 4, pp. 2537, 2014.
[6] Greenplum Inc., Greenplum Database: Powering the data driven enterprise, the resources
http://www.greenplum.com/resources, 2014.