Vectorized Processing in a Nutshell. (in Korean)
Presented by Hyoungjun Kim, Gruter CTO and Apache Tajo committer, at DeView 2014, Sep. 30, Seoul, Korea.
Big Data Platform Field Case in MelOn (in Korean)
- Presented by Byeong-hwa Yoon, engineering manager at Loen Entertainment
- at Gruter TECHDAY 2014 Oct. 29 Seoul, Korea
Introduction to Apache Tajo: Data Warehouse for Big Data
Gruter
Apache Tajo is a top-level Apache project functioning as a big data warehouse, enabling ANSI-SQL compliant querying across diverse storage types while ensuring fault tolerance and efficient execution. The recent release highlights features like query federation, JDBC-based storage support, and self-describing data formats, enhancing performance and user experience. Tajo finds applications in various sectors, providing solutions for data analysis, ETL workloads, and ad-hoc queries, significantly reducing operational costs.
Introduction to Apache Tajo: Future of Data Warehouse
Gruter
Apache Tajo is an SQL-on-Hadoop system designed for efficient data processing, specializing in both long-running ETL jobs and low-latency interactive analysis. It supports ANSI-SQL standard compliance and various data formats, offering optimized performance and query plan optimization. Tajo has use cases in significant organizations like SK Telecom and Bluehole Studio, demonstrating its effectiveness in reducing analysis time and facilitating data-driven decisions.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
This document summarizes key aspects of structuring computation and data in Apache Spark using SQL, DataFrames, and Datasets. It discusses how structuring computation and data through these APIs enables optimizations like predicate pushdown and efficient joins. It also describes how data is encoded efficiently in Spark's internal format and how encoders translate between domain objects and Spark's internal representations. Finally, it introduces structured streaming as a high-level streaming API built on top of Spark SQL that allows running the same queries continuously on streaming data.
Hive on spark is blazing fast or is it final
Hortonworks
The document discusses the performance improvements of Apache Hive, particularly with the Stinger initiative that aimed to enhance its speed by 100x through better execution engines and columnar storage. It compares the performance of Hive on various execution engines, such as Tez and Spark, presenting benchmarks that show Hive on Tez is significantly faster than Hive on Spark for large datasets. The conclusion highlights Hive's evolving capabilities to potentially achieve sub-second query performance.
The document provides an overview of Apache Tajo, a big data warehouse system, highlighting its features, architecture, and use cases. The latest version, Tajo 0.10, includes improved support for HBase and AWS, as well as enhancements in SQL tool support and query performance. Future developments, such as nested data support and tablespace features, aim to enhance usability and integration with various data sources.
Efficient In-situ Processing of Various Storage Types on Apache Tajo
Gruter
The document discusses Apache Tajo, an open source data warehouse system that supports efficient in-situ processing of various storage types. It describes Tajo's architecture, how it supports different storage backends like HDFS, S3, HBase, and SQL databases. The document outlines Tajo's design for separating storage from data format and using tablespaces to manage different storage configurations. It also covers query optimization, operation pushdown, and status of current storage and format support in Tajo.
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
The document discusses real-time data processing in the telecommunications sector, specifically highlighting projects at SK Telecom that utilize Spark and Tajo for high-speed data processing, in-stream processing, and manufacturing optimization. It outlines the challenges of processing large volumes of data, the shift from traditional batch jobs to real-time analytics, and the benefits of using streaming technologies to achieve lower latency and higher efficiency. Key lessons emphasize the successful implementation of OLTP-style systems alongside in-stream processing capabilities.
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter
This document discusses setting up and using Tajo, an Apache Hadoop-based data warehousing system, on AWS. It provides instructions on using Tajo Cloud to easily configure a Tajo cluster on AWS. Examples show how to connect external data from S3, perform queries, and analyze customer cohort data to understand purchase patterns over time. Tajo allows direct access to data in S3 and dynamic scaling of worker nodes, and its connector enables remote querying from SQL clients, Excel, and R.
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter
Apache Tajo is a big data warehouse system built on Hadoop that supports SQL and provides advanced query optimization and distributed processing capabilities. Introduced as an Apache top-level project in March 2014, it offers features for complex query execution and performance improvements over similar platforms like Hive and Impala. The document details Tajo's architecture, use cases, performance benchmarks, and integration with the Hadoop ecosystem.
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter
The document discusses a case study of building an open source search engine. It describes how an open source search engine was implemented for a company to address issues with decreasing click-through rates and lack of technical knowledge. Elasticsearch was chosen as it allows for real-time search and analytics, is distributed and highly available. The implementation involved using Zookeeper for node management, caching for performance improvements, and custom plugins for functions like scoring. This allowed the company to gain technical capabilities, provide differentiated search, reduce costs, and better manage search quality and service integration.
Apache Tajo is a big data warehouse system designed for Hadoop that offers SQL standard support with advanced query optimization and distributed processing capabilities. It enables efficient ETL and interactive analytics, outperforming traditional systems like Hive and Impala. Tajo operates seamlessly within the Hadoop ecosystem and is supported by a strong open-source community.
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Gruter
Apache Tajo is a big data warehouse system built on Hadoop, supporting SQL standards and providing low-latency queries. The project is currently in beta, with key features developed and is fully community-driven. Future developments include improvements in scheduling and execution speed, among other optimizations.
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Gruter
The document outlines the features and optimization techniques of Apache Tajo, an open-source SQL-on-Hadoop system, focusing on query optimization and JIT-based vectorized execution. Key discussions include join order optimization, progressive optimization, and vectorized processing for improving query performance, as well as detailed implementation strategies. Future plans involve enhancing the system's capabilities to further optimize processing efficiency and reduce CPU usage.
This document summarizes the results of a performance test comparing Hive, Impala, and Tajo on queries against a 1.7TB dataset. Tajo outperformed Hive and Impala on scans with filters and joins. For queries with grouping, aggregation, and sorting, Tajo was faster than Hive and similar to or faster than Impala. The author concludes that even though Tajo materializes all results to HDFS, its performance is promising compared to Impala due to its dynamic task scheduling. Further performance enhancements are expected as the Tajo project continues.
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Gruter
This document introduces Hyunsik Choi and provides an overview of his background and experience. It states that he has a PhD in computer science and engineering, is currently the director of research at Gruter Corp in Seoul, and has been a long-time open source contributor to Apache Tajo and Apache Giraph. The document also outlines his plan to give a presentation introducing Apache Tajo, covering topics like its architecture, distributed processing model, and query optimization approach.
2. About me
Gruter Corp / BigData Engineer
Apache Tajo Committer
jhjung@gruter.com
Home Page: http://blrunner.com
Twitter: @blrunner78
Author of a Hadoop book
6. 1.2 Tajo architecture
[Diagram: Tajo architecture]
- Client: JDBC, TSql, Web UI
- Master Server (HA): TajoMaster (shown twice), CatalogStore (DBMS, HiveMetastore)
- Slave Server (three shown), each listing: TajoWorker, QueryMaster, Local Query Engine, StorageManager, Local FileSystem, HDFS
- Arrow labels: Submit a Query; Manage metadata; Allocate a query; Send task & monitor (twice)
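The arrow labels in the diagram above (submit a query, manage metadata, allocate a query, send task and monitor) suggest a simple control flow. The toy model below is only a sketch of that flow under stated assumptions, not Tajo's code: the class names `Master`, `QueryMaster`, and `Worker`, the fragment-count catalog, and the round-robin task placement are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Worker:
    """Hypothetical stand-in for a TajoWorker on a slave server."""
    name: str
    finished: list = field(default_factory=list)

    def run_task(self, task: str) -> str:
        # A real worker would scan and process data; this only records the task.
        self.finished.append(task)
        return f"{task}@{self.name}"


class QueryMaster:
    """Hypothetical per-query coordinator: sends tasks and collects results."""

    def __init__(self, workers):
        self.workers = workers

    def execute(self, tasks):
        # Round-robin placement is an assumption made for this sketch only.
        placement = cycle(self.workers)
        return [next(placement).run_task(task) for task in tasks]


class Master:
    """Hypothetical stand-in for TajoMaster: keeps metadata, allocates queries."""

    def __init__(self, workers, catalog):
        self.workers = workers
        self.catalog = catalog  # table name -> fragment count (made-up metadata)

    def submit(self, table: str):
        fragments = self.catalog[table]            # "Manage metadata"
        query_master = QueryMaster(self.workers)   # "Allocate a query"
        tasks = [f"scan:{table}:{i}" for i in range(fragments)]
        return query_master.execute(tasks)         # "Send task & monitor"


workers = [Worker("worker-1"), Worker("worker-2"), Worker("worker-3")]
master = Master(workers, catalog={"lineitem": 4})
results = master.submit("lineitem")               # "Submit a Query"
print(results)
```

With four fragments and three workers, the fourth task wraps around to the first worker. A real scheduler would more plausibly place tasks by data locality and load, which this sketch does not attempt.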