This document provides an overview of the Cascading and Lingual frameworks:
- Cascading is a Java API that separates business logic from data integration and allows SQL and non-SQL flows to work together. Lingual provides an ANSI SQL interface on top of Cascading.
- Lingual exposes two interfaces, a Cascading Java API and a JDBC driver, so that both systems and people can query Hadoop data directly.
- Examples demonstrate running SQL queries from R using the JDBC driver, integrating custom data formats via provider APIs, and visualizing/monitoring Cascading applications.
2. CHRIS K WENSEL
- Not a "data scientist"
- No idea what "big data" means
- Used MR in anger once, and did it wrong
- Author of Cascading
- Co-Author of Lingual (w/ Julian Hyde)
11. HADOOP & BIG DATA
"the speed of innovation is proportional to the arrival rate of answers to questions"
12. CASCADING
True when you are questioning Data, Algorithms, and Architecture
13. CASCADING
- Java API (alternative to Hadoop MapReduce)
- Separates business logic from integration
- Testable at every lifecycle stage
- Works with any JVM language
- Many integration adapters
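The idea of separating business logic from integration can be illustrated with a minimal sketch in plain Java. This is not the Cascading API: the names Source, Sink, upperCaseNonEmpty, and run are hypothetical and exist only to show why logic that does not know about storage is easy to test with in-memory stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical names throughout; this is NOT the Cascading API.
class SeparationSketch {

    // Integration side: where records come from and where they go.
    interface Source { List<String> read(); }
    interface Sink { void write(List<String> records); }

    // Business logic: a pure function over records, unaware of storage.
    static List<String> upperCaseNonEmpty(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            String trimmed = r.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed.toUpperCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Wiring: the same logic can be bound to any source and any sink.
    static void run(Source source, Function<List<String>, List<String>> logic, Sink sink) {
        sink.write(logic.apply(source.read()));
    }

    public static void main(String[] args) {
        // In-memory stand-ins replace real storage, as they would in a unit test.
        List<String> captured = new ArrayList<>();
        run(() -> Arrays.asList("alice", " ", "bob"),
            SeparationSketch::upperCaseNonEmpty,
            captured::addAll);
        System.out.println(captured); // prints [ALICE, BOB]
    }
}
```

Swapping the lambda source and the list-backed sink for file or cluster-backed ones would leave upperCaseNonEmpty untouched, which is the property the slide describes.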
[Diagram: Cascading architecture. Components shown: Enterprise Java and Scripting (Scala, Clojure, JRuby, Jython, Groovy); Processing API, Integration API, Scheduler API; Process Planner; Scheduler; Compute; Data Stores.]
15. CASCADING
- Started in 2007
- 2.0 released June 2012
- 2.5 stable out now
- 3.0 WIP now available
- Tez support coming soon
- Apache-licensed open source
- Supports all Hadoop 1 & 2 distros
21. WHY LINGUAL?
Liberate the data trapped on Hadoop without involving an engineer
22. LINGUAL
- ANSI Compatible SQL
- JDBC Driver
- Cascading Java API
- SQL Command Shell
- Catalog Manager Tool
- Data Provider API
[Diagram: Lingual architecture. Components shown: CLI / Shell and Enterprise Java; JDBC API, Lingual API, Provider API; Query Planner; Catalog; Cascading; Compute; Data Stores.]
23. ANSI SQL
- SQL-92
- Character, Numeric, and Temporal types
- IN and CASE
- FROM sub-queries
- CAST and CONVERT
- CURRENT_*
http://docs.cascading.org/lingual/1.1/#sql-support
32. SUB-QUERY
select dept_no, avg( max_salary )
from employees.dept_emp,
  ( select emp_no as sal_emp_no, max( salary ) as max_salary
    from employees.salaries
    group by emp_no )
where dept_emp.emp_no = sal_emp_no
group by dept_no;
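To make the two aggregation levels of this query concrete, here is a minimal sketch that computes the same result over a few in-memory rows in plain Java. The sample rows and all class and method names are hypothetical; in practice the planner does this work against the tables in the catalog. The average is returned as a double here, which may differ from the numeric type a SQL engine would return.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory rows stand in for employees.salaries and employees.dept_emp.
class SubQuerySketch {

    // Inner query: select emp_no, max( salary ) ... group by emp_no
    static Map<Integer, Integer> maxSalaryByEmployee(int[][] salaries) {
        Map<Integer, Integer> max = new TreeMap<>();
        for (int[] row : salaries) {               // row = {emp_no, salary}
            max.merge(row[0], row[1], Integer::max);
        }
        return max;
    }

    // Outer query: join dept_emp to the inner result on emp_no,
    // then avg( max_salary ) ... group by dept_no
    static Map<String, Double> avgMaxSalaryByDept(String[][] deptEmp, int[][] salaries) {
        Map<Integer, Integer> maxSalary = maxSalaryByEmployee(salaries);
        Map<String, long[]> sumAndCount = new TreeMap<>();
        for (String[] row : deptEmp) {             // row = {emp_no, dept_no}
            Integer max = maxSalary.get(Integer.parseInt(row[0]));
            if (max == null) {
                continue;                          // join semantics: no salary rows, no output row
            }
            long[] acc = sumAndCount.computeIfAbsent(row[1], k -> new long[2]);
            acc[0] += max;                         // running sum of max salaries
            acc[1]++;                              // number of employees seen in this department
        }
        Map<String, Double> avg = new TreeMap<>();
        sumAndCount.forEach((dept, acc) -> avg.put(dept, (double) acc[0] / acc[1]));
        return avg;
    }

    public static void main(String[] args) {
        int[][] salaries = { {1, 50}, {1, 70}, {2, 40}, {3, 90} };
        String[][] deptEmp = { {"1", "d001"}, {"2", "d001"}, {"3", "d002"} };
        System.out.println(avgMaxSalaryByDept(deptEmp, salaries)); // prints {d001=55.0, d002=90.0}
    }
}
```

Employee 1 contributes their highest salary (70), not both salary rows, which is the reason the query needs the sub-query before averaging per department.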
33. ACCESS HADOOP FROM R
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
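Because the driver is a standard JDBC driver, the same query could presumably be issued from Java through java.sql. The sketch below reuses the driver class name and the URL shape from the R example; the class LingualJdbcSketch and its methods are hypothetical helpers, and running the query method assumes the Lingual JDBC jar is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; only the driver name and URL shape come from the R example.
class LingualJdbcSketch {

    static final String DRIVER = "cascading.lingual.jdbc.Driver";

    // Builds a URL of the form jdbc:lingual:local;catalog=<path>;schema=<name>
    static String localUrl(String catalogPath, String schema) {
        return "jdbc:lingual:local;catalog=" + catalogPath + ";schema=" + schema;
    }

    // Runs a query and returns each row as a tab-separated string.
    static List<String> query(String url, String sql)
            throws SQLException, ClassNotFoundException {
        Class.forName(DRIVER); // fails unless the Lingual JDBC jar is on the classpath
        List<String> rows = new ArrayList<>();
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(sql)) {
            int columns = resultSet.getMetaData().getColumnCount();
            while (resultSet.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= columns; i++) { // JDBC columns are 1-indexed
                    if (i > 1) {
                        row.append('\t');
                    }
                    row.append(resultSet.getString(i));
                }
                rows.add(row.toString());
            }
        }
        return rows;
    }
}
```

A call would look like LingualJdbcSketch.query(LingualJdbcSketch.localUrl(catalogPath, "EMPLOYEES"), "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'"), where catalogPath points at the same tables directory used in the R example.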
36. DATA PROVIDER API
- Any Cascading Tap and/or Scheme can be used from JDBC
- Use a "fat jar" on local disk or from a Maven repo
- cascading-jdbc:cascading-jdbc-oracle-provider:1.0
- The jar is dynamically loaded into the cluster, on the fly
40. DRIVEN
- Understand how your application maps onto your cluster
- Identify bottlenecks (data, code, or the system)
- Jump to the line of code implicated on a failure
- Plugin available via Maven repo
- Beta UI hosted online
http://cascading.io/driven/