This document discusses interactive notebooks for working with data. Notebooks allow users to explore data, create models, and share work in a centralized, interactive web interface. Popular notebook platforms include Jupyter, Apache Zeppelin, Spark Notebook, and RStudio. Notebooks provide benefits like interactivity, centralized access to data, and mixing of code and documentation but also have downsides like security risks, lack of versioning, and challenges in production. The document concludes by discussing risks and side effects of notebooks in enterprises, including new needs for data governance and lifecycle management.
5. www.kensu.io
A. NOTEBOOKS: WHAT ARE THEY
i. What is "working on data"?
ii. Power of Interactivity
iii. Centralisation and Share-ability
A. NOTEBOOKS: WHAT ARE THEY
i. What is "working on data"?
1. elaborate a business opportunity plan or hypothesis
2. refine the goals with the help of the business team
3. discover available data sources that are potentially interesting
4. connect to the data source (or copy it)
5. explore the data source content to get to know it better
6. create first models
7. decide whether the results are good enough to create the (data) product
8. if not,
   1. decide whether the data is worth keeping, or sufficient
   2. go back to step 2
A. NOTEBOOKS: WHAT ARE THEY
ii. Power of Interactivity
A data project (decision-support project) comes with anxiety.
The time to first results is rather long because of complexity.
The complexity can be due to:
- the data,
- the availability of data
- the environment,
- the business,
- the security,
-
And the first results will (most) probably not be good!
If these projects are considered IT projects, this leads to:
- a sense of lack of visibility
- communication problems
- failure
The need for frequent trial and error resulted in the explosion of
1. dynamic languages (rather than C, for instance)
2. and interpreted languages,
like Python and R!
This led to data projects mostly driven from shell exploration
and released as scripts.
The BI tools alternative is still valid,
but it is too constrictive to unleash the power of data science.
However, shells and scripts are awful tools for programming:
- line-by-line editing
- not persisted
- not shareable
A. NOTEBOOKS: WHAT ARE THEY
iii. Centralisation and Share-ability
To fix these problems, the community created notebooks:
1. Web-based
2. Direct results (incremental context)
3. Shareable (e.g. JSON)
Notebook-like tools already existed, however (e.g. MATLAB).
Notebook implementations started with IPython
(and still follow the same rules).
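To make the "shareable (e.g. JSON)" point concrete, here is a minimal sketch of a notebook held as plain JSON. It assumes the public Jupyter nbformat version 4 layout; the cell contents are invented for illustration, and other notebook platforms use their own formats.

```python
import json

# A minimal notebook document, assuming the Jupyter nbformat v4 layout.
# The cell contents are made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Exploring a data source"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["rows = [1, 2, 3]\n", "sum(rows)"],
        },
    ],
}

# Serialising to text is what makes the notebook easy to send or store.
serialised = json.dumps(notebook, indent=1)

# Whoever receives the text gets the same documentation and code back.
received = json.loads(serialised)
cell_types = [cell["cell_type"] for cell in received["cells"]]
print(cell_types)
```

Because documentation and code travel in one text file, a notebook can be mailed or stored like any other document, which a shell session cannot.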
Notebooks can
1. access data directly and run experiments
2. be installed as a service and centralise security
3. be shared (easily, compared to shell scripts)
B. NOTEBOOKS: WHICH ONES
i. Jupyter
http://jupyter.org/
ii. Apache Zeppelin
https://zeppelin.apache.org/
iii. Spark Notebook
http://spark-notebook.io/
iv. RStudio
https://www.rstudio.com/
v. (proprietary) Databricks
https://bit.ly/2U1xPlw
C. NOTEBOOKS: PROS
i. Interactivity
ii. Centralised
iii. Mix code and documentation
iv. Communication (IT <-> Data Folks)
v. BI Tool alternative
C. NOTEBOOKS: CONS
i. Security backdoor
ii. Highly dynamic, no traceability
iii. No/poor versioning
iv. Non-linear (code)
v. Non modular
vi. Poor production-readiness
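The "non-linear" and "no traceability" points can be illustrated with a small simulation. This is only a sketch: each string stands in for a notebook cell, the cell names and contents are hypothetical, and a shared dictionary plays the role of the kernel's state. No real notebook platform is involved.

```python
# Each string stands in for one notebook cell (hypothetical contents).
cells = {
    "load": "values = [10, 20, 30]",
    "scale": "values = [v * 2 for v in values]",
    "report": "total = sum(values)",
}


def run(order):
    """Execute the named cells in the given order against one shared namespace."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["total"]


# Running the cells once, top to bottom.
top_to_bottom = run(["load", "scale", "report"])

# Re-running the "scale" cell by hand, as often happens while exploring.
scale_rerun = run(["load", "scale", "scale", "report"])

print(top_to_bottom, scale_rerun)
```

The saved notebook looks identical in both cases, yet the reported total differs. Nothing in the file records which cells were run or in what order.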
X. NOTEBOOKS: MY 2¢
i. Why I created Spark Notebook
ii. Pick yours
Zeppelin for data engineers
Jupyter for data scientists
RStudio for R folks
Spark Notebook for Scala and/or Spark