ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Airflow
Workflow Management System
previa [at] gmail.com
Outline
? Why Airflow ? & What about it
? Workflow (DAG)
? Architecture
? DEMO
? Scheduler
? Jinja2 & macros
? Command
? Backfill
? Q/A
Why Airflow ?
¡ñ How to control it ?
¡ð Complicate and implicit
¡ñ How to schedule it?
¡ð Time and event driven
¡ñ How to deal with failure ?
¡ð Retry, notify, logging
ETL Hell
REF:http://tinyurl.com/yclq3slz
Why Airflow ?
Considerations
¡ð General purpose
¡ð Backfilling
¡ð Rich UI
¡ð Flow define
REF: http://tinyurl.com/ycadwhta
Airflow introduction
Why Airflow ?
Considerations
¡ð General purpose
¡ð Backfilling
¡ð Rich UI
¡ð Flow define
We chose Airflow
REF: http://tinyurl.com/ycadwhta
About Airflow
¡ñ Open Sourced by Airbnb in June 2015
¡ñ Joined ASF¡¯s incubation program in March 2016
¡ñ 660+ contributors, 5.7k+ commits, 10k+ stars
¡ñ Used by 200+ companies :
¡ð Adobe, Airbnb, HBO, Intel, iFTTT, Lyft, PayPal, Pandora, Quora, Reddit,
similarweb, Tesla, Twitter, vevo, 9GAG, Square, Yahoo, ..
About Airflow
¡ñ Open Sourced by Airbnb in June 2015
¡ñ Joined ASF¡¯s incubation program in March 2016
¡ñ 660+ contributors, 5.7k+ commits, 10k+ stars
¡ñ Used by 200+ companies :
¡ð Adobe, Airbnb, HBO, Intel, iFTTT, Lyft, PayPal, Pandora, Quora, Reddit,
similarweb, Tesla, Twitter, vevo, 9GAG, Square, Yahoo, ..
¡ð Wondershare since Dec 2018
¡ñ Currently we got working flow use it
¡ð Order, User, sales insight, ¡­
¡ð Cluster maintenance
Task
Operator
TaskInstance
Workflow (DAG)
Workflow
DAG(Directed acyclic graph)
DagRun
REF:http://tinyurl.com/ydxyuscd
Workflow (DAG)
¡ñ Operator
¡ð Execute
¡ö Bash, Ssh, Python
¡ð DB
¡ö MySQL, MsSQL,
Postgres, Presto,
Redshift, Hive
¡ð WebService
¡ö Http, Email, Slack, S3
Workflow (DAG)
¡ñ Sensor
¡ð FileSensor, Database,
Sagemaker, Redshift, Imap,
Bigquery,, Cassandra, EMR,
Ftp, SFtp, Hdfs, Jira, Azure,
...
¡ñ Flow control
¡ð BranchPythonOperator
¡ö Pass data through tasks
=> XCom
Workflow (DAG)
? Skip
¨C Raise AirflowSkipExection
? Join (Trigger Rules)
Architecture(Multi-Node)
? parallelism
? dag_concurrency
? max_active_runs_per_dag
? non_pooled_task_slot_count
Architecture(Multi-Node)
REF:http://tinyurl.com/y6uzjcsg
Dag, Metadata
UI
DEMO
http://longfei.leanote.com/post/airflow-operations
Scheduler
¡ñ start_date: 2019-01-01
¡ñ schedule_interval: 1 1 * * *
# real run_date execution_date
N/A 2019-01-01 01:01
1 2019-01-02 01:01 2019-01-01
2 2019-01-03 01:01 2019-01-02
execution_date = start_date + #N-th * interval
¡ñ Airflow using UTC ( ~1.10.x)
¡ð Patch source code
¡ö http://tinyurl.com/yatqngh8
¡ð Modify dag file
¡ö http://tinyurl.com/y8kmzz65
¡ñ interval also support @hourly, @daily
¡ð using crontab form to avoid stress peak(´í·å)
¡ñ set depends_on_past to true if needed
Jinja2 & macros
? Leverage macros & jinja2
? Let airflow control time
¨C Don¡¯t manipulate time in code,
use args instead
Variable Description
{{ ds }} the execution date as YYYY-MM-DD
{{ ds_nodash }} the execution date as YYYYMMDD
{{ [yesterday, tomorrow]_ds }} yesterday¡¯s date as YYYY-MM-DD
{{ [yesterday, tomorrow]_ds_nodash }} yesterday¡¯s date as YYYYMMDD
{{ ts }} execution_date.isoformat()
{{ dag }}, {{ task }} the DAG object, the Task object
{{ task_instance }}, {{ ti }} the task_instance object
{{ params }} user-defined params dictionary
Command
? Run in background
¨C webserver
¨C scheduler
¨C flower / worker
? Backfill
¨C New metric need historical data
¨C Re-run failed tasks
¨C Options
? --donot_pickle, --dry_run,
--rerun_failed_tasks,
--Ignore_dependencies
? Develop & Test
¨C list_tasks / list_dags
¨C server_logs
¨C run / test
¨C trigger_dag
Security
? Support
¨C Password
¨C LDAP
¨C Custom Auth
¨C Kerberos
¨C OAuth
? Github Enterprise Auth
? Google OAuth
? By default
¨C All access are open
? Secure access via SSL (https)
? REF
¨C http://tinyurl.com/y8nkzvsy
Reference
¡ñ Developing elegant workflows in Python code with Apache Airflow
(2017 PyCon@Euro)
¡ð https://www.youtube.com/watch?v=XJf-f56JbFM&t=1257
¡ñ Modern ETL-ing with Python and Airflow (and Spark) - (2017
PyCon@De)
¡ð https://www.youtube.com/watch?v=tcJhSaowzUI
¡ñ Apache Airflow in Production: A Fictional Example
¡ð https://www.youtube.com/watch?v=iTg-a4icf_I
¡ñ A Practical Introduction to Airflow (2016 PyData@SF)
¡ð https://www.youtube.com/watch?v=cHATHSB_450
Q/A

More Related Content

Airflow introduction

  • 2. Outline ? Why Airflow ? & What about it ? Workflow (DAG) ? Architecture ? DEMO ? Scheduler ? Jinja2 & macros ? Command ? Backfill ? Q/A
  • 3. Why Airflow ? ¡ñ How to control it ? ¡ð Complicate and implicit ¡ñ How to schedule it? ¡ð Time and event driven ¡ñ How to deal with failure ? ¡ð Retry, notify, logging ETL Hell REF:http://tinyurl.com/yclq3slz
  • 4. Why Airflow ? Considerations ¡ð General purpose ¡ð Backfilling ¡ð Rich UI ¡ð Flow define REF: http://tinyurl.com/ycadwhta
  • 6. Why Airflow ? Considerations ¡ð General purpose ¡ð Backfilling ¡ð Rich UI ¡ð Flow define We chose Airflow REF: http://tinyurl.com/ycadwhta
  • 7. About Airflow ¡ñ Open Sourced by Airbnb in June 2015 ¡ñ Joined ASF¡¯s incubation program in March 2016 ¡ñ 660+ contributors, 5.7k+ commits, 10k+ stars ¡ñ Used by 200+ companies : ¡ð Adobe, Airbnb, HBO, Intel, iFTTT, Lyft, PayPal, Pandora, Quora, Reddit, similarweb, Tesla, Twitter, vevo, 9GAG, Square, Yahoo, ..
  • 8. About Airflow ¡ñ Open Sourced by Airbnb in June 2015 ¡ñ Joined ASF¡¯s incubation program in March 2016 ¡ñ 660+ contributors, 5.7k+ commits, 10k+ stars ¡ñ Used by 200+ companies : ¡ð Adobe, Airbnb, HBO, Intel, iFTTT, Lyft, PayPal, Pandora, Quora, Reddit, similarweb, Tesla, Twitter, vevo, 9GAG, Square, Yahoo, .. ¡ð Wondershare since Dec 2018 ¡ñ Currently we got working flow use it ¡ð Order, User, sales insight, ¡­ ¡ð Cluster maintenance
  • 9. Task Operator TaskInstance Workflow (DAG) Workflow DAG(Directed acyclic graph) DagRun REF:http://tinyurl.com/ydxyuscd
  • 10. Workflow (DAG) ¡ñ Operator ¡ð Execute ¡ö Bash, Ssh, Python ¡ð DB ¡ö MySQL, MsSQL, Postgres, Presto, Redshift, Hive ¡ð WebService ¡ö Http, Email, Slack, S3
  • 11. Workflow (DAG) ¡ñ Sensor ¡ð FileSensor, Database, Sagemaker, Redshift, Imap, Bigquery,, Cassandra, EMR, Ftp, SFtp, Hdfs, Jira, Azure, ... ¡ñ Flow control ¡ð BranchPythonOperator ¡ö Pass data through tasks => XCom
  • 12. Workflow (DAG) ? Skip ¨C Raise AirflowSkipExection ? Join (Trigger Rules)
  • 13. Architecture(Multi-Node) ? parallelism ? dag_concurrency ? max_active_runs_per_dag ? non_pooled_task_slot_count
  • 16. Scheduler ¡ñ start_date: 2019-01-01 ¡ñ schedule_interval: 1 1 * * * # real run_date execution_date N/A 2019-01-01 01:01 1 2019-01-02 01:01 2019-01-01 2 2019-01-03 01:01 2019-01-02 execution_date = start_date + #N-th * interval ¡ñ Airflow using UTC ( ~1.10.x) ¡ð Patch source code ¡ö http://tinyurl.com/yatqngh8 ¡ð Modify dag file ¡ö http://tinyurl.com/y8kmzz65 ¡ñ interval also support @hourly, @daily ¡ð using crontab form to avoid stress peak(´í·å) ¡ñ set depends_on_past to true if needed
  • 17. Jinja2 & macros ? Leverage macros & jinja2 ? Let airflow control time ¨C Don¡¯t manipulate time in code, use args instead Variable Description {{ ds }} the execution date as YYYY-MM-DD {{ ds_nodash }} the execution date as YYYYMMDD {{ [yesterday, tomorrow]_ds }} yesterday¡¯s date as YYYY-MM-DD {{ [yesterday, tomorrow]_ds_nodash }} yesterday¡¯s date as YYYYMMDD {{ ts }} execution_date.isoformat() {{ dag }}, {{ task }} the DAG object, the Task object {{ task_instance }}, {{ ti }} the task_instance object {{ params }} user-defined params dictionary
  • 18. Command ? Run in background ¨C webserver ¨C scheduler ¨C flower / worker ? Backfill ¨C New metric need historical data ¨C Re-run failed tasks ¨C Options ? --donot_pickle, --dry_run, --rerun_failed_tasks, --Ignore_dependencies ? Develop & Test ¨C list_tasks / list_dags ¨C server_logs ¨C run / test ¨C trigger_dag
  • 19. Security ? Support ¨C Password ¨C LDAP ¨C Custom Auth ¨C Kerberos ¨C OAuth ? Github Enterprise Auth ? Google OAuth ? By default ¨C All access are open ? Secure access via SSL (https) ? REF ¨C http://tinyurl.com/y8nkzvsy
  • 20. Reference ¡ñ Developing elegant workflows in Python code with Apache Airflow (2017 PyCon@Euro) ¡ð https://www.youtube.com/watch?v=XJf-f56JbFM&t=1257 ¡ñ Modern ETL-ing with Python and Airflow (and Spark) - (2017 PyCon@De) ¡ð https://www.youtube.com/watch?v=tcJhSaowzUI ¡ñ Apache Airflow in Production: A Fictional Example ¡ð https://www.youtube.com/watch?v=iTg-a4icf_I ¡ñ A Practical Introduction to Airflow (2016 PyData@SF) ¡ð https://www.youtube.com/watch?v=cHATHSB_450
  • 21. Q/A