This document provides an overview of Apache Airflow, an open-source workflow management system. It describes Airflow's key features like workflow definition using directed acyclic graphs (DAGs), rich UI, scheduler, operators for tasks like databases and web services, and use of Jinja templating. The document also discusses Airflow's architecture with parallel execution, UI, command line operations like backfilling, and security features. Airflow is used by over 200 companies for workflows like ETL, analytics, and machine learning pipelines.
2. Outline
● Why Airflow? & What about it
● Workflow (DAG)
● Architecture
● DEMO
● Scheduler
● Jinja2 & macros
● Command
● Backfill
● Q/A
3. Why Airflow?
● How to control it?
○ Complicated and implicit
● How to schedule it?
○ Time- and event-driven
● How to deal with failure?
○ Retry, notify, logging
ETL Hell
REF: http://tinyurl.com/yclq3slz
7. About Airflow
● Open-sourced by Airbnb in June 2015
● Joined the ASF's incubation program in March 2016
● 660+ contributors, 5.7k+ commits, 10k+ stars
● Used by 200+ companies:
○ Adobe, Airbnb, HBO, Intel, IFTTT, Lyft, PayPal, Pandora, Quora, Reddit, SimilarWeb, Tesla, Twitter, Vevo, 9GAG, Square, Yahoo, ...
○ Wondershare since Dec 2018
● Workflows we currently run on it:
○ Order, User, sales insight, …
○ Cluster maintenance
16. Scheduler
● start_date: 2019-01-01
● schedule_interval: 1 1 * * *
#    real run_date       execution_date
N/A  2019-01-01 01:01    -
1    2019-01-02 01:01    2019-01-01
2    2019-01-03 01:01    2019-01-02
execution_date(N) = first scheduled time + (N - 1) * interval
real run_date(N) = execution_date(N) + interval
● Airflow uses UTC (~1.10.x)
○ Patch source code
■ http://tinyurl.com/yatqngh8
○ Modify DAG file
■ http://tinyurl.com/y8kmzz65
● schedule_interval also supports presets such as @hourly and @daily
○ Use the crontab form to stagger start times and avoid load peaks
● Set depends_on_past to True if needed
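The run-date arithmetic in the table on this slide can be sketched in plain Python, without importing Airflow. The helper name `schedule_table` is hypothetical, and a fixed one-day `timedelta` stands in for the cron expression `1 1 * * *`; this mirrors the slide's table under those assumptions and is not the scheduler's actual code.

```python
from datetime import datetime, timedelta


def schedule_table(first_schedule, interval, runs):
    """Return (n, real_run_date, execution_date) for the first `runs` DAG runs.

    Assumption: run N covers the interval that starts at execution_date and
    is only triggered once that interval has ended.
    """
    rows = []
    for n in range(1, runs + 1):
        execution_date = first_schedule + (n - 1) * interval
        real_run_date = execution_date + interval
        rows.append((n, real_run_date, execution_date))
    return rows


# First cron tick at or after start_date 2019-01-01 for "1 1 * * *".
first_tick = datetime(2019, 1, 1, 1, 1)
for n, run_date, execution_date in schedule_table(first_tick, timedelta(days=1), 2):
    print(n, run_date.strftime("%Y-%m-%d %H:%M"), execution_date.strftime("%Y-%m-%d"))
```

The two printed rows correspond to runs 1 and 2 in the table: each run starts one interval after the execution_date it is labelled with.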
17. Jinja2 & macros
● Leverage macros & Jinja2
● Let Airflow control time
– Don't manipulate time in code; use args instead
Variable                             Description
{{ ds }}                             the execution date as YYYY-MM-DD
{{ ds_nodash }}                      the execution date as YYYYMMDD
{{ yesterday_ds }}, {{ tomorrow_ds }}                 the day before / after the execution date as YYYY-MM-DD
{{ yesterday_ds_nodash }}, {{ tomorrow_ds_nodash }}   the day before / after the execution date as YYYYMMDD
{{ ts }}                             execution_date.isoformat()
{{ dag }}, {{ task }}                the DAG object, the Task object
{{ task_instance }}, {{ ti }}        the task_instance object
{{ params }}                         user-defined params dictionary
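To see what these variables expand to, the sketch below rebuilds a few of them with the standard library for one execution date. `template_values` is a hypothetical helper, not an Airflow API, and it uses a naive datetime, so `ts` carries no timezone suffix here; in a real DAG, Airflow fills the values in when it renders a templated field.

```python
from datetime import datetime, timedelta


def template_values(execution_date):
    """Approximate a few Airflow template variables for one execution date."""
    yesterday = execution_date - timedelta(days=1)
    tomorrow = execution_date + timedelta(days=1)
    return {
        "ds": execution_date.strftime("%Y-%m-%d"),
        "ds_nodash": execution_date.strftime("%Y%m%d"),
        "yesterday_ds": yesterday.strftime("%Y-%m-%d"),
        "tomorrow_ds": tomorrow.strftime("%Y-%m-%d"),
        "ts": execution_date.isoformat(),
    }


values = template_values(datetime(2019, 1, 1, 1, 1))
# A templated argument such as "{{ ds }}" would be rendered as values["ds"].
print(values["ds"], values["ds_nodash"], values["yesterday_ds"], values["ts"])
```

Because the date comes from the run's execution date and not from the wall clock, re-running or backfilling the same run produces the same values.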
18. Command
● Run in background
– webserver
– scheduler
– flower / worker
● Backfill
– A new metric needs historical data
– Re-run failed tasks
– Options
■ --donot_pickle, --dry_run, --rerun_failed_tasks, --ignore_dependencies
● Develop & Test
– list_tasks / list_dags
– serve_logs
– run / test
– trigger_dag
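A backfill invocation using the options on this slide can be assembled as follows. The sketch assumes the Airflow 1.x CLI form `airflow backfill -s START -e END dag_id`; `build_backfill_command` and the DAG id `sales_insight` are hypothetical, and the command is only printed, not executed.

```python
import shlex


def build_backfill_command(dag_id, start, end, dry_run=False, rerun_failed=False):
    """Build the argument list for an `airflow backfill` call (Airflow 1.x style)."""
    cmd = ["airflow", "backfill", "-s", start, "-e", end]
    if dry_run:
        cmd.append("--dry_run")  # show what would run without executing tasks
    if rerun_failed:
        cmd.append("--rerun_failed_tasks")  # retry tasks that failed earlier
    cmd.append(dag_id)
    return cmd


# "sales_insight" is a made-up DAG id used only for illustration.
cmd = build_backfill_command("sales_insight", "2019-01-01", "2019-01-07", dry_run=True)
print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine with Airflow installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; starting with `dry_run=True` is a cheap way to check the date range first.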
19. Security
● Supported authentication
– Password
– LDAP
– Custom Auth
– Kerberos
– OAuth
■ GitHub Enterprise Auth
■ Google OAuth
● By default
– All access is open
● Secure access via SSL (HTTPS)
● REF
– http://tinyurl.com/y8nkzvsy
20. Reference
● Developing elegant workflows in Python code with Apache Airflow (EuroPython 2017)
○ https://www.youtube.com/watch?v=XJf-f56JbFM&t=1257
● Modern ETL-ing with Python and Airflow (and Spark) (PyCon.DE 2017)
○ https://www.youtube.com/watch?v=tcJhSaowzUI
● Apache Airflow in Production: A Fictional Example
○ https://www.youtube.com/watch?v=iTg-a4icf_I
● A Practical Introduction to Airflow (PyData SF 2016)
○ https://www.youtube.com/watch?v=cHATHSB_450