SLIDE 1

Building Reusable and Trustworthy pipelines

1 — Airflow Summit 2020, @nehiljain

SLIDE 2

Outline

  • 1. Context
  • 2. Design Requirements
  • 3. Proposed Solution
  • 4. Example Code

SLIDE 3

Context

SLIDE 4

Hello!

▸ Data engineer @ SnapTravel
▸ SnapTravel
▸ M-commerce startup
▸ Data team: 8, Data Sources: 86
▸ Data infrastructure, Data engineering, Analytics engineering
▸ + + + stack

SLIDE 5

Purpose

▸ Share BI pipelines
▸ Community with lessons learnt
▸ Feedback

SLIDE 6

How is my company doing?

gross_revenue

contribution_margin

number_of_active_users

retention_rate

conversion_rate

SLIDE 7

How's my Airflow repo?

number_prs_merged

number_prs_closed_without_merge

number_prs_opened

number_of_commits
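
A minimal sketch of how the pull-request metrics named on this slide could be counted, assuming a list of already-extracted PR records. The record fields (`state`, `merged`) and the sample data are hypothetical and are not taken from the talk or from any API. The sketch also assumes that every record counts towards number_prs_opened.

```python
from collections import Counter

# Hypothetical pull-request records; the field names are illustrative only.
pull_requests = [
    {"number": 1, "state": "closed", "merged": True},
    {"number": 2, "state": "closed", "merged": False},
    {"number": 3, "state": "open", "merged": False},
    {"number": 4, "state": "closed", "merged": True},
]


def summarize(prs):
    """Count PRs per outcome, using the metric names from the slide."""
    counts = Counter()
    for pr in prs:
        # Assumption: every record counts as an opened PR.
        counts["number_prs_opened"] += 1
        if pr["state"] == "closed" and pr["merged"]:
            counts["number_prs_merged"] += 1
        elif pr["state"] == "closed":
            counts["number_prs_closed_without_merge"] += 1
    return dict(counts)


metrics = summarize(pull_requests)
print(metrics)
```

Even a toy version like this forces a definition for each metric, which is where numbers from two tools tend to diverge.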

SLIDE 9

Let us consider

▸ The pipeline failed in production
▸ Shift focus onto issues and comments
▸ GitLab released a new version of its API
▸ I want to analyze other Apache projects too
▸ GitHub produced similar insights and their numbers didn't match mine

SLIDE 10

Been there, done that?

SLIDE 11

Classify the problems

▸ Toil
▸ Cannot scale Data Analytics
▸ Data Discovery
▸ Data Trust
▸ Throw over the boundary
▸ Ambiguous ownership

SLIDE 12

What can we do to solve this?

SLIDE 13

..build tools, infrastructure, frameworks and services — Maxime Beauchemin

SLIDE 14

Design Requirements

SLIDE 16

Single Source of Truth

▸ Standardization
▸ Data Lineage
▸ Empower non-technical folks

SLIDE 17

Easy to consume

▸ Airflow + Other OSS
▸ Ideally pip install awesome-elt-tool
▸ Low barrier to entry for data analytics
▸ Operational creep

SLIDE 18

Promote data integrity

▸ Test the raw data supply
▸ Automated analytics testing

SLIDE 19

Meta Data Engineering

SLIDE 21

Proposed Solution

SLIDE 22

Conceptually

SLIDE 23

ETL vs ELT

▸ Load once and transform
▸ Reduced complexity
▸ Reduce cost
▸ Speed of delivery

SLIDE 24

Validate your source data

SLIDE 25

expect_column_to_exist

expect_table_row_count_to_be_between

expect_table_row_count_to_equal

expect_multicolumn_values_to_be_unique

expect_column_values_to_not_be_null

expect_column_values_to_be_null

expect_column_fancy_statistic_to_be
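
The expectation names above follow the naming style of the Great Expectations library. The sketch below is a plain-Python stand-in for three of them, written only to illustrate the idea of declarative checks on source data. It does not use or reproduce the library's API, and the sample rows and result fields are hypothetical.

```python
# Plain-Python stand-ins for three expectation names from the slide.
# These are NOT the Great Expectations implementations.

rows = [
    {"id": 1, "title": "Fix scheduler bug", "merged_at": "2020-06-01"},
    {"id": 2, "title": "Update docs", "merged_at": None},
]


def expect_column_to_exist(rows, column):
    """Succeed when every row carries the column (trivially true for no rows)."""
    return {"success": all(column in row for row in rows)}


def expect_table_row_count_to_be_between(rows, min_value, max_value):
    """Succeed when the number of rows falls inside the inclusive range."""
    return {
        "success": min_value <= len(rows) <= max_value,
        "observed_value": len(rows),
    }


def expect_column_values_to_not_be_null(rows, column):
    """Succeed when no row has None in the column."""
    unexpected = [row for row in rows if row.get(column) is None]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


results = {
    "id_exists": expect_column_to_exist(rows, "id"),
    "row_count": expect_table_row_count_to_be_between(rows, 1, 100),
    "merged_at_not_null": expect_column_values_to_not_be_null(rows, "merged_at"),
}

for name, result in results.items():
    print(name, result)
```

Each check returns a small result object rather than raising, so a pipeline can decide whether a failure should block the load or only send a notification.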

SLIDE 26

Why?

▸ Profiling
▸ Data Docs <-> Tests
▸ Send notifications automatically

SLIDE 27

Extract - Load

SLIDE 28

Singer - What?

SLIDE 29

tap-github --config tap_config.json | target-postgres --config target_config.json >> state.json
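
The command above pipes a tap into a target. In the Singer specification the two communicate through JSON messages of type SCHEMA, RECORD and STATE, one per line. A minimal in-process sketch of that exchange follows; `toy_tap`, `toy_target`, the `pull_requests` stream and the `last_id` state key are hypothetical names chosen for illustration, and real taps and targets are separate processes joined by a pipe.

```python
import io
import json


def toy_tap(records, stream="pull_requests"):
    """Yield Singer-style JSON lines: one SCHEMA, the RECORDs, then a STATE."""
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": {"properties": {"id": {"type": "integer"}}},
        "key_properties": ["id"],
    })
    for record in records:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": record})
    last_id = max((r["id"] for r in records), default=None)
    yield json.dumps({"type": "STATE", "value": {"last_id": last_id}})


def toy_target(lines, state_out):
    """Collect RECORD messages per stream and write STATE values to state_out."""
    loaded = {}
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            loaded.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state_out.write(json.dumps(message["value"]) + "\n")
    return loaded


state_file = io.StringIO()  # stands in for state.json
loaded = toy_target(toy_tap([{"id": 1}, {"id": 2}]), state_file)
print(loaded)
print(state_file.getvalue().strip())
```

Because both sides agree only on the message format, any tap can in principle be paired with any target.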

SLIDE 30

Singer - Why?

▸ Standardized communication
▸ Incremental out of the box
▸ Documentation
▸ See your data in under 10 mins
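
"Incremental" here refers to extracting only what changed since the last run, using a saved state as a bookmark. A minimal sketch of that pattern, assuming rows carry an ISO-formatted `updated_at` field; the function name, field name and sample data are hypothetical and not part of Singer itself.

```python
def extract_incremental(source_rows, state, key="updated_at"):
    """Return rows newer than the bookmark, plus the advanced state."""
    bookmark = state.get(key)
    new_rows = [
        row for row in source_rows
        if bookmark is None or row[key] > bookmark
    ]
    if new_rows:
        # Advance the bookmark to the newest value seen in this run.
        state = {**state, key: max(row[key] for row in new_rows)}
    return new_rows, state


source = [
    {"id": 1, "updated_at": "2020-07-01"},
    {"id": 2, "updated_at": "2020-07-05"},
]

# First run: no bookmark yet, so everything is extracted.
first_batch, state = extract_incremental(source, state={})

# A new row arrives; the second run picks up only that row.
source.append({"id": 3, "updated_at": "2020-07-09"})
second_batch, state = extract_incremental(source, state)

print(len(first_batch), second_batch, state)
```

ISO date strings sort in chronological order, which is why a plain string comparison is enough in this sketch.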

SLIDE 32

It's a long list

SLIDE 33

Transform

SLIDE 34

DBT - What?

SLIDE 37

DBT - Why?

▸ Modular code

SLIDE 38

DBT - Why?

▸ Modular code
▸ Testing is 1st Class

SLIDE 39

DBT - Why?

▸ Modular code
▸ Testing is 1st Class
▸ Data documentation is 1st Class
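
One way to picture first-class testing: a data test can be expressed as a query that selects the rows violating an assumption, and the test passes when the query returns nothing. The sketch below imitates that idea with Python's built-in sqlite3 module. It is not dbt code, and the table, columns and check names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pull_requests (id INTEGER, author TEXT);
    INSERT INTO pull_requests VALUES (1, 'alice'), (2, 'bob'), (2, NULL);
""")

# Each check is a query that selects the rows breaking an assumption.
checks = {
    "not_null_id": "SELECT * FROM pull_requests WHERE id IS NULL",
    "not_null_author": "SELECT * FROM pull_requests WHERE author IS NULL",
    "unique_id": "SELECT id FROM pull_requests GROUP BY id HAVING COUNT(*) > 1",
}

# A check passes when its query returns no rows.
failures = {name: conn.execute(sql).fetchall() for name, sql in checks.items()}
passed = {name: not rows for name, rows in failures.items()}

for name in checks:
    print(name, "PASS" if passed[name] else f"FAIL {failures[name]}")
```

Keeping the failing rows, rather than only a pass/fail flag, makes it easier to trace a broken number back to its source records.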

SLIDE 40

Great adoption

SLIDE 41

All together

SLIDE 42

Meltano

▸ Open Source, GitLab
▸ Self Hosted

pip3 install meltano
meltano init airflow-analytics-project
cd airflow-analytics-project
meltano add extractor tap-github
meltano add loader target-postgres
meltano add transformer dbt
meltano add transform tap-github
# add env variables
meltano elt tap-github target-postgres --transform=run --job_id=github-to-postgres
meltano add orchestrator airflow
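
Since each source follows the same extractor, loader and job pattern, the run command can be generated from a small config rather than written by hand. A minimal sketch, assuming the `meltano elt` flags shown on this slide; the `PIPELINES` list, the job_id values and the second entry are hypothetical and only illustrate reuse.

```python
import shlex

# Hypothetical pipeline definitions; the second entry is an illustrative
# extra source and is not configured anywhere in the talk.
PIPELINES = [
    {"extractor": "tap-github", "loader": "target-postgres",
     "job_id": "github-to-postgres"},
    {"extractor": "tap-gitlab", "loader": "target-postgres",
     "job_id": "gitlab-to-postgres"},
]


def meltano_elt_command(extractor, loader, job_id):
    """Build the argument list for one `meltano elt` run, mirroring the slide."""
    return [
        "meltano", "elt", extractor, loader,
        "--transform=run", f"--job_id={job_id}",
    ]


commands = [meltano_elt_command(**pipeline) for pipeline in PIPELINES]

for argv in commands:
    # shlex.join produces a shell-safe string, e.g. for a scheduler task.
    print(shlex.join(argv))
```

Adding a new source then means adding one entry to the list, which keeps the per-source code path identical.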

SLIDE 43

Let's look at the code

SLIDE 45

A templated approach

SLIDE 50

Sit back & Relax

SLIDE 51

Some challenges out there

▸ Visualisation/BI layer
▸ Analytics code coverage
▸ Singer community

SLIDE 52

Key Takeaways

▸ Standardized tooling
▸ ELT >> ETL
▸ Great Expectations (GE) + Singer + DBT orchestrated by Airflow

SLIDE 53

Thanks

SLIDE 54

Q & A

SLIDE 55

Resources

▸ Meltano Project
▸ Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin
▸ The Rise of the Data Engineer
▸ The Future of Data Engineering
▸ Downfall of the data engineer

SLIDE 56

Resources

▸ Singer | Open Source ETL
▸ Why we are building an open-source platform for ELT pipelines - Meltano
▸ Dbt Docs
