Building a Big Data DWH: Data Warehousing on Hadoop (Friso van Vollenhoven)



slide-1
SLIDE 1

GoDataDriven

PROUDLY PART OF THE XEBIA GROUP

@fzk frisovanvollenhoven@godatadriven.com

Building a Big Data DWH

Friso van Vollenhoven CTO

Data Warehousing on Hadoop

slide-2
SLIDE 2
“In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.”

- Wikipedia

slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

ETL

slide-7
SLIDE 7
slide-8
SLIDE 8

How to:

  • Add a column to the facts table?
  • Change the granularity of dates from day to hour?
  • Add a dimension based on some aggregation of facts?

slide-9
SLIDE 9

Schemas are designed with questions in mind. Changing a schema requires redoing the ETL.

slide-10
SLIDE 10

Schemas are designed with questions in mind. Changing a schema requires redoing the ETL. Push things to the facts level. Keep all source data available at all times.

slide-11
SLIDE 11
slide-12
SLIDE 12

And now?

  • MPP databases?
  • Faster / better / more SAN?
  • (RAC?)
slide-13
SLIDE 13
slide-14
SLIDE 14

  • distributed storage
  • distributed processing
  • metadata + query engine

slide-15
SLIDE 15
slide-16
SLIDE 16

EXTRACT TRANSFORM LOAD

slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
  • No JVM startup overhead for Hadoop API usage
  • Relatively concise syntax (Python)
  • Mix Python standard library with any Java libs
slide-23
SLIDE 23
slide-24
SLIDE 24
  • Flexible scheduling with dependencies
  • Saves output
  • E-mails on errors
  • Scales to multiple nodes
  • REST API
  • Status monitor
  • Integrates with version control
slide-25
SLIDE 25
slide-26
SLIDE 26

Deployment

git push jenkins master

slide-27
SLIDE 27
  • Scheduling
  • Simple deployment of ETL code
  • Scalable
  • Developer friendly
slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

'februari-22 2013'

slide-34
SLIDE 34
slide-35
SLIDE 35

A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.

slide-36
SLIDE 36
slide-37
SLIDE 37
slide-38
SLIDE 38

þ

slide-39
SLIDE 39

þ

slide-40
SLIDE 40

TSV == thorn separated values?

slide-41
SLIDE 41

þ == 0xFE

slide-42
SLIDE 42
Or -2, in Hive:

CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
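The -2 on this slide is consistent with reading the byte 0xFE as a signed 8-bit value. A minimal Python sketch of that arithmetic, plus splitting a thorn-separated record, follows. The helper names and the sample record are hypothetical, and the sketch assumes the files are encoded as ISO-8859-1 (Latin-1), where þ is the single byte 0xFE.

```python
import struct
from typing import List

THORN_BYTE = 0xFE  # 'þ' as a single byte, assuming ISO-8859-1 (Latin-1) input


def as_signed_byte(value: int) -> int:
    """Interpret an unsigned byte (0..255) as a signed 8-bit integer."""
    return struct.unpack("b", bytes([value]))[0]


def split_thorn_line(raw: bytes) -> List[str]:
    """Split one raw record on the 0xFE byte and decode each field as Latin-1."""
    stripped = raw.rstrip(b"\r\n")
    return [field.decode("latin-1") for field in stripped.split(bytes([THORN_BYTE]))]


# Hypothetical sample record: browser_id and browser separated by a thorn.
signed_delimiter = as_signed_byte(THORN_BYTE)
sample_fields = split_thorn_line(b"42\xfeFirefox\n")
```

Splitting on the raw byte before decoding avoids depending on how a terminal or editor renders the delimiter.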

slide-43
SLIDE 43
slide-44
SLIDE 44
slide-45
SLIDE 45
slide-46
SLIDE 46
slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49
  • The format will change
  • Faulty deliveries will occur
  • Your parser will break
  • Records will be mistakenly produced (over-logging)
  • Other people test in production too (and you get the data from it)
  • Etc., etc.
slide-50
SLIDE 50
  • Simple deployment of ETL code
  • Scheduling
  • Scalable
  • Independent jobs
  • Fixable data store
  • Incremental where possible
  • Metrics
slide-51
SLIDE 51

Independent jobs

  • source (external) to staging (HDFS): HDFS upload + move in place
  • staging (HDFS) to hive-staging (HDFS): MapReduce + HDFS move
  • hive-staging (HDFS) to Hive: Hive map external table + SELECT INTO
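The "upload + move in place" step can be sketched as: write under a temporary name, then rename into the final location, so a downstream job never sees a half-written file. The sketch below is an illustration only. It uses the local filesystem as a stand-in for HDFS, and the function names, the ".incoming-" prefix, and the file name are hypothetical.

```python
import os
import tempfile


def upload_then_move(data: bytes, final_path: str) -> str:
    """Write data under a temporary name, then rename it to final_path in one step."""
    target_dir = os.path.dirname(final_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    # Temporary file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, final_path)  # the "move in place"
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def visible_files(directory: str) -> list:
    """Files a downstream job would pick up: everything except in-flight dot files."""
    return sorted(name for name in os.listdir(directory) if not name.startswith("."))


# Hypothetical staging directory and delivery.
staging_dir = tempfile.mkdtemp()
delivered = upload_then_move(b"1\tFirefox\n", os.path.join(staging_dir, "browsers.tsv"))
```

Because each job only reads files that have already been moved in place, the jobs stay independent: a job can run whether or not the previous one is still busy.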

slide-52
SLIDE 52

Out of order jobs

  • At any point, you don’t really know what ‘made it’ to Hive
  • Will happen anyway, because some days the data delivery is going to be three hours late
  • Or you get half in the morning and the other half later in the day
  • It really depends on what you do with the data
  • This is where metrics + fixable data store help...
slide-53
SLIDE 53

Fixable data store

  • Using Hive partitions
  • Jobs that move data from staging create partitions
  • When new data / insight about the data arrives, drop the partition and re-insert
  • Be careful to reset any metrics in this case
  • Basically: instead of trying to make everything transactional, repair afterwards
  • Use metrics to determine whether data is fit for purpose
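The drop-and-re-insert repair above can be sketched as a small helper that builds the Hive statements for one partition and clears that partition's metrics. This is a sketch under assumptions, not the deck's implementation: the table names `events` and `events_staging`, the partition column `dt`, and the in-memory metrics dictionary are all hypothetical, and the staging table is assumed to hold exactly the corrected rows for that partition.

```python
from typing import Dict, List, Tuple


def repair_partition_statements(table: str, staging_table: str,
                                partition_col: str, partition_value: str) -> List[str]:
    """Build the HiveQL to drop one partition and re-insert it from staging."""
    # Values are interpolated directly, so they must come from trusted job config.
    spec = f"{partition_col}='{partition_value}'"
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec})",
        f"INSERT OVERWRITE TABLE {table} PARTITION ({spec}) "
        f"SELECT * FROM {staging_table}",
    ]


def reset_metrics(metrics: Dict[Tuple[str, str], int], table: str, partition_value: str) -> None:
    """Forget the record count for a partition that is about to be rebuilt."""
    metrics.pop((table, partition_value), None)


# Hypothetical state: record counts per (table, partition).
partition_metrics = {("events", "2013-02-21"): 900, ("events", "2013-02-22"): 1000}
statements = repair_partition_statements("events", "events_staging", "dt", "2013-02-22")
reset_metrics(partition_metrics, "events", "2013-02-22")
```

Generating the statements per partition keeps the repair small: only the affected day is rewritten, and the rest of the table is untouched.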

slide-54
SLIDE 54

Metrics

slide-55
SLIDE 55

Metrics service

  • Job ran, so many units processed, took so much time
  • e.g. 10GB imported, took 1 hr
  • e.g. 60M records transformed, took 10 minutes
  • Dropped partition
  • Inserted X records into partition
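A metric of the kind listed above could be modelled as a small record with a derived throughput. The sketch below uses the two examples from the slide. The class name, field names, and the tolerance-based "fit for purpose" check are assumptions for illustration; the deck does not describe how the metrics service stores or evaluates its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobMetric:
    job: str          # which job ran
    units: float      # how many units it processed
    unit_name: str    # what a unit is, e.g. "GB" or "records"
    seconds: float    # how long the run took

    def rate(self) -> float:
        """Units processed per second."""
        return self.units / self.seconds


def fit_for_purpose(metric: JobMetric, expected_units: float, tolerance: float = 0.1) -> bool:
    """Hypothetical check: is the processed volume within tolerance of what was expected?"""
    return abs(metric.units - expected_units) <= tolerance * expected_units


# The two examples from the slide.
import_metric = JobMetric("import", 10, "GB", 3600)
transform_metric = JobMetric("transform", 60_000_000, "records", 600)
```

A check like `fit_for_purpose(transform_metric, 58_000_000)` is one way to flag a day whose delivery arrived only in part, which is when the drop-and-re-insert repair would be triggered.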
slide-56
SLIDE 56

GoDataDriven

We’re hiring / Questions? / Thank you!

@fzk frisovanvollenhoven@godatadriven.com Friso van Vollenhoven CTO