

SLIDE 1

Routing Trillions of Events Per Day @Twitter

#ApacheBigData 2017

Lohit VijayaRenu & Gary Steelman @lohitvijayarenu @efsie

SLIDE 2

In this talk

1. Event Logs at Twitter
2. Log Collection
3. Log Processing
4. Log Replication
5. The Future
6. Questions

SLIDE 3

Overview

SLIDE 4

Life of an Event

  • Clients log events, specifying a Category name, e.g. ads_view, login_event, ...
  • Events are grouped together across all clients into the Category
  • Events are stored on the Hadoop Distributed File System, bucketed every hour into separate directories (see the sketch below)
    ○ /logs/ads_view/2017/05/01/23
    ○ /logs/login_event/2017/05/01/23

[Diagram: Clients and HTTP clients → Client Daemons / HTTP Endpoint → events aggregated by Category → Storage (HDFS)]
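To make the hourly bucketing concrete, here is a minimal sketch (not from the talk) of how a category name and event timestamp could be mapped to the hourly HDFS directory shown above; the helper name and layout are assumptions for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EventBucketing {
    // Hourly bucket layout from the slides: /logs/<category>/yyyy/MM/dd/HH
    private static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    /** Hypothetical helper: map a category and event time to its HDFS directory. */
    static String hdfsDirFor(String category, Instant eventTime) {
        return "/logs/" + category + "/" + HOUR_BUCKET.format(eventTime);
    }

    public static void main(String[] args) {
        // Prints /logs/ads_view/2017/05/01/23 for an event logged in that hour.
        Instant t = Instant.parse("2017-05-01T23:15:00Z");
        System.out.println(hdfsDirFor("ads_view", t));
    }
}
```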

SLIDE 5

Event Log Stats

>1T    events a day, across millions of clients
~3PB   of data a day, incoming uncompressed
<1500  nodes, collocated with HDFS datanodes
>600   categories (events grouped by category)

SLIDE 6

Event Log Architecture

[Diagram: Clients (local log collection daemon) and remote clients (HTTP) → log events aggregated and grouped by Category → Storage (HDFS) → Log Processor → Storage (HDFS) → Log Replicator → Storage (HDFS) and Storage (Streaming), all inside the data center]


SLIDE 11

Event Log Architecture

[Diagram: Events land in RT Storage (HDFS) inside DC1 and DC2, then fan out to DW Storage (HDFS), Prod Storage (HDFS), and Cold Storage (HDFS) clusters across data centers]


SLIDE 14

Collection

SLIDE 15

Event Log Architecture

[Architecture diagram as on Slide 6]

SLIDE 16

Event Collection Overview

Past:    Scribe Client Daemon → Scribe Aggregator Daemons
Present: Scribe Client Daemon → Flume Aggregator Daemon
Future:  Flume Client Daemon → Flume Aggregator Daemon

SLIDE 17

Event Collection - Past

Challenges with Scribe

  • Too many open file handles to HDFS
    ○ 600 categories x 1500 aggregators x 6 per hour ≈ 5.4M files per hour
  • High IO wait on DataNodes at scale
  • Max limit on throughput per aggregator
  • Difficult to track message drops
  • No longer active open source development
SLIDE 18

Event Collection - Present

Apache Flume

  • Well defined interfaces (example below)
  • Open source
  • Concept of transactions
  • Existing implementations of interfaces

[Diagram: Flume Agent = Source → Channel → Sink; Client feeds the Source, the Sink writes to HDFS]
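As a small illustration of those interfaces, the sketch below (not from the talk) builds a Flume Event carrying a category header using Flume's EventBuilder API; the header key name is an assumption for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class CategoryEventExample {
    public static void main(String[] args) {
        // The header key "category" is a hypothetical convention for this sketch.
        Map<String, String> headers = new HashMap<>();
        headers.put("category", "ads_view");

        // EventBuilder is part of Flume's public API; the body is the raw event payload.
        Event event = EventBuilder.withBody(
                "user=123 action=view".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(event.getHeaders());  // {category=ads_view}
        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
    }
}
```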

SLIDE 19

Event Collection - Present

Category Group

  • Combine multiple related categories into a category group
  • Provide different properties per group
  • Contains multiple events to generate fewer combined sequence files

[Diagram: Agents 1-3 each send Categories 1-3, which are combined into a single Category Group]

SLIDE 20

Event Collection - Present

Aggregator Group

  • A set of aggregators hosting the same set of category groups
  • Easy to manage a group of aggregators hosting a subset of categories

[Diagram: Agents 1-8 send category groups (Group 1, Group 2) to Aggregator Group 1 and Aggregator Group 2]

SLIDE 21

Event Collection - Present

Flume features to support groups

  • Extend Interceptor to multiplex events into groups (a sketch follows below)
  • Implement Memory Channel Group to have a separate memory channel per category group
  • ZooKeeper registration per category group for service discovery
  • Metrics for category groups
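A minimal sketch of what such an interceptor could look like, assuming a hypothetical category→group mapping and a "category" header set by the client; this is illustrative, not Twitter's actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/** Tags each event with a "categoryGroup" header derived from its "category" header. */
public class CategoryGroupInterceptor implements Interceptor {
    // Hypothetical static mapping; in practice this would come from configuration.
    private final Map<String, String> categoryToGroup = new HashMap<>();

    @Override
    public void initialize() {
        categoryToGroup.put("ads_view", "ads_group");
        categoryToGroup.put("ads_click", "ads_group");
        categoryToGroup.put("login_event", "login_group");
    }

    @Override
    public Event intercept(Event event) {
        String category = event.getHeaders().get("category");
        String group = categoryToGroup.getOrDefault(category, "default_group");
        event.getHeaders().put("categoryGroup", group);
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() {}

    /** Required builder so Flume can instantiate the interceptor from configuration. */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new CategoryGroupInterceptor();
        }

        @Override
        public void configure(Context context) {}
    }
}
```

Downstream, a multiplexing channel selector (or the per-group memory channels mentioned above) can route events based on the categoryGroup header.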
SLIDE 22

Event Collection - Present

Flume performance improvements

  • HDFSEventSink batching increased throughput (5x), reducing spikes on the memory channel (see the sketch below)
  • Implement buffering in HDFSEventSink instead of using SpillableMemoryChannel
  • Stream events close to network speed
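To show why batching inside the sink helps, here is a generic sketch (assumed, not the actual HDFSEventSink code) of a Flume sink draining many events per channel transaction, so the per-transaction overhead is amortized over a batch.

```java
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

/** Illustrative sink that writes events in batches of up to batchSize per transaction. */
public class BatchingSinkSketch extends AbstractSink {
    private final int batchSize = 1000; // larger batches amortize transaction overhead

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            int taken = 0;
            for (; taken < batchSize; taken++) {
                Event event = channel.take();
                if (event == null) {
                    break; // channel drained for now
                }
                write(event); // hypothetical helper that appends to an HDFS file
            }
            tx.commit();
            return taken == 0 ? Status.BACKOFF : Status.READY;
        } catch (Exception e) {
            tx.rollback();
            throw new EventDeliveryException("Failed to deliver batch", e);
        } finally {
            tx.close();
        }
    }

    private void write(Event event) {
        // Placeholder for the actual HDFS write; omitted in this sketch.
    }
}
```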
SLIDE 23

Processing

SLIDE 24

Event Log Architecture

[Architecture diagram as on Slide 6]

SLIDE 25

Log Processor Stats - Processing Trillion Events per Day

>1PB    of data per day: output of cleaned, compressed, consolidated, and converted Flume sequence files
20-50%  disk space saved by processing
8       wall clock hours to process one day of data

SLIDE 26

Log Processor Needs - Processing Trillion Events per Day

  • Make processing log data easier for analytics teams
  • Disk space is at a premium on analytics clusters
  • Too many files still cause increased pressure on the NameNode
  • Log data is read many times, and different teams all perform the same pre-processing steps on the same data sets

SLIDE 27

Log Processor Steps - Datacenter 1

[Diagram: Category Groups (ads_group/yyyy/mm/dd/hh, login_group/yyyy/mm/dd/hh) → Demux Jobs (ads_group_demuxer, login_group_demuxer) → Categories (ads_click/yyyy/mm/dd/hh, ads_view/yyyy/mm/dd/hh, login_event/yyyy/mm/dd/hh)]


SLIDE 30

Log Processor Steps

1. Decode: Base64 encoding from logged data (a per-record sketch follows below)
2. Demux: Category groups into individual categories for easier consumption by analytics teams
3. Clean: Corrupt, empty, or invalid records so data sets are more reliable
4. Compress: Logged data to the highest level to save disk space (from LZO level 3 to LZO level 7)
5. Consolidate: Small files to reduce pressure on the NameNode
6. Convert: Some categories into Parquet for fastest use in ad-hoc exploratory tools
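A rough sketch of the per-record portion of these steps (decode, demux by category, drop bad records). It is illustrative only, uses a made-up record format instead of the real Thrift objects, and omits compression, consolidation, and Parquet conversion, which happen at the file level.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

public class LogProcessorSketch {
    /** Minimal stand-in for a deserialized event: category name plus payload bytes. */
    record EventRecord(String category, byte[] payload) {}

    /** Steps 1-3 for a single record: decode Base64, demux by category, drop bad records. */
    static Optional<EventRecord> processRecord(String base64Line) {
        byte[] raw;
        try {
            raw = Base64.getDecoder().decode(base64Line);   // Step 1: decode Base64
        } catch (IllegalArgumentException corrupt) {
            return Optional.empty();                        // Step 3: drop corrupt records
        }
        if (raw.length == 0) {
            return Optional.empty();                        // Step 3: drop empty records
        }
        // Step 2: demux. Here we pretend the payload starts with "<category>\t"; the real
        // pipeline deserializes a Thrift object and reads the category from it.
        String text = new String(raw, StandardCharsets.UTF_8);
        int tab = text.indexOf('\t');
        if (tab <= 0) {
            return Optional.empty();                        // Step 3: drop invalid records
        }
        String category = text.substring(0, tab);
        byte[] payload = text.substring(tab + 1).getBytes(StandardCharsets.UTF_8);
        // Downstream, records are written to /logs/<category>/yyyy/mm/dd/hh and the
        // resulting files are compressed, consolidated, and optionally converted to Parquet.
        return Optional.of(new EventRecord(category, payload));
    }
}
```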

SLIDE 31

Why Base64 Decoding? - Legacy Choices

  • Scribe's contract amounts to sending a binary blob to a port
  • Scribe used newline characters to delimit records in a binary blob batch
  • Valid records may include newline characters
  • Scribe Base64-encoded received binary blobs to avoid confusion with the record delimiter
  • Base64 encoding is no longer necessary because we have moved to one serialized Thrift object per binary blob

SLIDE 32

Log Demux Visual

[Diagram: /raw/ads_group/yyyy/mm/dd/hh/ads_group_1.seq → DEMUX → /logs/ads_click/yyyy/mm/dd/hh/1.lzo and /logs/ads_view/yyyy/mm/dd/hh/1.lzo]


SLIDE 35

Log Processor Daemon

  • One log processor daemon per RT Hadoop cluster, where Flume aggregates logs
  • Primarily responsible for demuxing category groups out of the Flume sequence files
  • The daemon schedules Tez jobs every hour for every category group in a thread pool (a scheduling sketch follows below)
  • Daemon atomically presents processed category instances so partial data can't be read
  • Processing proceeds according to criticality of data, or "tiers"
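A minimal sketch of the hourly scheduling pattern described above, assuming a hypothetical submitDemuxJob() that launches the Tez job for one category group and one hour; the class and method names are illustrative, not the actual daemon.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LogProcessorDaemonSketch {
    private final ScheduledExecutorService clock = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService jobPool = Executors.newFixedThreadPool(8); // one job per thread
    private final List<String> categoryGroups = List.of("ads_group", "login_group");

    void start() {
        // Once an hour, submit one demux job per category group for the previous hour.
        clock.scheduleAtFixedRate(() -> {
            Instant hour = Instant.now().truncatedTo(ChronoUnit.HOURS).minus(1, ChronoUnit.HOURS);
            for (String group : categoryGroups) {
                jobPool.submit(() -> submitDemuxJob(group, hour));
            }
        }, 0, 1, TimeUnit.HOURS);
    }

    private void submitDemuxJob(String categoryGroup, Instant hour) {
        // Placeholder: launch the Tez demux DAG for this category group and hour, then
        // atomically rename the output into its final location once the job succeeds.
    }
}
```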

SLIDE 36

Why Tez?

  • Some categories are significantly larger than other categories (KBs vs TBs)
  • MapReduce demux? Each reducer handles a single category
  • Streaming demux? Each spout or channel handles a single category
  • Massive skew in partitioning by category causes long running tasks, which slows down job completion time (see the sketch below)
  • Relatively well understood fault tolerance semantics, similar to MapReduce, Spark, etc.
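To illustrate the skew problem, here is a tiny sketch (with assumed sizes, not real numbers) of partitioning by category with a fixed hash partitioner: every record of a huge category lands in the same partition, so one task dominates the job's wall clock time.

```java
import java.util.HashMap;
import java.util.Map;

public class CategorySkewSketch {
    public static void main(String[] args) {
        // Hypothetical category sizes in bytes: one tiny, one huge.
        Map<String, Long> categoryBytes = Map.of(
                "login_event", 64L * 1024,                  // ~64 KB
                "ads_view", 2L * 1024 * 1024 * 1024 * 1024  // ~2 TB
        );

        int numPartitions = 100;
        Map<Integer, Long> partitionBytes = new HashMap<>();
        for (Map.Entry<String, Long> e : categoryBytes.entrySet()) {
            // Fixed hash partitioning: the whole category maps to exactly one partition.
            int partition = Math.floorMod(e.getKey().hashCode(), numPartitions);
            partitionBytes.merge(partition, e.getValue(), Long::sum);
        }

        // One partition ends up with ~2 TB while the rest are nearly empty; the task that
        // owns it runs for hours. Tez's dynamic partitioning (TEZ-3209) splits such a
        // partition further at runtime so several tasks share the large category.
        partitionBytes.forEach((p, bytes) ->
                System.out.printf("partition %d -> %d bytes%n", p, bytes));
    }
}
```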

SLIDE 37

Dynamic Partitioning

  • Tez's dynamic hash partitioner adjusts partitions at runtime if necessary, allowing large partitions to be further partitioned so that multiple tasks, rather than one task, process events for a single category
    ○ More info at TEZ-3209
    ○ Thanks to team member Ming Ma for the contribution!
  • Easier horizontal scaling while simultaneously providing more predictable processing times

SLIDE 38

Typical Partitioning Visual

[Diagram: Input File 1 partitioned across Tasks 1-3]

SLIDE 39

Hash Partitioning Visual

[Diagram: Input File 1 hash-partitioned across Tasks 1-5]

SLIDE 40

Replication

SLIDE 41

Event Log Architecture

[Architecture diagram as on Slide 6]

SLIDE 42

Log Replication Stats - Replicating Trillion Events per Day

>24k  copy jobs per day, across all analytics clusters
>1PB  of data per day, replicated to analytics clusters
~10   analytics clusters

SLIDE 43

Log Replication Needs - Processing Trillion Events per Day

  • Collocate logs with compute and disk capacity for analytics teams
    ○ Cross-data center reads are incredibly expensive
    ○ Cross-rack reads within a data center are still expensive
  • Critical data set copies and backups in case of data center failures

SLIDE 44

Log Replication Visual

[Diagram: per-category hourly partitions (ads_click/yyyy/mm/dd/hh, ads_view/yyyy/mm/dd/hh, login_event/yyyy/mm/dd/hh) copied from Datacenter 1 to Datacenter N by per-category replication jobs (ads_click_repl, ads_view_repl, login_event_repl)]


SLIDE 55

Log Replication Steps - Distributing Trillion Events per Day

1. Copy: Logged data from all processing clusters to the target cluster
2. Merge: Copied data into one directory
3. Present: Data atomically by renaming it to an accessible location (a sketch follows below)
4. Publish: Metadata to the Data Abstraction Layer to notify analytics teams data is ready for consumption
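A minimal sketch of the "Present" step, assuming data is first written to a hidden staging directory and then exposed with a single HDFS rename, which is atomic within one NameNode namespace; the paths and method names are illustrative, not the actual replicator code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicPresentSketch {
    /**
     * Copy/merge jobs write into a staging directory; once complete, a single rename
     * makes the whole hourly partition visible to readers, so partial data is never seen.
     */
    static void present(FileSystem fs, String category, String hourPath) throws IOException {
        Path staging = new Path("/logs/.staging/" + category + "/" + hourPath);
        Path finalDir = new Path("/logs/" + category + "/" + hourPath);

        fs.mkdirs(finalDir.getParent());
        if (!fs.rename(staging, finalDir)) {  // atomic within a single HDFS namespace
            throw new IOException("Failed to present " + finalDir);
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        present(fs, "ads_view", "2017/05/01/23");
    }
}
```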

SLIDE 56

Log Replicator Daemon

  • One log replicator daemon per DW, PROD, or COLD Hadoop cluster, where analytics users run queries and jobs
  • Primarily responsible for copying category partitions out of the RT Hadoop clusters
  • The daemons schedule Hadoop DistCp jobs every hour for every category
  • Daemon atomically presents processed category instances so partial data can't be read
  • Replication proceeds according to criticality of data, or "tiers"

SLIDE 57

The Future

SLIDE 58

Future of Log Management

  • Flume client improvements for tracing, SLA, and throttling
  • Flume client support for message validation before logging
  • Centralized configuration management
  • Processing and replication every 10 minutes instead of every hour
SLIDE 59

Thank You

Questions?