Taskerman: A Distributed Cluster Task Manager (Raghavendra D Prabhu)



slide-1
SLIDE 1

Raghavendra D Prabhu rprabhu@yelp.com @randomsurfer Distributed Systems

Taskerman

A Distributed Cluster Task Manager

slide-2
SLIDE 2

Yelp’s Mission

Connecting people with great local businesses.

slide-3
SLIDE 3

Datastore Ecosystem @

slide-4
SLIDE 4

Cassandra Elasticsearch Zookeeper PostgreSQL

slide-5
SLIDE 5

  • Memcached
  • Redis
  • Spark
  • Redshift
  • DynamoDB
  • PaaStorm
  • S3

And many more...

slide-6
SLIDE 6
  • Several TB in Cassandra clusters with tens of nodes each
  • Close to a million messages/second in streaming pipeline
  • Several TB in Elasticsearch with several hundred nodes in each

  • Many PB archived to S3 every month
  • Multi-AZ Multi-Region
  • And growing…

Distributed Systems

slide-7
SLIDE 7
slide-8
SLIDE 8

“Need to run logical backup on a fleet without disruption to ingress traffic”
“Run anti-entropy repair on Cassandra cluster without spiking read latency”
“Reboot 1000 instances without taking a millennium, but without bringing down the site either”
“Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”

slide-9
SLIDE 9

Pet vs Cattle

slide-10
SLIDE 10

  • Maintenance Cost
  • Engineering Efficiency
  • Scalability

slide-11
SLIDE 11

Taskerman

slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
  • Safe
  • Security
  • Generic and Extensible
  • Distributed
  • Loosely coupled
  • Cluster awareness

Requirements

slide-16
SLIDE 16
  • Schedulable
  • Reusable
  • Auditability

○ Not Ad-hoc
○ More Declarative, Less Imperative
○ Config Management

  • Maintainability
  • Observability
  • Resilience

Desirable

slide-17
SLIDE 17
  • Paramount*
  • Serialized execution

○ ‘m’ out of ‘n’
○ Disjoint jobs.

  • Avoid cascade
  • Privilege escalation
  • Pull-based

* Unless oncall is automated too.

Safety

slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
  • Network is reliable
  • Latency is zero
  • Bandwidth is infinite
  • Network is secure
  • One administrator
  • Transport cost is zero
  • Network is homogeneous
  • Topology doesn't change

Fallacies of Distributed Systems

slide-21
SLIDE 21

Quotes

“There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek

“There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery” @mathiasverraes
slide-22
SLIDE 22
  • Scheduler
  • Router
  • Co-ordinator
  • Transport
  • Executor
  • Error handler
  • Configuration
  • Monitoring
  • Tooling

Building Blocks

slide-23
SLIDE 23

[Architecture diagram. Labels: Task Scheduler, Workqueue, Router, Cluster Node Queues (Q1, Q2, Q3), tasks (T1, T2, T3), Retries, Lease, Failure, Dead Letter Queue, Zookeeper, EC2 API. Arrows show the flow of a task.]

slide-24
SLIDE 24

# Anatomy of a Taskerman Task
# Restart action for 2 nodes of geo_counter
# cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': 'abcd-ef123',

slide-25
SLIDE 25

# Anatomy of a Taskerman Task
    'taskerman_params': {
        'action_args': {'force': true},
        'workqueue_args': {'retry_count': 3},
    },
    'nodes': [],
    'destnode': '',
}
# force=true for restart, retry_count for queue
# [a,b,c,d] to skip discovery
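
Because a task is a plain mapping, a router or runner could check its shape before acting on it. The sketch below is illustrative only and is not Taskerman's code: the `validate_task` helper, the `REQUIRED_KEYS` set, and the specific checks are assumptions derived from the fields shown on the two slides.

```python
# Hypothetical validator for the task dictionary shown on the slides.
# The required keys and the checks are assumptions, not Taskerman's API.
REQUIRED_KEYS = {
    "action", "version", "limit", "cluster_name",
    "discovery", "owner", "task_id",
}


def validate_task(task):
    """Return a list of problems found in a task dict (empty list means OK)."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(task))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    action = task.get("action", "")
    # Actions on the slides look like '<task module>:<action name>'.
    if action.count(":") != 1 or not all(action.split(":")):
        problems.append("action must look like 'module:name'")
    limit = task.get("limit")
    if not isinstance(limit, int) or limit < 1:
        problems.append("limit must be a positive integer")
    return problems


example_task = {
    "action": "cassandra_task:restart",
    "version": 1.2,
    "limit": 2,
    "cluster_name": "cassandra:geo_counter",
    "discovery": "aws_tags",
    "owner": "gsi",
    "task_id": "abcd-ef123",
    "taskerman_params": {
        "action_args": {"force": True},
        "workqueue_args": {"retry_count": 3},
    },
    "nodes": [],      # empty list: let discovery find the nodes
    "destnode": "",
}
```

Rejecting a malformed task early keeps it from reaching a node at all, which fits the safety-first requirements listed earlier.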

slide-26
SLIDE 26
slide-27
SLIDE 27


slide-28
SLIDE 28
  • Runs on Chronos
  • Emits a task
  • Enqueues into global queue
  • Ad-hoc invocation
  • Deployment granularities
  • Task tracking
  • Yelpsoa-configs

Task Scheduler

PaaSTA

slide-29
SLIDE 29


slide-30
SLIDE 30
  • AWS SQS
  • Best-effort FIFO
  • Reliable and cheap
  • Low latency
  • Properties

○ Read without delete
○ Visibility timeout
○ Retry
○ Dead Letter Queue

WorkQueue

AWS SQS
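
The four queue properties above can be illustrated without AWS. The following is a toy in-memory model, not the SQS API and not Taskerman's code: the class name `InMemoryWorkQueue`, its methods, and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are all assumptions for illustration.

```python
import collections


class InMemoryWorkQueue:
    """Toy model of the queue properties listed above (not the SQS API)."""

    def __init__(self, visibility_timeout=30.0, max_receives=3):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives      # plays the role of retry_count
        # message id -> [body, times received, invisible until]
        self._messages = collections.OrderedDict()
        self.dead_letters = []                # the "dead letter queue"
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = [body, 0, 0.0]
        return self._next_id

    def receive(self, now):
        """Read without delete: hide the message for visibility_timeout."""
        for msg_id, entry in list(self._messages.items()):
            body, receives, invisible_until = entry
            if now < invisible_until:
                continue                      # still held by another reader
            if receives >= self.max_receives:
                # Retries exhausted: move the message to the dead letter queue.
                del self._messages[msg_id]
                self.dead_letters.append(body)
                continue
            entry[1] = receives + 1
            entry[2] = now + self.visibility_timeout
            return msg_id, body
        return None

    def delete(self, msg_id):
        """Called by the consumer only after the task succeeded."""
        self._messages.pop(msg_id, None)
```

A consumer that crashes never calls `delete`, so the message reappears after the visibility timeout and is retried; a message that keeps failing ends up in `dead_letters` for a human to inspect.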

slide-31
SLIDE 31


slide-32
SLIDE 32
  • Stateless Marathon worker
  • Routes tasks to clusters
  • Custom routing logic
  • At-least-once delivery
  • ‘DNS’ of Taskerman
  • Pluggable discovery

○ AWS
○ Smartstack

Task Router

PaaSTA
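
Pluggable discovery can be pictured as a registry of lookup functions keyed by the task's `discovery` field. The sketch below is an assumption-heavy illustration, not the router's real logic: the registry, the `static` backend with its made-up host names, and the choice of one queue per cluster are all hypothetical. A real backend would query AWS tags or Smartstack.

```python
# Hypothetical sketch: discovery backends are plain callables registered by name.
DISCOVERY_BACKENDS = {}


def discovery(name):
    """Decorator that registers a discovery function under `name`."""
    def register(func):
        DISCOVERY_BACKENDS[name] = func
        return func
    return register


@discovery("static")
def static_discovery(cluster_name):
    # Stand-in for a real lookup (AWS tags, Smartstack); the hosts are made up.
    inventory = {"cassandra:geo_counter": ["node-a", "node-b", "node-c"]}
    return inventory.get(cluster_name, [])


def route(task, cluster_queues):
    """Resolve the task's target nodes and enqueue it on its cluster's queue."""
    # An explicit node list skips discovery, as the task anatomy slide notes.
    nodes = task.get("nodes") or DISCOVERY_BACKENDS[task["discovery"]](
        task["cluster_name"]
    )
    routed = dict(task, nodes=list(nodes))    # copy; the input task is untouched
    cluster_queues.setdefault(task["cluster_name"], []).append(routed)
    return routed
```

Keeping the router stateless, as the slide describes, means it can be restarted or scaled freely; everything it needs is in the task and the discovery backend.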

slide-33
SLIDE 33


slide-34
SLIDE 34
  • The executor of Taskerman
  • Dequeues a task and executes it

○ Pre-defined reviewed code.

  • Cron-ed on node
  • Zookeeper for coordination
  • Task deleted upon success
  • Dead letter queue upon failed retries

TaskRunner

slide-35
SLIDE 35

class TestTaskRunner(TaskRunner):
    def __init__(self, task,..):
        # State mgmt and datastore specific
    def pre_check(self):
        # Is the task safe to execute on this cluster
    def execute_action(self):
        # Actual execution of task:action
    def post_check(self):
        # cluster good after execution or is it on fire
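
One way the three hooks on this slide could be driven is a small template method on the base class. Only the hook names `pre_check`, `execute_action`, and `post_check` come from the slide; the `run` method, its return values, and the `RestartRunner` example are assumptions made for illustration.

```python
class TaskRunner:
    """Hypothetical base class; only the three hook names come from the slide."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True

    def run(self):
        """Return 'skipped', 'ok' or 'failed' so the caller can delete or retry."""
        if not self.pre_check():
            return "skipped"       # not safe right now; leave the task queued
        self.execute_action()
        return "ok" if self.post_check() else "failed"


class RestartRunner(TaskRunner):
    """Toy subclass: records a restart instead of touching a real service."""

    def __init__(self, task, cluster_healthy=True):
        super().__init__(task)
        self.cluster_healthy = cluster_healthy
        self.restarted = False

    def pre_check(self):
        return self.cluster_healthy

    def execute_action(self):
        self.restarted = True      # a real runner would restart the service here

    def post_check(self):
        return self.restarted
```

With this shape, a caller would delete the task from the queue only on `'ok'` and otherwise let the queue's retry and dead-letter handling take over.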

slide-36
SLIDE 36


slide-37
SLIDE 37
slide-38
SLIDE 38
  • Distributed Coordinator
  • Non Blocking Lease

○ Time-based lease
○ Global lease

  • Ephemeral locks
  • Atomic Counters

○ Statistics
○ Circuit breaker

Zookeeper
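
A non-blocking, time-based lease that admits at most 'm' of 'n' nodes can be sketched in a few lines. This is an in-memory stand-in so the idea is visible without a Zookeeper ensemble; it is not how Taskerman stores leases. The `LeaseTable` class, its method names, and the explicit `now` argument are assumptions for illustration.

```python
class LeaseTable:
    """In-memory stand-in for a coordinator; a real one would live in Zookeeper."""

    def __init__(self, limit, ttl):
        self.limit = limit        # at most `limit` holders at once ('m' out of 'n')
        self.ttl = ttl            # seconds before a lease expires on its own
        self._holders = {}        # node -> expiry time

    def try_acquire(self, node, now):
        """Non-blocking: answer True or False immediately, never wait."""
        # Drop expired leases so a crashed holder cannot block others forever.
        self._holders = {n: exp for n, exp in self._holders.items() if exp > now}
        if node in self._holders or len(self._holders) < self.limit:
            self._holders[node] = now + self.ttl
            return True
        return False

    def release(self, node):
        self._holders.pop(node, None)
```

A node that fails to get a lease simply leaves the task on the queue and tries again on its next cron run, which is what makes the non-blocking form a natural fit for a pull-based runner.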

slide-39
SLIDE 39
  • Staleness

○ Nodes can go down

  • Garbage collection

○ Cleanup of ZK data structures

  • Composition
  • Starvation
  • Uptime

Zookeeper: Challenges

slide-40
SLIDE 40
slide-41
SLIDE 41
  • Puppet
  • Terraform
  • Yelpsoa-configs
  • PaaSTA
  • Jenkins
  • AWS Lambda

Deployment

PaaSTA

slide-42
SLIDE 42
slide-43
SLIDE 43
  • Multiple vectors of failure
  • Idempotency
  • Pessimistic approach

○ Job retry

  • Separation of state
  • Mutability
  • Highly available components
  • Circuit breakers

Failure handling
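
The circuit-breaker idea from the list above can be sketched as a failure counter that stops further work once a threshold is hit. This is a generic illustration under stated assumptions, not Taskerman's implementation: the class, its parameters, and the in-process counter are hypothetical (the Zookeeper slide suggests the real counters are atomic counters held in Zookeeper).

```python
class CircuitBreaker:
    """Refuse new work once consecutive failures reach a threshold."""

    def __init__(self, max_failures, reset_after):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds the breaker stays open
        self.failures = 0
        self.opened_at = None            # None means the breaker is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success, now):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now
```

Checking `allow` before each execution is one way to avoid the cascade the Safety slide warns about: a task that breaks two nodes in a row is not given a third.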

slide-44
SLIDE 44

Debugging

slide-45
SLIDE 45
  • Heartbeat ping

○ End-to-end monitoring

  • Dead Letter Queue

○ Recycle bin of failed tasks.
○ Hooks into human side of monitoring

  • Status check

Failure detection
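
A heartbeat check reduces to comparing the last time each node reported in against a maximum age. The helper below is a minimal sketch, assuming heartbeats are recorded as timestamps per node; the function name, the data shape, and the node names in the usage note are hypothetical, and the slides do not say how Taskerman records heartbeats.

```python
def stale_nodes(last_heartbeat, now, max_age):
    """Return, sorted, the nodes whose last heartbeat is older than max_age seconds."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > max_age
    )
```

For example, `stale_nodes({"node-a": 100, "node-b": 40}, now=120, max_age=60)` flags only `node-b`, which could then feed the alerting described on the Monitoring slide.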

slide-46
SLIDE 46
slide-47
SLIDE 47
  • End-to-end logging

○ Un/structured

  • Metrics

○ Counters
○ Queue lengths

  • Aggregation and dashboards
  • Staleness checks
  • Dead Letter Queue
  • Multi-modal Alerting

Monitoring

slide-48
SLIDE 48
  • Restarts
  • Reboots
  • Instance Replacement
  • Integration tests
  • Kafka config reload
  • Failure injection
  • Backup and restore
  • Search indexing
  • .. and many more.

Use cases

slide-49
SLIDE 49
  • Safety
  • Cassandra
  • Elasticsearch
  • Common issues
  • Constraints

○ Limit
○ Healthcheck
○ Mutual exclusion

Scheduled Backups

slide-50
SLIDE 50

Secure Infrastructure

$ uptime
 06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017

slide-51
SLIDE 51

www.yelp.com/careers/

We're Hiring!

slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54
slide-55
SLIDE 55

  • @YelpEngineering
  • fb.com/YelpEngineers
  • engineeringblog.yelp.com
  • github.com/yelp

slide-56
SLIDE 56

Q & A

  • Slides will also be uploaded to slideshare.net/slidunder.

slide-57
SLIDE 57

Q & A

❖ Q: What challenges remain with Taskerman?
➢ A:
❖ Q: …
➢ A: …

slide-58
SLIDE 58

slide-61
SLIDE 61
  • https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
  • http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
  • https://martinfowler.com/bliki/TwoHardThings.html
  • https://zookeeper.apache.org/
  • https://www.terraform.io/
  • https://github.com/Yelp/service-principles
  • https://en.wikipedia.org/wiki/Law_of_Demeter

Further Reading