THE DISTRIBUTED PIT OF SUCCESS, Greg Beech (slide presentation)



slide-1
SLIDE 1
slide-2
SLIDE 2

THE DISTRIBUTED PIT OF SUCCESS

Greg Beech

slide-3
SLIDE 3

ABOUT ME

Lead Engineer @ Deliveroo — joined March 2015

  • Tech lead for international expansion
  • Set up Deliveroo for Business
  • Currently rebuilding our Live Operations tooling
PAST
  • Head of Platform Development @ blinkbox Books
  • Principal Engineer @ blinkbox Movies
  • Test Engineer @ Microsoft
slide-4
SLIDE 4

ABOUT DELIVEROO

slide-5
SLIDE 5

RESTAURANT-QUALITY FOOD TO YOUR HOME

FOUNDED 2013

[Images: THEN, NOW]

slide-6
SLIDE 6

FUNDING RAISED

$475M

slide-7
SLIDE 7

DAILY ORDERS

[Chart: daily orders by year, 2013 to 2017]

slide-8
SLIDE 8

12 COUNTRIES 150 CITIES

slide-9
SLIDE 9

ENGINEERS

[Chart: number of engineers by year, 2013 to 2017]

  • 600,000 SLOC
  • 38,000 commits
  • 6,900 pull requests
  • 3,200 deploys

slide-10
SLIDE 10

TECHNOLOGY CHALLENGES

slide-11
SLIDE 11

ARCHITECTURE

slide-12
SLIDE 12

APP SERVERS

[Chart: number of app servers by year, 2015 to 2017, with annotations:]
  • Literally one server
  • Biggest. Heroku. App. Ever.
  • Heroku app limit

slide-13
SLIDE 13

DEGRADING PERFORMANCE

  • General purpose models sub-optimal in most cases
  • Caching difficult due to geo, availability, timings, etc.
  • Long dyno boot time makes auto-scaling slow
  • Constrained to Ruby on Rails

slide-14
SLIDE 14

DEGRADING PERFORMANCE

[Chart: Restaurant List TTFB by quarter, Q4 '15 to Q1 '17]

slide-15
SLIDE 15

BUILD TIMES

  • 2013: 2 min
  • 2014: 4 min
  • 2015: 7 min
  • 2016: 25 min
  • 2017: 2h15

https://xkcd.com/303/
slide-16
SLIDE 16

DEVELOPMENT PROCESS

[Diagram: master, ticket, staging, QA]
slide-17
SLIDE 17

REDUCING DEVELOPMENT VELOCITY

  • CI becomes part of dev workflow
  • 70+ developers cause merge conflicts
  • “God objects” are hard to understand
  • “House of Cards” development in certain areas

slide-18
SLIDE 18

REDUCING DEVELOPMENT VELOCITY

[Chart: pull requests per engineer by quarter, Q4 '15 to Q1 '17]

slide-19
SLIDE 19

DECREASING RELIABILITY

  • Single problem can bring everything down:
      • Placing orders
      • Customer service
      • Rider dispatch
  • Rollbacks increasingly difficult with commit frequency
  • PG replication, analyse and vacuum settings critical

slide-20
SLIDE 20

DECREASING RELIABILITY

[Chart: uptime and number of unscheduled outages by quarter, Q4 '15 to Q1 '17]

slide-21
SLIDE 21

IT’S NOT GOING TO GET EASIER

[Chart: timeline marked 2015, 2017, 2019]

slide-22
SLIDE 22

HOW DO WE FIX THIS?

slide-23
SLIDE 23

LARGE SCALE ARCHITECTURE

[Diagram: CLIENT APPS, EDGE SERVICES, DOMAIN SERVICES, EVENT BUS, MONITORING]

slide-24
SLIDE 24

DOMAIN SERVICES

  • Own part of the domain
  • Granular, purely RESTful APIs
  • Send & receive from the bus
  • Use other domain service APIs

[Diagram: MONITORING, EVENT BUS, DOMAIN SERVICES, EDGE SERVICES, CLIENT APPS]
slide-25
SLIDE 25

EDGE SERVICES

  • Do not own any of the domain
  • Present a more aggregated API, implement search, etc.
  • Receive-only from the bus
  • Use domain or edge service APIs

[Diagram: MONITORING, EVENT BUS, DOMAIN SERVICES, EDGE SERVICES, CLIENT APPS]
slide-26
SLIDE 26

1-4 SERVICES/APPS PER TEAM

slide-27
SLIDE 27

12 FACTOR APPS

  • One codebase, many deploys
  • Scale out as stateless processes
  • Dev/prod parity (time, personnel, tools)
  • Find the rest at https://12factor.net/
slide-28
SLIDE 28

DATA SHARING RULES

  • No shared data stores — no exceptions
  • All internal data exposed as REST APIs — no RPC
  • Publish events when data changes — no payloads
slide-29
SLIDE 29

ACTUAL REST WITH HYPERMEDIA

{
  "_links": {
    "self": { "href": "https://api.deliveroo.com/orders/2457" },
    "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" },
    "user": { "href": "https://api.deliveroo.com/users/814" }
  },
  "id": 2457,
  "status": "placed",
  "deliver_from": "2016-09-21T09:25:00Z",
  "deliver_by": "2016-09-21T09:35:00Z",
  "scheduled": false
}

LINKS!!
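
A minimal Ruby sketch of what the links give a consumer: it reads the related resource's URL from `_links` rather than assembling it from an id. `link_href` is a hypothetical helper for illustration, not part of any Deliveroo library, and the document is the one from the slide, abbreviated.

```ruby
require "json"

# The order document from the slide, abbreviated.
order_json = <<~JSON
  {
    "_links": {
      "self": { "href": "https://api.deliveroo.com/orders/2457" },
      "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" },
      "user": { "href": "https://api.deliveroo.com/users/814" }
    },
    "id": 2457,
    "status": "placed"
  }
JSON

# Hypothetical helper: find a related resource by link relation.
# Raises KeyError if the document has no such link.
def link_href(resource, rel)
  resource.fetch("_links").fetch(rel).fetch("href")
end

order = JSON.parse(order_json)
puts link_href(order, "restaurant") # https://api.deliveroo.com/restaurants/203
```

The consumer never needs to know how restaurant URLs are constructed, so the owning service stays free to change them.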

slide-30
SLIDE 30

THE N+1 SELECTS PROBLEM

EDGE -> DOMAIN:
  GET /restaurants?geohash=gcpvhepze4b8
  GET /restaurants/812
  GET /restaurants/1873
  GET /restaurants/203
  GET /restaurants/1074
  GET /restaurants/1309
  GET /restaurants/2132
  GET /restaurants/2493
  GET /restaurants/2873
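
The request pattern above can be reproduced with a small Ruby sketch. `FakeDomain` is a hypothetical in-memory stand-in for the domain service's HTTP API (it only records the paths it is asked for), and the assumption that the list response contains links rather than full restaurants follows from the granular, hypermedia-style APIs described earlier.

```ruby
# Hypothetical stand-in for the domain service; records every request path.
class FakeDomain
  attr_reader :requests

  def initialize(restaurant_ids)
    @restaurant_ids = restaurant_ids
    @requests = []
  end

  def get(path)
    @requests << path
    if path.start_with?("/restaurants?")
      # The list response carries only links to the individual restaurants.
      @restaurant_ids.map { |id| "/restaurants/#{id}" }
    else
      { "id" => Integer(path.split("/").last) }
    end
  end
end

# The edge service asks for the list, then fetches each restaurant in turn.
def restaurant_list(domain, geohash)
  links = domain.get("/restaurants?geohash=#{geohash}")
  links.map { |link| domain.get(link) }
end

domain = FakeDomain.new([812, 1873, 203, 1074, 1309, 2132, 2493, 2873])
restaurant_list(domain, "gcpvhepze4b8")
puts domain.requests.size # 9: one list request plus eight item requests
```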
slide-31
SLIDE 31

THE N+1 SELECTS SOLUTION?

EDGE -> LOCAL CACHE -> DOMAIN:
  GET /restaurants?geohash=gcpvhepze4b8
  GET /restaurants/812
  GET /restaurants/1873
  GET /restaurants/203
  GET /restaurants/1074
  GET /restaurants/1309
  GET /restaurants/2132
  GET /restaurants/2493
  GET /restaurants/2873
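
A rough Ruby sketch of the local cache idea: a read-through cache placed between the edge service and the domain API, so repeated GETs for the same restaurant do not reach the domain service. `LocalCache` and the `domain_get` lambda are hypothetical names, and the lambda stands in for a real HTTP call.

```ruby
# Hypothetical read-through cache. `fetcher` is any callable that performs
# the real GET against the domain service.
class LocalCache
  attr_reader :misses

  def initialize(fetcher)
    @fetcher = fetcher
    @store = {}
    @misses = 0
  end

  def get(path)
    @store.fetch(path) do
      # Only reached when the path is not cached yet.
      @misses += 1
      @store[path] = @fetcher.call(path)
    end
  end
end

domain_get = ->(path) { { "href" => path } } # stand-in for an HTTP call
cache = LocalCache.new(domain_get)

paths = [812, 1873, 203].map { |id| "/restaurants/#{id}" }
2.times { paths.each { |path| cache.get(path) } }
puts cache.misses # 3: the second pass is served entirely from the cache
```

Nothing in this sketch ever expires or refreshes an entry, so it will happily serve stale data. That gap is the reason for the question mark in the slide title.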
slide-32
SLIDE 32

CACHE CORRECTNESS

ETag: high consistency but high latency
  EDGE -> GET /restaurants/203 -> LOCAL CACHE -> GET /restaurants/203 (If-None-Match: "xxx") -> DOMAIN

Cache-Control: low latency but low consistency
  EDGE -> GET /restaurants/203 -> LOCAL CACHE -> GET /restaurants/203 (only if expired) -> DOMAIN
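
The trade-off can be made concrete with a simulation in Ruby. Everything here is hypothetical and in-memory: `FakeRestaurantResource` plays the domain service and honours `If-None-Match`, `EtagCache` revalidates on every read, and `MaxAgeCache` reuses its copy until `max_age` seconds have passed. Time is passed in explicitly so the example is deterministic.

```ruby
# Hypothetical domain resource that understands If-None-Match.
class FakeRestaurantResource
  attr_reader :requests

  def initialize
    @body = { "name" => "Original" }
    @etag = "v1"
    @requests = 0
  end

  def update(name)
    @body = { "name" => name }
    @etag = @etag.succ # "v1" -> "v2"
  end

  # Returns [status, etag, body]; a 304 carries no body.
  def get(if_none_match: nil)
    @requests += 1
    return [304, @etag, nil] if if_none_match == @etag

    [200, @etag, @body]
  end
end

# ETag strategy: revalidate on every read. Never stale, always a round trip.
class EtagCache
  def initialize(resource)
    @resource = resource
    @etag = nil
    @body = nil
  end

  def get
    status, etag, body = @resource.get(if_none_match: @etag)
    if status == 200
      @etag = etag
      @body = body
    end
    @body
  end
end

# Cache-Control strategy: no round trip until the copy is max_age seconds old.
class MaxAgeCache
  def initialize(resource, max_age:)
    @resource = resource
    @max_age = max_age
    @fetched_at = nil
    @body = nil
  end

  def get(now)
    if @fetched_at.nil? || now - @fetched_at >= @max_age
      _status, _etag, @body = @resource.get
      @fetched_at = now
    end
    @body
  end
end

resource = FakeRestaurantResource.new
etag_cache = EtagCache.new(resource)
max_age_cache = MaxAgeCache.new(resource, max_age: 60)

etag_cache.get
max_age_cache.get(0)
resource.update("Renamed")

puts etag_cache.get["name"]        # Renamed: fresh, but it cost a request
puts max_age_cache.get(30)["name"] # Original: no request, but stale
```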
slide-33
SLIDE 33

REPRESENTATIONAL STATE NOTIFICATION (RESN)

  • Send CREATE/UPDATE/DELETE events for entities
{
  "topic": "restaurants",
  "type": "update",
  "href": "https://api.deliveroo.com/restaurants/203"
}
  • Low latency and high consistency

[Diagram: an UPDATE event for /restaurants/203 reaches the LOCAL CACHE, which issues GET /restaurants/203 to DOMAIN; EDGE GET /restaurants/203 is served from LOCAL CACHE]
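
A minimal Ruby sketch of a RESN consumer, under stated assumptions: `NotifiedCache` is a hypothetical class, the `names` hash stands in for the domain service, and the event is handed over as a JSON string in the shape shown on the slide. This version simply drops the cached entry and re-reads it on the next request. A consumer could equally refetch as soon as the event arrives.

```ruby
require "json"
require "uri"

# Hypothetical edge-side cache kept correct by events instead of expiry.
class NotifiedCache
  def initialize(fetcher)
    @fetcher = fetcher
    @store = {}
  end

  def get(path)
    @store[path] ||= @fetcher.call(path)
  end

  # The event says which resource changed but carries no data, so the
  # cached copy is discarded and the owning service remains the authority.
  def handle_event(json)
    event = JSON.parse(json)
    @store.delete(URI(event.fetch("href")).path)
  end
end

names = { "/restaurants/203" => "Original" } # stand-in for the domain service
cache = NotifiedCache.new(->(path) { { "name" => names.fetch(path) } })

cache.get("/restaurants/203")
names["/restaurants/203"] = "Renamed"
cache.handle_event(
  '{ "topic": "restaurants", "type": "update", ' \
  '"href": "https://api.deliveroo.com/restaurants/203" }'
)
puts cache.get("/restaurants/203")["name"] # Renamed
```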
slide-34
SLIDE 34

WHY NO PAYLOADS?

  • Transfer of authority from service to bus
  • Encourages incomplete domain modelling
  • Bus becomes a critical source for data loss
  • Need to handle multiple representations in consumers
slide-35
SLIDE 35

WHAT ABOUT STREAMS?

  • Tens of millions of location/availability pings per day
  • Nonsensical to model as entities with identity
  • Non-critical immutable value objects may be sent in payloads
{
  "_links": {
    "rider": { "href": "https://api.deliveroo.com/riders/872" }
  },
  "created_at": "2016-09-21T09:25:00Z",
  "latitude": 51.52168804,
  "longitude": -0.14303600
}
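
A short Ruby sketch of consuming such a payload directly, with no follow-up GET. `LocationPing` and `parse_ping` are hypothetical names, and the sketch assumes the coordinate key is spelled `longitude`. The value is frozen after parsing to reflect that a ping is an immutable value object with no identity of its own. Only the rider it belongs to is an entity, and that is referenced by link.

```ruby
require "json"
require "time"

# Hypothetical immutable value object for one location ping.
LocationPing = Struct.new(:rider_href, :created_at, :latitude, :longitude,
                          keyword_init: true)

def parse_ping(json)
  data = JSON.parse(json)
  LocationPing.new(
    rider_href: data.fetch("_links").fetch("rider").fetch("href"),
    created_at: Time.iso8601(data.fetch("created_at")),
    latitude: data.fetch("latitude"),
    longitude: data.fetch("longitude")
  ).freeze
end

ping = parse_ping(<<~JSON)
  {
    "_links": { "rider": { "href": "https://api.deliveroo.com/riders/872" } },
    "created_at": "2016-09-21T09:25:00Z",
    "latitude": 51.52168804,
    "longitude": -0.14303600
  }
JSON

puts ping.rider_href # https://api.deliveroo.com/riders/872
puts ping.frozen?    # true
```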
slide-36
SLIDE 36

LANGUAGES AND TOOLS

slide-37
SLIDE 37

WHICH LANGUAGE SHOULD WE USE?

slide-38
SLIDE 38

WE LOVE RUBY

  • Easier migration path from existing codebase
  • Well known within the company
  • Performant enough for most applications
  • Quick to write, test and iterate
  • No need to argue over approaches & standards

slide-39
SLIDE 39

ROO ON RAILS

$ rails new my_service --database=postgresql
$ cd my_service
$ echo "gem 'roo_on_rails'" >> Gemfile
$ bundle && bundle exec roo_on_rails
slide-40
SLIDE 40

CASE STUDY: LIVE OPERATIONS

slide-41
SLIDE 41

WHAT IS LIVE OPS?

  • Manual intervention to get orders delivered
  • Finding and resolving issues with orders, riders, etc.
  • As with software, things are easier to fix the earlier you find them
slide-42
SLIDE 42

LIVE OPERATIONS HISTORICALLY

[Table columns: order id, time, restaurant, address, zone, status]

REPEATED SCANNING

slide-43
SLIDE 43

WHAT ARE OUR GOALS?

  • Reduce investigations per order by 93%
  • Reduce or hold “unacceptable” orders
  • Give visibility into live issues
slide-44
SLIDE 44

LIVE OPERATIONS NOW: BACKEND

[Diagram: orders, riders, pickups, deliveries, live issues; EVENT HANDLERS, RULES ENGINE, NOTIFY HANDLERS]
slide-45
SLIDE 45

LIVE OPERATIONS NOW: DASHBOARD

slide-46
SLIDE 46

ARCHITECTURE: NOW

[Diagram: EVENT BUS (orders, riders, etc.), LIVE OPS API, DASHBOARD]
slide-47
SLIDE 47

ARCHITECTURE: NEXT

[Diagram: EVENT BUS (orders, riders, etc.; live issues, etc.), LIVE OPS API, LIVE ISSUES API, DASHBOARD; connections labelled "https" and "live issues"]
slide-48
SLIDE 48

ARCHITECTURE: FINAL?

[Diagram: EVENT BUS (orders, riders, etc.; live issues, etc.), LIVE OPS API, LIVE ISSUES API, DASHBOARD; connections labelled "https", "live issues" and "web sockets"]
slide-49
SLIDE 49

ENABLE TEAMS TO WORK LIKE STARTUPS

  • Identify problems in their area
  • Set goals and define metrics
  • Fast develop/test/deploy cycles
  • Evolve easily, but only when necessary
  • Succeed even with limited distributed experience
slide-50
SLIDE 50

gsb@deliveroo.co.uk @gregbeech