THE DISTRIBUTED PIT OF SUCCESS
Greg Beech
ABOUT ME
Lead Engineer @ Deliveroo — joined March 2015
- Tech lead for international expansion
- Set up Deliveroo for Business
- Currently rebuilding our Live Operations tooling
PAST
- Head of Platform Development @ blinkbox Books
- Principal Engineer @ blinkbox Movies
- Test Engineer @ Microsoft
ABOUT DELIVEROO
- Founded 2013: restaurant-quality food to your home
- Funding raised: $475M
- Daily orders growing year on year (chart, 2013–2017)
- 12 countries, 150 cities
- Engineers growing year on year (chart, 2013–2017)
- 600,000 SLOC, 38,000 commits, 6,900 pull requests, 3,200 deploys
TECHNOLOGY CHALLENGES
ARCHITECTURE
BIGGEST. HEROKU. APP. EVER.
- Heroku app limits
APP SERVERS
- General purpose models sub-optimal in most cases
- Caching difficult due to geo, availability, timings, etc.
- Long dyno boot time makes auto-scaling slow
- Constrained to Ruby on Rails
DEGRADING PERFORMANCE
Restaurant List TTFB (chart)
BUILD TIMES
- 2 min (2013), 4 min (2014), 7 min (2015), 25 min (2016), 2h15 (2017)
https://xkcd.com/303/
DEVELOPMENT PROCESS
master, ticket, staging, QA (branch diagram)
- CI becomes part of dev workflow
- 70+ developers causes merge conflicts
- “God objects” are hard to understand
- “House of Cards” development in certain areas
REDUCING DEVELOPMENT VELOCITY
Pull Requests per Engineer
- Single problem can bring everything down
- Placing orders
- Customer service
- Rider dispatch
- Rollbacks increasingly difficult with commit frequency
- Postgres replication, ANALYZE and VACUUM settings are critical
DECREASING RELIABILITY
Uptime and # Outages (Unscheduled) (charts)
IT’S NOT GOING TO GET EASIER
HOW DO WE FIX THIS?
LARGE SCALE ARCHITECTURE
Client apps, edge services, domain services, event bus, monitoring (diagram)
DOMAIN SERVICES
- Owns part of the domain
- Granular, purely RESTful APIs
- Send & receive from the bus
- Use other domain service APIs
EDGE SERVICES
- Does not own any of the domain
- Present a more aggregated API, implement search, etc.
- Receive-only from the bus
- Use domain or edge service APIs
1-4 SERVICES/APPS PER TEAM
12 FACTOR APPS
- One codebase, many deploys
- Scale out as stateless processes
- Dev/prod parity (time, personnel, tools)
- Find the rest at https://12factor.net/
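As a sketch of the "one codebase, many deploys" and stateless-process factors, all deploy-specific settings can come from the environment, so the same code runs unchanged in dev, staging and production. The variable names below are illustrative, not Deliveroo's actual configuration:

```ruby
# Minimal 12-factor config sketch: every deploy-specific setting is read
# from the environment; the codebase itself contains no per-deploy values.
# Variable names here are assumptions for illustration.
class AppConfig
  def initialize(env = ENV)
    @database_url = env.fetch("DATABASE_URL", "postgres://localhost/dev")
    @port         = Integer(env.fetch("PORT", "3000"))
    @log_level    = env.fetch("LOG_LEVEL", "info")
  end

  attr_reader :database_url, :port, :log_level
end

config = AppConfig.new("PORT" => "8080")
puts config.port       # 8080 (from the "environment")
puts config.log_level  # "info" (default)
```

Because the process holds no state of its own, any number of identical copies can be started with different environments to scale out.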
DATA SHARING RULES
- No shared data stores — no exceptions
- All internal data exposed as REST APIs — no RPC
- Publish events when data changes — no payloads
ACTUAL REST WITH HYPERMEDIA
{ "_links": { "self": { "href": "https://api.deliveroo.com/orders/2457" }, "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" }, "user": { "href": "https://api.deliveroo.com/users/814" } }, "id": 2457, "status": "placed", "deliver_from": "2016-09-21T09:25:00Z", "deliver_by": "2016-09-21T09:35:00Z", "scheduled": false }LINKS!!
THE N+1 SELECTS PROBLEM
EDGE → DOMAIN (diagram):
GET /restaurants?geohash=gcpvhepze4b8
GET /restaurants/812, /restaurants/1873, /restaurants/203, /restaurants/1074, /restaurants/1309, /restaurants/2132, /restaurants/2493, /restaurants/2873
THE N+1 SELECTS SOLUTION?
EDGE (local cache) → DOMAIN (diagram): the same listing request, with the individual GETs served from the local cache
CACHE CORRECTNESS
EDGE (local cache) → DOMAIN (diagram): GET /restaurants/203 with If-None-Match: "xxx"
- ETag: high consistency but high latency
- Cache-Control: low latency but low consistency
EDGE (local cache) → DOMAIN (diagram): GET /restaurants/203 (only if expired)
REPRESENTATIONAL STATE NOTIFICATION (RESN)
- Send CREATE/UPDATE/DELETE events for entities
- Low latency and high consistency
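One way RESN achieves both properties is by letting change events invalidate an edge cache, so reads stay local until the data actually changes. A minimal in-memory Ruby sketch; the classes and method names are illustrative, with networking and the bus stubbed out:

```ruby
# Sketch: the edge caches a representation and serves it without re-fetching.
# A RESN event (URL + change kind, no payload) drops the stale entry, so the
# next read re-fetches the authoritative state from the domain service.
class DomainService
  attr_accessor :data
  def initialize
    @data = { "id" => 203, "name" => "Pizza Place" }
  end

  def get
    @data.dup
  end
end

class EdgeCache
  def initialize(domain)
    @domain = domain
    @entry = nil
  end

  # Serve from cache; fall through to the domain service on a miss.
  def fetch
    @entry ||= @domain.get
  end

  # RESN handler: the event names the entity but carries no payload,
  # so we simply invalidate and re-fetch lazily.
  def on_event(_event)
    @entry = nil
  end
end

domain = DomainService.new
edge = EdgeCache.new(domain)

edge.fetch["name"]   # "Pizza Place" (fetched once, then cached)
domain.data = { "id" => 203, "name" => "Pizza Palace" }
edge.fetch["name"]   # still "Pizza Place" (served from cache, low latency)
edge.on_event(kind: "UPDATE", url: "https://api.deliveroo.com/restaurants/203")
edge.fetch["name"]   # "Pizza Palace" (re-fetched after the event, consistent)
```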
WHY NO PAYLOADS?
- Transfer of authority from service to bus
- Encourages incomplete domain modelling
- Bus becomes a critical point for data loss
- Need to handle multiple representations in consumers
WHAT ABOUT STREAMS?
- Tens of millions of location/availability pings per day
- Nonsensical to model as entities with identity
- Non-critical immutable value objects may be sent in payloads
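Because a ping has no identity and never changes, there is nothing to re-fetch, so the data can travel in the message itself. A hedged Ruby sketch with a plain array standing in for the stream; names are illustrative:

```ruby
# Sketch: a rider location ping as an immutable value object. Unlike entity
# events, it safely carries its full data in the payload, since consumers
# derive state purely from the stream and never call back to a service.
LocationPing = Struct.new(:rider_id, :lat, :lng, :at, keyword_init: true)

stream = []  # stand-in for the event bus stream

# Producer: publish the full payload.
stream << LocationPing.new(rider_id: 42, lat: 51.5141, lng: -0.1329,
                           at: "2017-06-01T12:00:00Z")
stream << LocationPing.new(rider_id: 42, lat: 51.5150, lng: -0.1310,
                           at: "2017-06-01T12:00:30Z")

# Consumer: derive state (last known position per rider) from payloads alone.
last_seen = stream.group_by(&:rider_id).transform_values(&:last)
puts last_seen[42].lat  # 51.515
```

Losing an individual ping is tolerable; the next one supersedes it, which is why this relaxation is safe only for non-critical values.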
LANGUAGES AND TOOLS
WHICH LANGUAGE SHOULD WE USE?
WE LOVE RUBY
- Easier migration path from existing codebase
- Well known within the company
- Performant enough for most applications
- Quick to write, test and iterate
- No need to argue over approaches & standards
ROO ON RAILS
$ rails new my_service --database=postgresql
$ cd my_service
$ echo "gem 'roo_on_rails'" >> Gemfile
$ bundle && bundle exec roo_on_rails
CASE STUDY: LIVE OPERATIONS
WHAT IS LIVE OPS?
- Manual intervention to get orders delivered
- Finding and resolving issues with orders, riders, etc.
- Like software, easier to fix things the earlier you find them
LIVE OPERATIONS HISTORICALLY
- order id, time, restaurant, address, zone, status (table)
REPEATED SCANNING
WHAT ARE OUR GOALS?
- Reduce investigations per order by 93%
- Reduce or hold “unacceptable” orders
- Give visibility into live issues
LIVE OPERATIONS NOW: BACKEND
- orders
LIVE OPERATIONS NOW: DASHBOARD
ARCHITECTURE: NOW
EVENT BUS (orders, riders, etc.) → LIVE OPS API → DASHBOARD (diagram)
ARCHITECTURE: NEXT
EVENT BUS (orders, riders, live issues, etc.) → LIVE OPS API → DASHBOARD (https: live issues) (diagram)
ARCHITECTURE: FINAL?
EVENT BUS (orders, riders, live issues, etc.) → LIVE OPS API → DASHBOARD (https: live issues) (diagram)
ENABLE TEAMS TO WORK LIKE STARTUPS
- Identify problems in their area
- Set goals and define metrics
- Fast develop/test/deploy cycles
- Evolve easily, but only when necessary
- Succeed even with limited distributed experience
gsb@deliveroo.co.uk @gregbeech