Runaway complexity in Big Data And a plan to stop it Nathan Marz - - PowerPoint PPT Presentation



SLIDE 1

Runaway complexity in Big Data

And a plan to stop it

Nathan Marz @nathanmarz

SLIDE 2

Agenda

  • Common sources of complexity in data systems
  • Design for a fundamentally better data system
SLIDE 3

What is a data system?

A system that manages the storage and querying of data

SLIDE 4

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years

SLIDE 5

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist

SLIDE 6

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure

SLIDE 7

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure, and every human mistake ever made

SLIDE 8

Common sources of complexity

  • Lack of human fault-tolerance
  • Schemas done wrong
  • Conflation of data and queries

SLIDE 9

Lack of human fault-tolerance

SLIDE 10

Human fault-tolerance

  • Bugs will be deployed to production over the lifetime of a data system
  • Operational mistakes will be made
  • Humans are part of the overall system, just like your hard disks, CPUs, memory, and software
  • Must design for human error like you’d design for any other fault
SLIDE 11

Human fault-tolerance

Examples of human error

  • Accidentally delete data from database
  • Deploy a bug that increments counters by two instead of by one
  • Accidental DoS on an important internal service
SLIDE 12

The worst consequence is data loss or data corruption

SLIDE 13

As long as an error doesn’t lose or corrupt good data, you can fix what went wrong

SLIDE 14

Mutability

  • The U and D in CRUD
  • A mutable system updates the current state of the world
  • Mutable systems inherently lack human fault-tolerance
  • Easy to corrupt or lose data
SLIDE 15

Immutability

  • An immutable system captures a historical record of events
  • Each event happens at a particular time and is always true
SLIDE 16

Capturing change with mutable data model

Before:                     After:
Person  Location            Person  Location
Sally   Philadelphia        Sally   New York
Bob     Chicago             Bob     Chicago

Sally moves to New York

SLIDE 17

Capturing change with immutable data model

Before:
Person  Location      Time
Sally   Philadelphia  1318358351
Bob     Chicago       1327928370

After:
Person  Location      Time
Sally   Philadelphia  1318358351
Bob     Chicago       1327928370
Sally   New York      1338469380

Sally moves to New York
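The immutable model above is easy to sketch in code. The tuples mirror the slide's fact table, and current state is derived by taking each person's most recent fact; the helper names are illustrative:

```python
# Immutable data model: facts are (person, location, timestamp) tuples that
# are only ever appended, never updated in place.

facts = [
    ("Sally", "Philadelphia", 1318358351),
    ("Bob", "Chicago", 1327928370),
]

def record_move(facts, person, location, timestamp):
    """The only write: append a new fact. The old fact stays as history."""
    facts.append((person, location, timestamp))

def current_location(facts, person):
    """Derive current state: the fact with the latest timestamp for this person."""
    person_facts = [f for f in facts if f[0] == person]
    return max(person_facts, key=lambda f: f[2])[1]

record_move(facts, "Sally", "New York", 1338469380)
```

Sally's move adds a third fact rather than overwriting the first, which is exactly why a buggy or mistaken write can't destroy the history.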

SLIDE 18

Immutability greatly restricts the range of errors that can cause data loss or data corruption

SLIDE 19

Vastly more human fault-tolerant

SLIDE 20

Immutability

Other benefits

  • Fundamentally simpler
  • CR instead of CRUD
  • Only write operation is appending new units of data
  • Easy to implement on top of a distributed filesystem
  • File = list of data records
  • Append = Add a new file into a directory
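A minimal sketch of the "file = list of data records, append = add a new file into a directory" idea above, using the local filesystem to stand in for a distributed one (HDFS would work the same way); the file naming and record format are illustrative:

```python
import json
import os
import tempfile
import uuid

def append_records(data_dir, records):
    """The only write operation: drop a new immutable file into the directory."""
    path = os.path.join(data_dir, f"{uuid.uuid4().hex}.jsonl")
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def read_all(data_dir):
    """Reading = concatenating every file; the order of files does not matter."""
    out = []
    for name in os.listdir(data_dir):
        with open(os.path.join(data_dir, name)) as f:
            out.extend(json.loads(line) for line in f)
    return out

data_dir = tempfile.mkdtemp()
append_records(data_dir, [{"person": "Sally", "location": "Philadelphia"}])
append_records(data_dir, [{"person": "Bob", "location": "Chicago"}])
```

Because no file is ever modified after it is written, there is no update path that can corrupt existing data.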
SLIDE 21

Immutability

Please watch Rich Hickey’s talks to learn more about the enormous benefits of immutability

SLIDE 22

Conflation of data and queries

SLIDE 23

Conflation of data and queries

Normalization vs. denormalization

People table:
ID  Name    Location ID
1   Sally   3
2   George  1
3   Bob     3

Locations table:
Location ID  City       State  Population
1            New York   NY     8.2M
2            San Diego  CA     1.3M
3            Chicago    IL     2.7M

Normalized schema

SLIDE 24

Join is too expensive, so denormalize...

SLIDE 25

People table:
ID  Name    Location ID  City      State
1   Sally   3            Chicago   IL
2   George  1            New York  NY
3   Bob     3            Chicago   IL

Locations table:
Location ID  City       State  Population
1            New York   NY     8.2M
2            San Diego  CA     1.3M
3            Chicago    IL     2.7M

Denormalized schema

SLIDE 26

Obviously, you prefer all data to be fully normalized

SLIDE 27

But you are forced to denormalize for performance

SLIDE 28

Because the way data is modeled, stored, and queried is complected

SLIDE 29

We will come back to how to build data systems in which these are disassociated

SLIDE 30

Schemas done wrong

SLIDE 31

Schemas have a bad rap

SLIDE 32

Schemas

  • Hard to change
  • Get in the way
  • Add development overhead
  • Require annoying configuration
SLIDE 33

I know! Use a schemaless database!

SLIDE 34

This is an overreaction

SLIDE 35

Confuses the poor implementation of schemas with the value that schemas provide

SLIDE 36

What is a schema exactly?

SLIDE 37

function(data unit)

SLIDE 38

That says whether this data is valid or not
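Taken literally, a schema in this sense is just a predicate on a data unit. A minimal sketch, with illustrative field names:

```python
def person_location_schema(data_unit):
    """Return True if this data unit is structurally valid, False otherwise."""
    required = {"person": str, "location": str, "timestamp": int}
    if set(data_unit) != set(required):
        return False  # missing or extra fields
    return all(isinstance(data_unit[k], t) for k, t in required.items())
```

Anything the data system is asked to store first passes through this function, so malformed units are rejected at write time rather than discovered at read time.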
SLIDE 39

This is useful

SLIDE 40

Value of schemas

  • Structural integrity
  • Guarantees on what can and can’t be stored
  • Prevents corruption
SLIDE 41

Otherwise you’ll detect corruption issues at read-time

SLIDE 42

Potentially long after the corruption happened

SLIDE 43

With little insight into the circumstances of the corruption

SLIDE 44

Much better to get an exception where the mistake is made, before it corrupts the database

SLIDE 45

Saves enormous amounts of time

SLIDE 46

Why are schemas considered painful?

  • Changing the schema is hard (e.g., adding a column to a table)
  • Schemas are overly restrictive (e.g., cannot do nested objects)
  • Require translation layers (e.g., ORMs)
  • Require more typing (development overhead)
SLIDE 47

None of these are fundamentally linked with function(data unit)

SLIDE 48

These are problems in the implementation of schemas, not in schemas themselves

SLIDE 49

Ideal schema tool

  • Data is represented as maps
  • Schema tool is a library that helps construct the schema function:
  • Concisely specify required fields and types
  • Insert custom validation logic for fields (e.g. ages are between 0 and 200)
  • Built-in support for evolving the schema over time
  • Fast and space-efficient serialization/deserialization
  • Cross-language

This is easy to use and gets out of your way. I use Apache Thrift, but it lacks the custom validation logic; I think it could be done better with a Clojure-like data-as-maps approach. Given the parameters of a data system (long-lived, ever changing, with mistakes being made), the amount of work it takes to make a schema (not that much) is absolutely worth it.
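To make the wishlist above concrete, here is a sketch of such a schema-construction library. This illustrates the idea only; it is not Apache Thrift's actual API, and `make_schema` and its parameters are invented for the example:

```python
def make_schema(fields, validators=()):
    """Build a schema function from declared fields plus custom validation.

    fields: {field_name: expected_type}
    validators: extra predicates applied to the whole data unit
    """
    def schema(data_unit):
        if set(data_unit) != set(fields):
            return False  # required fields missing, or unknown fields present
        if not all(isinstance(data_unit[k], t) for k, t in fields.items()):
            return False  # type mismatch
        return all(v(data_unit) for v in validators)
    return schema

person_schema = make_schema(
    {"name": str, "age": int},
    validators=[lambda d: 0 <= d["age"] <= 200],  # the slide's age-range example
)
```

The returned value is still just `function(data unit)`; the library only helps you construct it concisely.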

SLIDE 50

Let’s get provocative

SLIDE 51

The relational database will be a footnote in history

SLIDE 52

Not because of SQL, restrictive schemas, or scalability issues

SLIDE 53

But because of fundamental flaws in the RDBMS approach to managing data

SLIDE 54

Mutability

SLIDE 55

Conflating the storage of data with how it is queried

SLIDE 56

“NewSQL” is misguided

SLIDE 57

Let’s use our ability to cheaply store massive amounts of data

SLIDE 58

To do data right

SLIDE 59

And not inherit the complexities of the past
SLIDE 60

I know! Use a NoSQL database!

(If SQL’s wrong, and NoSQL isn’t SQL, then NoSQL must be right.)

SLIDE 61

NoSQL databases are generally not a step in the right direction

SLIDE 62

Some aspects are, but not the ones that get all the attention
SLIDE 63

Still based on mutability and not general purpose

SLIDE 64

Let’s start from scratch

Let’s see how you design a data system that doesn’t suffer from these complexities

SLIDE 65

What does a data system do?

SLIDE 66

Retrieve data that you previously stored?

Get Put

SLIDE 67

Not really...

SLIDE 68

Counterexamples

Store location information on people:
  • Where does Sally live?
  • What are the most populous locations?
  • How many people live in a particular location?

SLIDE 69

Counterexamples

Store pageview information:
  • How many unique visitors over time?
  • How many pageviews on September 2nd?

SLIDE 70

Counterexamples

Store transaction history for a bank account:
  • How much money do people spend on housing?
  • How much money does George have?

SLIDE 71

What does a data system do?

Query = Function(All data)
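The counterexample queries from the previous slides can each be written literally as a pure function over all the data. A sketch using the immutable person/location facts from the earlier slides (function names are illustrative):

```python
# All data: the complete, immutable set of (person, location, timestamp) facts.
all_data = [
    ("Sally", "Philadelphia", 1318358351),
    ("Bob", "Chicago", 1327928370),
    ("Sally", "New York", 1338469380),
]

def where_does_person_live(data, person):
    """Pure function of all data: the person's most recent location fact."""
    latest = max((f for f in data if f[0] == person), key=lambda f: f[2])
    return latest[1]

def people_in_location(data, location):
    """Pure function of all data: how many people currently live there."""
    current = {}
    for person, loc, ts in sorted(data, key=lambda f: f[2]):
        current[person] = loc  # later facts win
    return sum(1 for loc in current.values() if loc == location)
```

Neither query retrieves a stored value directly; both derive their answer from the full history, which is what makes `Query = Function(All data)` the general formulation.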

SLIDE 72

Sometimes you retrieve what you stored

SLIDE 73

Oftentimes you do transformations, aggregations, etc.

SLIDE 74

Queries as pure functions that take all data as input is the most general formulation

SLIDE 75

Example query

Total number of pageviews to a URL over a range of time

SLIDE 76

Example query

Implementation
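A plausible sketch of the on-the-fly implementation: scan every pageview record and count the matches (the record fields here are illustrative, not the slide's actual code):

```python
pageviews = [
    {"url": "/about", "time": 100},
    {"url": "/about", "time": 250},
    {"url": "/", "time": 300},
    {"url": "/about", "time": 900},
]

def total_pageviews(all_pageviews, url, start, end):
    """Query as a pure function of all data: pageviews to url in [start, end)."""
    return sum(1 for pv in all_pageviews
               if pv["url"] == url and start <= pv["time"] < end)
```

This is the most literal reading of `Query = Function(All data)`, which is exactly why the next slide points out it is too slow at scale.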

SLIDE 77

Too slow: “all data” is petabyte-scale

SLIDE 78

On-the-fly computation

All data

Query

SLIDE 79

Precomputation

All data

Precomputed view Query

SLIDE 80

Precomputed view

Example query

[Diagram: All data (Pageview, Pageview, Pageview, ...) → Precomputed view → Query → 2930]
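One way to sketch this precomputation: the batch function rolls raw pageviews up into hourly (url, hour) buckets, so the query only sums a handful of counters instead of scanning the raw data. The bucket granularity is illustrative:

```python
from collections import defaultdict

HOUR = 3600  # bucket size in seconds; an assumption for this sketch

def compute_view(all_pageviews):
    """Batch function: all data -> precomputed (url, hour) -> count view."""
    view = defaultdict(int)
    for pv in all_pageviews:
        view[(pv["url"], pv["time"] // HOUR)] += 1
    return dict(view)

def query(view, url, start_hour, end_hour):
    """Query against the view: sum the relevant hourly buckets."""
    return sum(view.get((url, h), 0) for h in range(start_hour, end_hour))

raw = [{"url": "/about", "time": 10},
       {"url": "/about", "time": 3700},
       {"url": "/", "time": 20}]
view = compute_view(raw)
```

The view is still a pure function of all the data; it just moves most of the work from query time to precomputation time.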

SLIDE 81

Precomputation

All data

Precomputed view Query

SLIDE 82

Precomputation

All data

Precomputed view Query

Function Function

SLIDE 83

Data system

All data

Precomputed view Query

Function Function

Two problems to solve

SLIDE 84

Data system

All data

Precomputed view Query

Function Function

How to compute views

SLIDE 85

Data system

All data

Precomputed view Query

Function Function

How to compute queries from views

SLIDE 86

Computing views

All data

Precomputed view

Function

SLIDE 87

Function that takes in all data as input

SLIDE 88

Batch processing

SLIDE 89

MapReduce

SLIDE 90

MapReduce is a framework for computing arbitrary functions on arbitrary data

SLIDE 91

Expressing those functions

Cascalog Scalding

SLIDE 92

All data

Batch view #1 Batch view #2

MapReduce workflow

MapReduce precomputation

SLIDE 93

Batch views are optimized for the queries they serve

SLIDE 94

Batch views

  • Batch-writable from MapReduce
  • Fast random reads
  • Examples: ElephantDB, Voldemort
SLIDE 95

Batch view database

No random writes required!

SLIDE 96

Properties

All data

Batch view

Function

Simple

ElephantDB is only a few thousand lines of code

SLIDE 97

Properties

All data

Batch view

Function

Scalable

SLIDE 98

Properties

All data

Batch view

Function

Highly available

SLIDE 99

Properties

All data

Batch view

Function

Can be heavily optimized (b/c no random writes)

SLIDE 100

Properties

All data

Batch view

Function

Normalized

SLIDE 101

Properties

All data

Batch view

Function

“Denormalized”

Not exactly denormalization, because you’re doing more than just retrieving the data you stored (you can do aggregations). You’re able to optimize data storage separately from data modeling, without the complexity typical of denormalization in relational databases. *This is because the batch view is a pure function of all data*: it’s hard for it to get out of sync, and if there’s ever a problem (like a bug in your code that computes the wrong batch view), you can recompute. It’s also easy to debug problems, since you have the input that produced the batch view. This is not true in a mutable system based on incremental updates.
SLIDE 102

So we’re done, right?

SLIDE 103

Not quite...

  • A batch workflow is too slow
  • Views are out of date

[Timeline: everything older than the last batch run is absorbed into batch views; only the most recent data is not absorbed]

Just a few hours of data!
SLIDE 104

What’s left?

Precompute views for last few hours of data

SLIDE 105

Application queries

Realtime view Batch view

Merge
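A sketch of the merge step, assuming an additive query like pageview counts: the batch view covers everything up to the last batch run, the realtime view covers the few hours since, and the application combines the two at query time. The values and names here are illustrative:

```python
batch_view = {"/about": 2930}            # produced by the last batch run
realtime_view = {"/about": 12, "/": 3}   # pageviews arrived since that run

def query_total(url):
    """Application query: merge the batch answer with the realtime delta."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)
```

Additive counts merge by summing; other query types need their own merge function, which is part of why the realtime side is the complex part of the system.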

SLIDE 106

NoSQL databases

New data stream

Realtime view #1 Realtime view #2

Stream processor

SLIDE 107

Precomputation

All data

Precomputed view Query

SLIDE 108

Precomputation

All data

Precomputed batch view Query Precomputed realtime view

New data stream

“Lambda Architecture”

SLIDE 109

Precomputation

All data

Precomputed batch view Query Precomputed realtime view

New data stream Most complex part of system

SLIDE 110

Precomputation

All data

Precomputed batch view Query Precomputed realtime view

New data stream Random write dbs much more complex

This is where things like vector clocks have to be dealt with if using an eventually consistent NoSQL database

SLIDE 111

Precomputation

All data

Precomputed batch view Query Precomputed realtime view

New data stream But only represents few hours of data

SLIDE 112

Precomputation

All data

Precomputed batch view Query Precomputed realtime view

New data stream If anything goes wrong, auto-corrects

Can continuously discard realtime views, keeping them small

SLIDE 113

Precomputation

All data

Precomputed batch view Query Precomputed realtime view

New data stream “Complexity isolation”

Can continuously discard realtime views, keeping them small

SLIDE 114

Eventual accuracy

Sometimes hard to compute exact answer in realtime

SLIDE 115

Eventual accuracy

Example: unique count

SLIDE 116

Eventual accuracy

Can compute exact answer in batch layer and approximate answer in realtime layer

Though for functions which can be computed exactly in the realtime layer (e.g. counting), you can achieve full accuracy
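A sketch of this split for unique counts: the batch layer keeps an exact set, while the realtime layer uses a k-minimum-values cardinality estimator, one small fixed-size structure standing in here for whatever sketch (e.g. HyperLogLog-style) a real realtime layer would use. The parameters are illustrative:

```python
import hashlib

def exact_uniques(all_visitors):
    """Batch layer: exact answer computed from all data."""
    return len(set(all_visitors))

def approximate_uniques(visitor_stream, k=128):
    """Realtime layer: estimate cardinality from the k smallest hash values.

    Hashes are normalized to [0, 1); for n distinct values the k-th smallest
    hash is near k/n, so (k - 1) / h_k estimates n with ~1/sqrt(k) error.
    """
    hashes = sorted({int(hashlib.md5(v.encode()).hexdigest(), 16) / 2**128
                     for v in visitor_stream})
    if len(hashes) < k:
        return len(hashes)  # fewer than k distinct values: answer is exact
    return int((k - 1) / hashes[k - 1])

visitors = [f"user-{i % 1000}" for i in range(5000)]  # 1000 true uniques
```

The estimator's state is a few hundred numbers regardless of stream size, which is what makes it cheap enough for the realtime layer; the batch layer later replaces the estimate with the exact count.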

SLIDE 117

Eventual accuracy

Best of both worlds of performance and accuracy

SLIDE 118

Tools

  • All data: HDFS
  • Batch computation: MapReduce
  • Batch views: ElephantDB, Voldemort
  • New data stream: Kafka
  • Stream processing: Storm
  • Realtime views: Cassandra, Riak, HBase

“Lambda Architecture”

SLIDE 119

Lambda Architecture

  • Can discard batch views and realtime views and recreate everything

from scratch

  • Data storage layer optimized independently from query resolution layer
  • Mistakes corrected via recomputation

What mistakes can be made?

  • Wrote bad data? Remove the bad data and recompute the views.
  • Bug in the functions that compute a view? Recompute the view.
  • Bug in a query function? Just deploy the fix.
SLIDE 120

Lambda Architecture

  • Batch and realtime views can be swapped for other stores as needed
  • Function(All data) basis means it will support your future needs

SLIDE 121

Learn more

http://manning.com/marz

SLIDE 122

Questions?

Thanks to Gary Fredericks for the dongle!