
Real world tales of repair APACHE BIGDATA - MAY 2017 Alexander Dejanovski @alexanderdeja Consultant www.thelastpickle.com Datastax MVP for Apache Cassandra Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License


slide-1
SLIDE 1

Real world tales of repair

slide-2
SLIDE 2

APACHE BIGDATA - MAY 2017

Alexander Dejanovski @alexanderdeja Consultant www.thelastpickle.com Datastax MVP for Apache Cassandra

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
slide-3
SLIDE 3

About The Last Pickle


We help people deliver and improve Apache Cassandra-based solutions, with staff in 5 countries: New Zealand, Australia, France, Spain, and the USA.

slide-4
SLIDE 4

  • What and why?
  • Full repair
  • Incremental repair
  • How to make it work

www.thelastpickle.com

slide-5
SLIDE 5

What is repair?

A maintenance operation that (briefly) restores strong consistency throughout the cluster

slide-6
SLIDE 6

Why do we need repair?

  • Eventual consistency
  • Downtime / failure recovery
  • Safe deletes

slide-7
SLIDE 7

Tombstones need repair too
 


Missing tombstones can lead to zombie data (repair within gc_grace_seconds)
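
A toy simulation of the zombie data scenario above. The replica model (plain dicts, a `TOMBSTONE` sentinel, the `purge_tombstones` and `sync` helpers) is an illustrative assumption and not Cassandra's storage engine; only the 10-day default for gc_grace_seconds is taken from Cassandra itself.

```python
# Toy model: each replica maps key -> (write_timestamp, value).
# TOMBSTONE marks a delete. All names here are illustrative.
TOMBSTONE = object()
GC_GRACE_SECONDS = 864000  # 10 days, the default table setting


def purge_tombstones(replica, now):
    """Drop tombstones older than gc_grace_seconds, as compaction eventually does."""
    expired = [k for k, (ts, v) in replica.items()
               if v is TOMBSTONE and now - ts > GC_GRACE_SECONDS]
    for key in expired:
        del replica[key]


def sync(replicas):
    """Make every replica agree on the newest cell per key (what repair achieves)."""
    for key in {k for r in replicas for k in r}:
        newest = max((r[key] for r in replicas if key in r), key=lambda cell: cell[0])
        for r in replicas:
            r[key] = newest


def simulate(repair_in_time):
    a, b, c = ({"k": (100, "v")} for _ in range(3))
    # DELETE at t=200 reaches a and b; c was down and misses it.
    a["k"] = b["k"] = (200, TOMBSTONE)
    if repair_in_time:
        sync([a, b, c])  # repair before gc_grace_seconds elapses
    now = 200 + GC_GRACE_SECONDS + 1
    for r in (a, b, c):
        purge_tombstones(r, now)
    sync([a, b, c])  # a later repair (or read repair)
    return a.get("k")


zombie = simulate(repair_in_time=False)
clean = simulate(repair_in_time=True)
print(zombie)  # (100, 'v'): the deleted value is back on every replica
print(clean)   # None: the delete held
```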


slide-14
SLIDE 14

  • What and why?
  • Full repair
  • Incremental repair
  • How to make it work

slide-15
SLIDE 15

How does anti-entropy repair work?

  • Reads all data
  • Calculates hashes
  • Compares hashes
  • Streams mismatching partitions

slide-20
SLIDE 20

A Merkle tree is requested from all replicas

slide-21
SLIDE 21

Validation compaction


slide-22
SLIDE 22

Merkle tree comparison


slide-23
SLIDE 23

Streaming
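
The four steps above (Merkle tree request, validation, comparison, streaming) can be sketched as follows. This is a minimal sketch and not Cassandra's implementation: the 16-bit token space, the 8 leaves, SHA-256, and all function names are assumptions chosen to keep the example small.

```python
import hashlib

TOKEN_SPACE = 1 << 16   # toy token space, not the real partitioner range
LEAVES = 8              # a real tree has far more leaves


def leaf_index(token):
    return token * LEAVES // TOKEN_SPACE


def leaf_hashes(partitions):
    """partitions: dict token -> value. One digest per leaf (the validation step)."""
    hashers = [hashlib.sha256() for _ in range(LEAVES)]
    for token in sorted(partitions):
        hashers[leaf_index(token)].update(f"{token}:{partitions[token]}".encode())
    return [h.digest() for h in hashers]


def build_tree(leaves):
    """Returns the list of levels, from the leaves up to the single root hash."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels


def mismatching_leaves(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []
    frontier = [0]
    for level in range(top - 1, -1, -1):
        frontier = [child for node in frontier for child in (2 * node, 2 * node + 1)
                    if tree_a[level][child] != tree_b[level][child]]
    return frontier


replica_1 = {100: "a", 9000: "b", 20000: "c", 60000: "d"}
replica_2 = {100: "a", 9000: "b", 20000: "STALE", 60000: "d"}
diff = mismatching_leaves(build_tree(leaf_hashes(replica_1)),
                          build_tree(leaf_hashes(replica_2)))
ranges = [(i * TOKEN_SPACE // LEAVES, (i + 1) * TOKEN_SPACE // LEAVES) for i in diff]
print(ranges)  # token ranges that would be streamed between the two replicas
```

Only the leaf covering token 20000 differs, so only that token range is exchanged; every partition that shares the leaf travels with it.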


slide-24
SLIDE 24

How do we run repair?


 nodetool repair

slide-25
SLIDE 25

Improving repair

Repairing each range once is enough

 nodetool repair -pr

Not suitable for node recovery: -pr only covers the node’s primary ranges, not every range it replicates
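
A small sketch of why -pr repairs each range once, and why it falls short when recovering a single node. The 5-node ring, one token per node, the replication factor of 3, and the SimpleStrategy-like placement are all assumptions made for illustration.

```python
from collections import Counter

RF = 3
# Hypothetical 5-node ring, one token per node (no vnodes).
ring = {"n1": 0, "n2": 20, "n3": 40, "n4": 60, "n5": 80}
nodes = sorted(ring, key=ring.get)


def primary_range(node):
    """(previous node's token, own token]: the range this node is first replica for."""
    i = nodes.index(node)
    return (ring[nodes[i - 1]], ring[node])


def replicated_ranges(node):
    """The node's primary range plus the primary ranges of the RF-1 preceding nodes."""
    i = nodes.index(node)
    return [primary_range(nodes[i - k]) for k in range(RF)]


full = Counter(r for n in nodes for r in replicated_ranges(n))
with_pr = Counter(primary_range(n) for n in nodes)
print(set(full.values()))     # {3}: plain repair on every node repairs each range RF times
print(set(with_pr.values()))  # {1}: -pr on every node repairs each range exactly once

# Recovering n3 alone with -pr covers only 1 of the 3 ranges it holds.
missed = set(replicated_ranges("n3")) - {primary_range("n3")}
print(sorted(missed))
```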

slide-31
SLIDE 31

Sequential or parallel?

Sequential: takes a snapshot on all replicas and computes Merkle trees one replica at a time (on the snapshots)

slide-32
SLIDE 32

Sequential or parallel?

Parallel: no snapshot, all replicas compute Merkle trees at the same time

slide-33
SLIDE 33

Repair too slow?


 Sequential repair is the default since C* 2.0

slide-34
SLIDE 34

Repair too slow?


 nodetool repair -par

slide-35
SLIDE 35

The problem with dense nodes

Overstreaming: leaves of the Merkle tree contain several partitions (32k leaves at most), so a single mismatching partition causes its whole leaf to be streamed.
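
A back-of-the-envelope sketch of the overstreaming effect. The 32k leaf cap comes from the slide; the partition counts and the assumption that partitions spread evenly across leaves are hypothetical.

```python
MAX_LEAVES = 32 * 1024  # "32k leaves at most"


def partitions_per_leaf(partitions_in_range):
    """Average number of partitions hashed together into one Merkle tree leaf."""
    leaves = max(1, min(MAX_LEAVES, partitions_in_range))
    return partitions_in_range / leaves


def overstreamed(partitions_in_range, mismatching_leaves=1):
    """Partitions streamed when each mismatching leaf differs by a single partition."""
    return round(partitions_per_leaf(partitions_in_range) * mismatching_leaves)


# Hypothetical dense node: 1 billion partitions repaired in one session,
# versus the same data split into 100 smaller repair sessions.
whole = overstreamed(1_000_000_000)
subrange = overstreamed(1_000_000_000 // 100)
print(whole, subrange)  # 30518 305
```

With these figures, one stale partition drags roughly 30,000 healthy ones across the network, which is why smaller repair sessions help on dense nodes.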


slide-36
SLIDE 36

The solutions with dense nodes

  • cassandra_range_repair (Matt Stump & Brian Gallew): breaks the repair sessions into n steps
  • Cassandra reaper (Spotify): full orchestration tool for repairs, with subrange repair support
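
A minimal sketch of the "n steps" idea behind subrange repair, assuming a non-wrapping token range. `split_range`, `repair_commands`, and the keyspace name are hypothetical and are not taken from either tool; `-st` and `-et` are the start-token and end-token options of nodetool repair.

```python
def split_range(start, end, steps):
    """Split the token range (start, end] into `steps` contiguous subranges."""
    if not start < end:
        raise ValueError("wrap-around ranges are not handled in this sketch")
    width = end - start
    bounds = [start + width * i // steps for i in range(steps)] + [end]
    return list(zip(bounds, bounds[1:]))


def repair_commands(keyspace, start, end, steps):
    """One nodetool invocation per subrange, to be run one after the other."""
    return [f"nodetool repair -st {st} -et {et} {keyspace}"
            for st, et in split_range(start, end, steps)]


for cmd in repair_commands("my_keyspace", 0, 1000, 4):
    print(cmd)
```

Each subrange gets its own Merkle tree, so the leaf cap applies to a much smaller slice of data.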


slide-37
SLIDE 37

The solutions with dense nodes

vnodes: one repair session per vnode
Drawback: if you have many vnodes, repair takes longer

slide-38
SLIDE 38

Repair in…


slide-39
SLIDE 39

The early days of your cluster

Node density is low, repair works just fine however you run it.


slide-40
SLIDE 40

The early days of your cluster

So maybe, like I did, you run « nodetool repair » on all nodes… at the same time

slide-41
SLIDE 41

The (not so) early days of your cluster

As nodes get denser, repair takes longer… and longer…

slide-42
SLIDE 42

The (not so) early days of your cluster

… and latencies rise, as repair is a CPU- and I/O-intensive operation

slide-43
SLIDE 43

Your cluster is a grown-up now

… until it breaks your cluster

slide-44
SLIDE 44

How can it break?

  • Load gets too high
  • You don’t meet your latency SLA anymore
  • Streams get stuck
  • … and out of nowhere, all nodes start to eat all your CPU doing nothing

slide-49
SLIDE 49

The fun part?

You need to run repair to recover from the repair outage!

slide-50
SLIDE 50

The cluster keeps growing

And you realize orchestration is needed to stop blowing up your cluster


slide-51
SLIDE 51

Orchestrating repair

Repair must not run on all nodes at the same time


slide-52
SLIDE 52

Tools to orchestrate repairs

  • OpsCenter repair service (DSE users)
  • Cassandra reaper

slide-53
SLIDE 53

Cassandra reaper

https://github.com/spotify/cassandra-reaper
https://github.com/thelastpickle/cassandra-reaper

slide-54
SLIDE 54

Cassandra reaper

  • Performs subrange repair
  • Limits repair pressure
  • Retries failed sessions
  • (Auto-)schedules cyclic repairs
  • Optimizes cluster load

slide-59
SLIDE 59

Cassandra reaper - with UI (thanks to Stefan Podkowinski)

GUI screenshots

slide-60
SLIDE 60

  • What and why?
  • Full repair
  • Incremental repair
  • How to make it work
  • Automated repairs

slide-61
SLIDE 61

What if we stopped repairing repaired data?

slide-62
SLIDE 62

Here comes the savior!

  • C* 2.1 introduces incremental repair
  • Default repair mode since C* 2.2

slide-63
SLIDE 63

How does incremental repair work?

slide-64
SLIDE 64

Anticompaction


slide-65
SLIDE 65

Anticompaction (repair on all ranges on local node)


slide-66
SLIDE 66

Incremental repair looks awesome…

…but has flaws and drawbacks


slide-67
SLIDE 67

Incremental repair caveats

Carefully prepare your switch to incremental repair, i.e. do not run « nodetool repair -inc » straight away…

slide-69
SLIDE 69

Incremental repair caveats

It doesn’t handle missing/corrupted data that was already repaired


slide-70
SLIDE 70

Incremental repair caveats

It splits SSTables into 2 sets that cannot be compacted together (think tombstone purge)
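
A toy model of the split described above. The dict-based SSTable shape and the helper names are assumptions for illustration only; the one detail borrowed from Cassandra is that an SSTable carries a repairedAt value, with 0 meaning unrepaired.

```python
def in_ranges(token, ranges):
    """True if token falls in any (start, end] range."""
    return any(start < token <= end for start, end in ranges)


def anticompact(sstable, repaired_ranges, repaired_at):
    """Split one SSTable into a repaired part and an unrepaired part.

    sstable: dict with "repaired_at" (0 = unrepaired) and "rows" (token -> value).
    """
    repaired = {t: v for t, v in sstable["rows"].items() if in_ranges(t, repaired_ranges)}
    unrepaired = {t: v for t, v in sstable["rows"].items() if t not in repaired}
    out = []
    if repaired:
        out.append({"repaired_at": repaired_at, "rows": repaired})
    if unrepaired:
        out.append({"repaired_at": 0, "rows": unrepaired})
    return out


def compaction_candidates(sstables):
    """Repaired and unrepaired SSTables form two pools that are compacted separately."""
    repaired = [s for s in sstables if s["repaired_at"] > 0]
    unrepaired = [s for s in sstables if s["repaired_at"] == 0]
    return repaired, unrepaired


sstable = {"repaired_at": 0, "rows": {10: "a", 35: "b", 70: "c"}}
pieces = anticompact(sstable, repaired_ranges=[(0, 50)], repaired_at=1494000000)
repaired_set, unrepaired_set = compaction_candidates(pieces)
print(repaired_set)
print(unrepaired_set)
```

Because the two pools never meet in a compaction, a tombstone in one pool and the data it shadows in the other can linger until the unrepaired data gets repaired too.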


slide-71
SLIDE 71

Incremental repair caveats

It is incompatible with subrange repair (anticompaction)


slide-72
SLIDE 72

Incremental repair caveats

It doesn’t like concurrency very much


slide-73
SLIDE 73

Incremental repair caveats

Validator.java:261 - Failed creating a merkle tree for [repair #e4c782d0-11fc-11e6-b616-51a3849870bb on table_v2/table_attributes, [(8835460833482333317,8838777311566358575], (-7300486781514672850,-7298192396576668423], (-959298474675167225,-959177964106074209]]], /10.10.10.33 (see log for details)

slide-74
SLIDE 74

Incremental repair caveats

CompactionManager.java:1320 - Cannot start multiple repair sessions over the same sstables


slide-75
SLIDE 75

Incremental repair caveats

CASSANDRA-8316: a running anticompaction prevents validation compaction

slide-76
SLIDE 76

Incremental repair caveats

Do not use -pr with incremental repair

  • Useless: data is repaired once only anyway
  • Misleading: anticompaction partially disabled

slide-79
SLIDE 79

Incremental repair bugs

CASSANDRA-11696

Fixed in 2.1.15, 2.2.7, 3.0.8, 3.8
 Incremental repairs can mark too many ranges as repaired


slide-80
SLIDE 80

Incremental repair bugs

CASSANDRA-13153


Fixed in 2.2.10, 3.0.13, 3.11.0, 4.0 
 Reappearing Data when Mixing Incremental and Full Repairs 



slide-81
SLIDE 81

Incremental repair bugs

CASSANDRA-9143


Fix planned for 4.0
 SSTables marked as repaired on some nodes only
 Because: a node can fail during anticompaction
 Or: SSTables can get compacted during repair

slide-82
SLIDE 82

Incremental repair bugs

CASSANDRA-10446


Fix planned for 4.0 
 Spotted by Paulo Motta in the comments : SSTables are streamed with a repairedAt value.


slide-83
SLIDE 83

Incremental repair will not…

Fix a poor repair strategy


slide-84
SLIDE 84

Incremental repair will not…

Prevent you from having to run full repair


slide-85
SLIDE 85

Reaper does support incremental repair

github.com/thelastpickle


slide-86
SLIDE 86

Reaper and incremental repair

  • No subrange repair
  • Single repair thread => no concurrency

slide-88
SLIDE 88

  • What and why?
  • Full repair
  • Incremental repair
  • How to make it work

slide-89
SLIDE 89

Repair best practices

Put your repair strategy in place on day 1


slide-90
SLIDE 90

Repair best practices

Use appropriate tooling or build your own


slide-91
SLIDE 91

Repair best practices

Spread repair over a gc_grace_seconds cycle
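
One way to picture spreading repair over a gc_grace_seconds cycle: give each node its own start offset so that nodes are repaired one at a time and the whole cycle completes in time. The `repair_schedule` helper and the 0.8 safety margin are arbitrary illustrative choices, not a recommendation from the talk.

```python
from datetime import timedelta


def repair_schedule(nodes, gc_grace_seconds, safety_margin=0.8):
    """Start offset for each node's repair, one node at a time.

    safety_margin is an assumed buffer so that a full cycle finishes
    before gc_grace_seconds elapses.
    """
    cycle = gc_grace_seconds * safety_margin
    slot = cycle / len(nodes)
    return {node: timedelta(seconds=round(i * slot)) for i, node in enumerate(nodes)}


# Hypothetical 4-node cluster with the default gc_grace_seconds of 10 days.
schedule = repair_schedule(["n1", "n2", "n3", "n4"], gc_grace_seconds=864000)
for node, offset in schedule.items():
    print(node, offset)
```

With these numbers each node gets a two-day slot, and the last repair starts on day 6 of the 10-day window.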


slide-92
SLIDE 92

Repair best practices

Adjust repair pressure on your cluster (Reaper does that)

slide-93
SLIDE 93

Repair best practices

Don’t repair everything!
Pick tables with deletes and those with critical data

slide-94
SLIDE 94

Repair best practices

If all data is critical, then none is ;)

slide-95
SLIDE 95

Repair best practices

Be tight on your schedule with inc repair
Tombstones and anticompaction

slide-96
SLIDE 96

Repair best practices

Avoid concurrency with inc repair
One node at a time

slide-97
SLIDE 97

Repair best practices

Wait for 4.0.x before moving to incremental repair…?


slide-98
SLIDE 98

Thanks!

@alexanderdeja