Xiaowei Wang, Jingxin Feng - Mar 7th, 2011

SLIDE 1

Xiaowei Wang, Jingxin Feng
Mar 7th, 2011

SLIDE 2

Overview

  • Background
  • Data Model
  • API
  • Architecture
  • Users
  • Linear scalability
  • Replication and Consistency
  • Tradeoff
SLIDE 3

Background

  • Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store.
  • Cassandra was open sourced by Facebook in 2008, and it was designed to fulfill the storage needs of the Inbox Search problem. It is in production use at Facebook but is still under heavy development.

SLIDE 4

Background

  • Cassandra is Dynamo and Bigtable’s lovechild.

[Diagram: Dynamo contributes the distributed systems technology, Bigtable contributes the data model, and Cassandra combines both.]

  • Like Dynamo, Cassandra is eventually consistent; like Bigtable, Cassandra provides a ColumnFamily-based data model.

SLIDE 5

Data Model

  • Basic concepts:

– Cluster: the machines (nodes) in a logical Cassandra instance. A cluster can contain multiple keyspaces.
– Keyspace: a namespace for ColumnFamilies, typically one per application.
– ColumnFamilies: contain multiple columns, each of which has a name, a value, and a timestamp, and which are referenced by row keys.
– SuperColumns: can be thought of as columns that themselves have subcolumns.

SLIDE 6

Data Model

  • Columns

– The column is the lowest/smallest increment of data. It is a tuple (triplet) that contains a name, a value, and a timestamp.
– Example in Java:

SLIDE 7

Data Model

  • Super Column

– A container for one or more columns.

SLIDE 8

Data Model

  • Column Families (CF)

– A container for columns, analogous to a table in a relational database.
– The ColumnFamily has a name and a map with a key and a value (the value is itself a map containing columns).
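
The "map of maps" description can be made concrete with a small sketch. This is not Cassandra code: the `Column` class, the `users` column family, and all row keys and values are made-up illustrations of the shape being described.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Column:
    """A column: a (name, value, timestamp) triplet."""
    name: str
    value: str
    timestamp: int


# A column family maps row key -> {column name -> Column}.
ColumnFamily = Dict[str, Dict[str, Column]]

# Hypothetical sample data, invented for this sketch.
users: ColumnFamily = {
    "row-1": {
        "email": Column("email", "a@example.com", 1),
        "city": Column("city", "Boston", 1),
    },
    "row-2": {
        "email": Column("email", "b@example.com", 2),
    },
}

# Rows in the same column family need not hold the same set of columns.
print(sorted(users["row-1"]))          # ['city', 'email']
print(users["row-2"]["email"].value)   # b@example.com
```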

SLIDE 9

Data Model

  • Column Families (CF)
SLIDE 10

Data Model

SLIDE 11

Data Model

  • SuperColumnFamily

– The largest container. Instead of having Columns in the innermost map, we have SuperColumns, so it just adds an extra dimension.
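
The extra dimension can be pictured as one more level of nesting. The sketch below uses plain dictionaries; the `address_book` data and the `get_column` helper are hypothetical names chosen only for illustration.

```python
from typing import Dict, Tuple

# (value, timestamp) stands in for a column's payload; the column name is the dict key.
ColumnValue = Tuple[str, int]

# row key -> super column name -> column name -> (value, timestamp)
SuperColumnFamily = Dict[str, Dict[str, Dict[str, ColumnValue]]]

# Hypothetical sample data: one row whose super columns group related columns.
address_book: SuperColumnFamily = {
    "alice": {
        "home": {"city": ("Boston", 1), "zip": ("02115", 1)},
        "work": {"city": ("Cambridge", 2)},
    },
}


def get_column(scf: SuperColumnFamily, row: str, super_col: str, col: str) -> str:
    """Look up one value; three keys are needed instead of two."""
    value, _timestamp = scf[row][super_col][col]
    return value


print(get_column(address_book, "alice", "work", "city"))  # Cambridge
```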

SLIDE 12

Data Model

  • Keyspaces

– The container for column families. From an RDBMS point of view you can compare this to the schema; normally you have one per application.

SLIDE 13

API

  • The Cassandra API consists of the following three methods:

– insert(table, key, rowMutation)
– get(table, key, columnName)
– delete(table, key, columnName)

columnName can refer to a specific column within a column family, a column family, a super column family, or a column within a super column.
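
To show how the three calls fit together, here is a toy in-memory stand-in. It is not the Cassandra client API: `rowMutation` is modeled as a plain `{column name: value}` dictionary, `columnName` is restricted to a single plain column, and the counter-based timestamps are an assumption made for the sketch.

```python
import itertools
from typing import Dict, Optional, Tuple

_clock = itertools.count(1)  # stand-in for real timestamps

# table -> row key -> column name -> (value, timestamp)
_store: Dict[str, Dict[str, Dict[str, Tuple[str, int]]]] = {}


def insert(table: str, key: str, row_mutation: Dict[str, str]) -> None:
    """Apply a row mutation, modeled here as {column name: new value}."""
    row = _store.setdefault(table, {}).setdefault(key, {})
    ts = next(_clock)
    for column_name, value in row_mutation.items():
        row[column_name] = (value, ts)


def get(table: str, key: str, column_name: str) -> Optional[str]:
    """Return the column's value, or None if it is absent."""
    cell = _store.get(table, {}).get(key, {}).get(column_name)
    return cell[0] if cell else None


def delete(table: str, key: str, column_name: str) -> None:
    """Remove a column if present (a real node would write a tombstone instead)."""
    _store.get(table, {}).get(key, {}).pop(column_name, None)


insert("Inbox", "user42", {"subject": "hello", "folder": "inbox"})
print(get("Inbox", "user42", "subject"))  # hello
delete("Inbox", "user42", "subject")
print(get("Inbox", "user42", "subject"))  # None
```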

SLIDE 14

API

  • Thrift

– Cassandra driver-level interface that the clients below build on. NOT recommended…

  • High-level clients:

– Python (Telephus, Pycassa…)
– Java (Hector, Pelops…)
– .NET (FluentCassandra, Aquiles…)
– PHP (phpcassa, SimpleCassie…)
– Others…

SLIDE 15

Architecture

  • Architecture layers

– Core Layer: Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication
– Middle Layer: Commit log, Memtable, SSTable, Indexes, Compaction
– Top Layer: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools

SLIDE 16

Architecture

  • Write Path

– First write to a disk commit log (sequential).
– After the write to the log, it is sent to the appropriate nodes.
– Each node receiving the write first records it in a local log, then updates its memtables.
– Memtables are flushed to disk when:

  • Out of space
  • Too many keys (128 is default)
  • Time duration (client provided)
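
The order of operations on a single node can be sketched as follows. This is a simplified model, not Cassandra's implementation: the `Node` class is hypothetical, only the key-count flush trigger is modeled (space and time triggers are left out), and the commit log is a Python list that is cleared on flush.

```python
from typing import Dict, List, Tuple


class Node:
    """Toy single-node write path: commit log first, then memtable, then flush."""

    def __init__(self, max_keys: int = 128) -> None:
        self.max_keys = max_keys                         # flush threshold (keys)
        self.commit_log: List[Tuple[str, str]] = []      # stands in for the on-disk log
        self.memtable: Dict[str, str] = {}
        self.sstables: List[List[Tuple[str, str]]] = []  # each flush = one sorted run

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. sequential append
        self.memtable[key] = value            # 2. in-memory update
        if len(self.memtable) >= self.max_keys:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a sorted run and start a fresh one."""
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # simplification: logged writes are now on "disk"


node = Node(max_keys=3)
for k in ["b", "a", "c", "d"]:
    node.write(k, k.upper())
print(node.sstables)  # [[('a', 'A'), ('b', 'B'), ('c', 'C')]]
print(node.memtable)  # {'d': 'D'}
```

Note that the write never reads existing data: it appends to the log and updates memory, which is why the write path involves no seeks.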
SLIDE 17

Architecture

  • When memtables are written out, two files go out:

– Data File (SSTable)
– Index File (SSTable Index)

  • When a commit log has had all its column families pushed to disk, it is deleted.
  • Compaction: data files accumulate over time. Periodically, data files are merged and sorted into a new file (and a new index is created).
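
Compaction is essentially a merge of sorted files. The sketch below assumes each data file is a list of `(key, timestamp, value)` entries sorted by key, and that the entry with the newest timestamp wins when a key appears in several files; the `compact` function and the sample data are illustrative, not Cassandra's actual file format.

```python
import heapq
from typing import Iterable, List, Tuple

Entry = Tuple[str, int, str]  # (key, timestamp, value), sorted by key within a file


def compact(sstables: Iterable[List[Entry]]) -> List[Entry]:
    """Merge sorted data files into one sorted file, keeping the newest value per key."""
    merged: List[Entry] = []
    for key, ts, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            # Same key seen again: heapq.merge yields it in timestamp order,
            # so the later entry is newer and replaces the earlier one.
            merged[-1] = (key, ts, value)
        else:
            merged.append((key, ts, value))
    return merged


old = [("a", 1, "x"), ("c", 1, "y")]
new = [("a", 2, "z"), ("b", 2, "w")]
data_file = compact([old, new])
# A new index is built for the new file: key -> position in the data file.
index = {key: pos for pos, (key, _, _) in enumerate(data_file)}
print(data_file)  # [('a', 2, 'z'), ('b', 2, 'w'), ('c', 1, 'y')]
```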

SLIDE 18

Architecture

  • Write properties:

– No reads
– No seeks
– Fast
– Atomic within ColumnFamily
– Always writable

  • Read properties:

– Read multiple SSTables
– Slower than writes (but still fast)
– Seeks can be mitigated with more RAM
– Scales to billions of rows
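
Why reads cost more than writes can be seen in a small sketch: a read may have to consult the memtable and every SSTable that could hold the key, then pick the newest version. The `read` function and the dictionary-based tables are assumptions for illustration; real SSTables are on-disk files with indexes.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, str]  # (timestamp, value)


def read(key: str, memtable: Dict[str, Cell], sstables: List[Dict[str, Cell]]) -> Optional[str]:
    """Collect every version of the key and return the one with the newest timestamp."""
    candidates: List[Cell] = []
    if key in memtable:
        candidates.append(memtable[key])
    for table in sstables:          # one lookup (potentially one disk seek) per file
        if key in table:
            candidates.append(table[key])
    if not candidates:
        return None
    return max(candidates)[1]       # tuples compare by timestamp first


memtable = {"k1": (5, "newest")}
sstables = [{"k1": (1, "old"), "k2": (2, "only-on-disk")}, {"k1": (3, "newer")}]
print(read("k1", memtable, sstables))  # newest
print(read("k2", memtable, sstables))  # only-on-disk
print(read("k3", memtable, sstables))  # None
```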

SLIDE 19

Users

  • Facebook

– Uses Cassandra to power Inbox Search, with over 200 nodes deployed. Abandoned in late 2010.

  • Twitter

– But not for tweets.

  • IBM

– Research in building a scalable email system based on Cassandra.

  • Cisco’s WebEx

– Uses Cassandra to store user feed and activity in near real time.

SLIDE 20

Next Topics

1. Linear scalability
2. Replication and Consistency
3. Tradeoff

SLIDE 21

Linear Scalability

[Figure: nodes N1, N2, N3, Nx and a Key]

SLIDE 22

Bootstrap

[Figure: nodes N1, N2, N3, N4]

SLIDE 23

Consistent Hashing

Cause a problem…

[Figure: nodes N1, N2, N3, Nx and a Key]
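
The slide's figure is not recoverable here, so the following is a generic consistent-hashing sketch rather than a reconstruction of it. It assumes nodes and keys are hashed onto the same ring and that a key belongs to the first node at or after its position, wrapping around. The `Ring` class, the `token` helper, and the use of MD5 as the hash are choices made for this sketch.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(name: str) -> int:
    """Map a string onto the ring; MD5 is used here only as a convenient stable hash."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes: List[str]) -> None:
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self._ring, (token(node), node))

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's token, wrapping around."""
        tokens = [t for t, _ in self._ring]
        i = bisect.bisect_left(tokens, token(key))
        return self._ring[i % len(self._ring)][1]


ring = Ring(["N1", "N2", "N3"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.add("N4")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
# Every key that moved now belongs to the new node; nothing shuffles between old nodes.
print(all(after[k] == "N4" for k in moved))  # True
```

With only a handful of nodes, hashed positions can leave some nodes with much larger ranges than others, which is one reason load balancing is treated as a separate concern.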

SLIDE 24

Load Balance

[Figure: nodes N1, N2, N3, N4]

SLIDE 25

Replication and Consistency

Replication
Tunable Eventual Consistency

SLIDE 26

Replication (Simple Case)

[Figure: nodes N1, N2, N3, N4 and a Key]
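
The figure itself is lost, so this sketch assumes the simple case means rack-unaware placement: the key's owner holds the first replica and the next nodes clockwise on the ring hold the rest. The ring tokens and the `replicas` function are made up for illustration.

```python
import bisect
from typing import List, Tuple

# Hypothetical ring: (token, node) pairs sorted by token. Tokens are made up.
RING: List[Tuple[int, str]] = [(0, "N1"), (25, "N2"), (50, "N3"), (75, "N4")]


def replicas(key_token: int, n: int, ring: List[Tuple[int, str]] = RING) -> List[str]:
    """Return the n nodes that hold the key: its owner plus the next n-1 nodes clockwise."""
    if n > len(ring):
        raise ValueError("replication factor exceeds cluster size")
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, key_token)
    return [ring[(start + i) % len(ring)][1] for i in range(n)]


print(replicas(30, 3))  # ['N3', 'N4', 'N1']
print(replicas(80, 2))  # ['N1', 'N2']
```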

SLIDE 27

Tunable Consistency

Write (W)                                        Read (R)
Level    Description                             Level    Description
ZERO     Cross fingers                           N/A
ANY      1st response (including HH)             N/A
ONE      1st response                            ONE      1st response
QUORUM   N/2 + 1 replicas                        QUORUM   N/2 + 1 replicas
ALL      All replicas                            ALL      All replicas
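
The levels in the table translate into a number of replica responses to wait for. The helper below is a sketch of that mapping, assuming N is the replication factor and N/2 is integer division; `required_acks` is a hypothetical name, not a Cassandra function.

```python
def required_acks(level: str, n: int) -> int:
    """Number of replica responses to wait for at a given consistency level.

    n is the replication factor. ZERO and ANY are write-only levels in the table;
    ANY is treated here as one response, which may be a hinted handoff.
    """
    level = level.upper()
    if level == "ZERO":
        return 0
    if level in ("ANY", "ONE"):
        return 1
    if level == "QUORUM":
        return n // 2 + 1      # integer division: N/2 + 1
    if level == "ALL":
        return n
    raise ValueError(f"unknown consistency level: {level}")


for n in (3, 4, 5):
    print(n, required_acks("QUORUM", n))
# 3 2
# 4 3
# 5 3
```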

SLIDE 28

A Quorum Level Example (1)

N=3

[Figure: write operation against replicas N1, N2, N3]

SLIDE 29

A Quorum Level Example (2)

N=3

[Figure: read operation against replicas N1, N2, N3]
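
The figures for this example are not preserved, but the standard argument behind quorum reads and writes can be checked by brute force for N=3: when R + W > N, every set of replicas that served a read shares at least one replica with every set that acknowledged a write. The enumeration below is a sketch of that argument, not a reconstruction of the slides.

```python
from itertools import combinations

N = 3
replicas = ["N1", "N2", "N3"]
quorum = N // 2 + 1  # 2 when N = 3

# Every possible set of replicas that could acknowledge a QUORUM write or serve a QUORUM read.
write_sets = [set(c) for c in combinations(replicas, quorum)]
read_sets = [set(c) for c in combinations(replicas, quorum)]

# R + W > N, so any read set shares at least one replica with any write set.
always_overlap = all(w & r for w in write_sets for r in read_sets)
print(quorum + quorum > N, always_overlap)  # True True

# With ONE/ONE (R + W = 2 <= N) a read can miss the replica that took the write.
one_sets = [set(c) for c in combinations(replicas, 1)]
print(all(w & r for w in one_sets for r in one_sets))  # False
```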

SLIDE 30

A Quorum Level Example (3)

  • But…
SLIDE 31

Final Question about Cassandra

Why are writes/reads fast?

(1) No read/write locks
(2) All write operations are organized into a sequential write, which can maximize the disk's throughput
(3) Flexible data model

SLIDE 32

Similarity with Dynamo and Bigtable

Dynamo-like features

  • a. Symmetric, P2P architecture

No special nodes, no SPOF (Single Point Of Failure)

  • b. Gossip-based cluster management
  • c. Distributed hash table (DHT) for data placement
  • d. Tunable and eventual consistency

Bigtable-like features

  • a. Data model
  • b. SSTable disk storage

Append-only commit log
MemTable (buffer & sort)
Immutable SSTable files

  • c. Hadoop integration (some ideas based on GFS)
SLIDE 33