
Communication in Distributed Systems

Part 2

Università degli Studi di Roma “Tor Vergata”, Dipartimento di Ingegneria Civile e Ingegneria Informatica

Course: Distributed Systems and Cloud Computing (Sistemi Distribuiti e Cloud Computing), academic year 2018/19, Valeria Cardellini

Message-oriented communication

RPC improves distribution transparency
But it is not always a suitable mechanism to support communication in a distributed system (DS)

– E.g., when we cannot be sure that the receiver is running

Alternative: message-oriented communication

– Transient
  Berkeley sockets: already studied in other courses
  Message Passing Interface (MPI)
– Persistent
  Message Oriented Middleware (MOM)

Valeria Cardellini – SDCC 2018/19 1

Message Passing Interface (MPI)

Library for exchanging messages among processes running on different nodes

– Specification of the interface only (http://www.mpi-forum.org/)
– Several implementations, including Open MPI (http://www.open-mpi.org/) and MPICH (http://www.mcs.anl.gov/research/projects/mpich2/)
– De facto standard for communication among the nodes of a system that runs a parallel program developed for a distributed-memory architecture

MPI defines a set of primitives for inter-process communication; in particular:

– Point-to-point communication primitives: to send and receive a message between two different processes
– Collective communication primitives

Point-to-point communication in MPI

Main primitives for point-to-point communication:

– MPI_Send and MPI_Recv: blocking communication
  MPI_Send is synchronous or buffered, depending on the implementation
– MPI_Bsend: buffered blocking send
– MPI_Ssend: synchronous blocking send
– MPI_Isend and MPI_Irecv: nonblocking communication

MPI primitive   Meaning
MPI_Bsend       Appends the outgoing message to a send buffer
MPI_Send        Sends the message and waits until it has been copied to a local or remote buffer
MPI_Ssend       Sends the message and waits until reception starts
MPI_Isend       Passes a reference to the outgoing message and continues
MPI_Recv        Receives a message; blocks if there is none

Example of communication in MPI

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int myrank;
    char message[20];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("My rank is: %d\n", myrank);

    if (myrank == 0) {
        /* Send a message to process 1 (tag 99) */
        strcpy(message, "TEST");
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        printf("%d) Sent: '%s'\n", myrank, message);
    } else if (myrank == 1) {
        /* Receive the message from process 0 (tag 99) */
        MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("%d) Received: '%s'\n", myrank, message);
    }

    MPI_Finalize();
    return 0;
}

Signatures of the two primitives used in the example:

MPI_Send(buf, count, datatype, dest, tag, comm)
MPI_Recv(buf, count, datatype, source, tag, comm, status)

Message-oriented middleware

Communication middleware that supports sending and receiving messages in a persistent way

Loose coupling among system/application components

– Decoupling in time and space
– Can also support synchronization decoupling

Two patterns:

– Message queue
– Publish-subscribe (pub/sub)

And two related types of systems:

– Message queue system (MQS)
– Pub/sub system

Message queue pattern

Messages are put into a queue
Multiple consumers can read from the queue
Each message is delivered to only one consumer
Principles

– Loose coupling
– Service statelessness
  Services minimize resource consumption by deferring the management of state information when necessary

Apps:

– Task scheduling, load balancing, collaboration

[Figure: A sends a message to B; B issues a response message back to A]

Message queue API

Basic interface to a queue in an MQS:

– put: nonblocking send
  Append a message to a specified queue
– get: blocking receive
  Block until the specified queue is nonempty and remove the first message
  Variations: allow searching for a specific message in the queue, e.g., using a matching pattern
– poll: nonblocking receive
  Check a specified queue for messages and remove the first one
  Never block
– notify: nonblocking receive
  Install a handler (callback function) to be automatically called when a message is put into the specified queue
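The four calls can be sketched on top of Python's standard `queue.Queue`. This is only an illustration under assumptions: the class name `MessageQueue` is hypothetical, the queue lives in a single process, and once a handler is installed with `notify` new messages are handed to it instead of being stored.

```python
import queue


class MessageQueue:
    """Toy in-memory queue exposing put/get/poll/notify (hypothetical names)."""

    def __init__(self):
        self._q = queue.Queue()
        self._handler = None

    def put(self, msg):
        """Nonblocking send: pass the message to the handler if one is installed, else append it."""
        if self._handler is not None:
            self._handler(msg)
        else:
            self._q.put_nowait(msg)

    def get(self, timeout=None):
        """Blocking receive: wait until the queue is nonempty, then remove the first message."""
        return self._q.get(block=True, timeout=timeout)

    def poll(self):
        """Nonblocking receive: return the first message, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def notify(self, handler):
        """Install a callback that is called for every message put from now on."""
        self._handler = handler


mq = MessageQueue()
empty_poll = mq.poll()            # empty queue: poll returns at once
mq.put("a")
mq.put("b")
first = mq.get()                  # returns immediately, the queue is nonempty
second = mq.poll()
received = []
mq.notify(received.append)        # later messages go to the callback
mq.put("c")
```

A real MQS would also persist the messages, which is what makes the communication persistent rather than transient.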


Publish/subscribe pattern

Application components can publish asynchronous messages (e.g., event notifications), and/or declare their interest in message topics by issuing a subscription

Multiple consumers can subscribe to a topic, with or without filters

Subscriptions are collected by an event dispatcher component, responsible for routing events to all matching subscribers

– For scalability reasons, its implementation can be distributed

High degree of decoupling among components

– Easy to add and remove components
– Appropriate for dynamic environments

A sibling of the message queue pattern, which it further generalizes by delivering a message to multiple consumers

– Message queue: delivers each message to only one receiver, i.e., one-to-one communication
– Pub/sub channel: delivers each message to multiple receivers, i.e., one-to-many communication

Publish/subscribe API

Calls that capture the core of any pub/sub system:

– publish(event): to publish an event
  Events can be of any data type supported by the given implementation languages and may also contain meta-data
– subscribe(filter_expr, notify_cb, expiry) → sub_handle: to subscribe to an event
  Takes a filter expression, a reference to a notify callback for event delivery, and an expiry time for the subscription registration
  Returns a subscription handle
– unsubscribe(sub_handle)
– notify_cb(sub_handle, event): called by the pub/sub system to deliver a matching event
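A minimal single-process sketch of these calls follows. It is not the API of any specific product: the class name `PubSub` is hypothetical, the filter expression is modeled as a Python predicate, and the expiry is assumed to be given in seconds.

```python
import itertools
import time


class PubSub:
    """Toy event dispatcher (illustrative only, not a real library)."""

    def __init__(self, clock=time.monotonic):
        self._subs = {}                 # sub_handle -> (filter_expr, notify_cb, deadline)
        self._ids = itertools.count(1)
        self._clock = clock

    def subscribe(self, filter_expr, notify_cb, expiry):
        """Register a subscription valid for `expiry` seconds and return its handle."""
        handle = next(self._ids)
        self._subs[handle] = (filter_expr, notify_cb, self._clock() + expiry)
        return handle

    def unsubscribe(self, sub_handle):
        self._subs.pop(sub_handle, None)

    def publish(self, event):
        """Deliver the event to every live subscription whose filter matches."""
        now = self._clock()
        for handle, (matches, notify_cb, deadline) in list(self._subs.items()):
            if now >= deadline:
                del self._subs[handle]  # the subscription registration has expired
            elif matches(event):
                notify_cb(handle, event)


bus = PubSub()
log = []
h1 = bus.subscribe(lambda e: e["topic"] == "temp", lambda h, e: log.append((h, e["value"])), expiry=60)
h2 = bus.subscribe(lambda e: True, lambda h, e: log.append((h, e["value"])), expiry=60)
bus.publish({"topic": "temp", "value": 21})       # matches both subscriptions
bus.unsubscribe(h2)
bus.publish({"topic": "temp", "value": 22})       # only h1 is still subscribed
bus.publish({"topic": "humidity", "value": 40})   # filtered out by h1
```

The publisher never names its receivers: the dispatcher alone holds the subscriptions, which is where the decoupling comes from.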


MOM functionalities

MOM handles the complexity of addressing, routing, availability of communicating application components (or applications), and message format transformations

Source: “Cloud Computing Patterns”, http://bit.ly/2hZv6Xs

Let us analyze

– Delivery semantics
– Message routing
– Message transformations

Delivery semantics in MOM

At-least-once delivery

– How can MOM ensure that messages are received successfully?
– By having the receiver send an ack for each retrieved message, and by resending the message if the ack is not received
– Be careful: the app should be tolerant to message duplications

Exactly-once delivery

– How can MOM ensure that a message is delivered exactly once to a receiver?
– By filtering possible message duplicates automatically
– Upon creation, each message is associated with a unique message ID, which is used to filter message duplicates during their traversal from sender to receiver
– Messages must also survive MOM components’ crashes
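The interplay between resending (at-least-once) and duplicate filtering by message ID can be sketched as follows. The lossy channel, the 50% loss rate, and the names `Receiver` and `send_at_least_once` are assumptions made for the illustration; the set of seen IDs is kept in memory, so this sketch does not survive crashes as a real MOM must.

```python
import random
import uuid


class Receiver:
    """Passes each message ID to the application once, however many copies arrive."""

    def __init__(self):
        self.seen_ids = set()
        self.delivered = []
        self.copies_received = 0

    def on_message(self, msg_id, payload):
        self.copies_received += 1
        if msg_id not in self.seen_ids:       # filter duplicates by message ID
            self.seen_ids.add(msg_id)
            self.delivered.append(payload)
        # returning normally stands for sending the ack (duplicates are acked too)


def send_at_least_once(receiver, payload, rng, loss=0.5, max_attempts=1000):
    """Resend until an ack arrives; either the message or its ack may be lost."""
    msg_id = str(uuid.uuid4())                # unique ID assigned at creation
    for attempt in range(1, max_attempts + 1):
        if rng.random() < loss:
            continue                          # message lost: nothing reaches the receiver
        receiver.on_message(msg_id, payload)
        if rng.random() < loss:
            continue                          # ack lost: the retry will be a duplicate
        return attempt
    raise RuntimeError("no ack received")


rng = random.Random(42)
rx = Receiver()
attempts = [send_at_least_once(rx, f"order-{i}", rng) for i in range(20)]
```

Without the `seen_ids` check the application would see a message again every time an ack is lost, which is exactly the duplication that at-least-once delivery asks the app to tolerate.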

Transaction-based delivery

– How can MOM ensure that messages are only deleted from a message queue if they have been received successfully?
– MOM and the receiver participate in a transaction: all operations involved in the reception of a message are performed under one transactional context guaranteeing ACID behavior

Timeout-based delivery

– How can MOM ensure that messages are only deleted from a message queue if they have been received successfully at least once?
– Messages are not deleted immediately from the queue, but marked as being invisible
– An invisible message cannot be read by another client
– After the client acks message receipt, the message is deleted from the queue
– If no ack arrives before the timeout expires, the message becomes visible again
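A rough sketch of timeout-based delivery, assuming a single in-memory queue and a clock value passed in explicitly so that the behavior is deterministic. The class `TimeoutQueue` and its method names are hypothetical.

```python
import itertools


class TimeoutQueue:
    """Toy queue with a visibility timeout; `now` is supplied by the caller."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}              # receipt id -> [body, invisible_until]
        self._ids = itertools.count(1)

    def send(self, body):
        self._messages[next(self._ids)] = [body, 0.0]

    def receive(self, now):
        """Return (receipt, body) for the first visible message and hide it; None if none is visible."""
        for receipt, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return receipt, entry[0]
        return None

    def ack(self, receipt):
        """Delete the message only after the client confirms it was processed."""
        self._messages.pop(receipt, None)


q = TimeoutQueue(visibility_timeout=30)
q.send("resize photo 1")
r1 = q.receive(now=0)        # client A gets the message, which turns invisible
hidden = q.receive(now=10)   # client B sees nothing while A is working
r2 = q.receive(now=31)       # A never acked: the message is visible again
q.ack(r2[0])                 # B acks, so the message is finally deleted
empty = q.receive(now=100)
```

Since a slow client and a crashed client look the same to the queue, the same message can be handed out twice: this scheme gives at-least-once, not exactly-once, delivery.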


Message routing: general model

Queues are managed by queue managers (QMs)

– An application can put messages only into a local queue
– Getting a message is possible by extracting it from a local queue only

QMs need to route messages

– Function as message-queuing “relays” that interact with distributed applications and with each other
– Support the idea of an overlay network
– There are also special queue managers that operate as routers

Message routing: overlay network

An overlay network is used to route messages

– By using routing tables
– Routing tables are stored and managed by QMs

The overlay network needs to be maintained over time

– Routing tables are usually set up and managed manually
– Dynamic overlay networks require dynamically managing the mapping between queue names and their location
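Routing over an overlay of queue managers can be illustrated with a small sketch. The names (`QueueManager`, `QMA`, `R1`, `QMB`) and the three-node topology are hypothetical, and the routing tables are filled in by hand, mirroring the manual setup mentioned above.

```python
class QueueManager:
    """Toy queue manager: local queues plus a routing table of next hops."""

    def __init__(self, name):
        self.name = name
        self.local_queues = {}     # queue name -> list of messages
        self.routes = {}           # destination QM name -> neighbouring QM object

    def put(self, dest_qm, queue_name, msg, path=None):
        """Deliver locally if this QM is the destination, otherwise relay to the next hop."""
        path = (path or []) + [self.name]
        if dest_qm == self.name:
            self.local_queues.setdefault(queue_name, []).append(msg)
            return path
        next_hop = self.routes[dest_qm]          # a KeyError means no route was configured
        return next_hop.put(dest_qm, queue_name, msg, path)


qa, r1, qb = QueueManager("QMA"), QueueManager("R1"), QueueManager("QMB")
qa.routes = {"QMB": r1, "R1": r1}    # manually configured routing tables
r1.routes = {"QMA": qa, "QMB": qb}
qb.routes = {"QMA": r1, "R1": r1}

path = qa.put("QMB", "orders", "msg-1")   # relayed through the router R1
```

Adding or moving a queue manager means touching the tables of its neighbours, which is why maintaining the overlay by hand becomes costly as it grows.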

Message transformation: message broker

New/existing apps that need to be integrated into a single, coherent system rarely agree on a common data format

How to handle data heterogeneity?

– We have already examined different solutions in the context of RPCs
– Now let’s focus on the message broker

Message broker: component that usually takes care of application heterogeneity in a MOM

Message broker: general architecture

Message broker handles application heterogeneity

– Converts incoming messages to the target format, providing access transparency
– Very often acts as an application gateway
– Manages a repository of conversion rules and programs to transform a message of one type to another
– May provide subject-based routing capabilities
– To be scalable and reliable, it can be implemented in a distributed way
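The repository of conversion rules can be pictured as a table keyed by (source format, target format). The sketch below is hypothetical throughout: the class `MessageBroker`, the format labels, and the CSV-to-JSON rule are invented for the illustration.

```python
import json


class MessageBroker:
    """Toy broker holding conversion rules keyed by (source, target) format."""

    def __init__(self):
        self._rules = {}

    def add_rule(self, src_format, dst_format, convert):
        self._rules[(src_format, dst_format)] = convert

    def transform(self, payload, src_format, dst_format):
        """Convert the payload, or fail loudly when no rule is registered."""
        if src_format == dst_format:
            return payload
        try:
            convert = self._rules[(src_format, dst_format)]
        except KeyError:
            raise ValueError(f"no conversion rule from {src_format} to {dst_format}") from None
        return convert(payload)


def csv_to_json(line):
    """Hypothetical rule: an 'id,amount' CSV record becomes a JSON object."""
    order_id, amount = line.split(",")
    return json.dumps({"id": int(order_id), "amount": float(amount)})


broker = MessageBroker()
broker.add_rule("csv", "json", csv_to_json)
converted = broker.transform("17,9.5", "csv", "json")
```

Neither application needs to know the other's format: each one only agrees on a format with the broker.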


MOM frameworks

Examples of MOM middleware

– IBM MQ
– Microsoft Message Queuing (MSMQ)
– Java Message Service (JMS): MOM API for Java
– Open MQ
– RabbitMQ
– NATS https://nats.io
– Apache ActiveMQ http://activemq.apache.org
– Apache Kafka

Also Cloud-based products

– Amazon Simple Queue Service (SQS)
– Google Cloud Pub/Sub

Not always a clear distinction between message queue and pub/sub patterns

– Some frameworks (e.g., RabbitMQ, Kafka, NATS) support both
– Others do not (e.g., Redis is only pub/sub)

Some examples of MOM usage

1. Accept and forward messages which are sent by a producer and received by a consumer
2. Distribute time-consuming tasks among multiple workers
3. Deliver messages to many consumers at once (pub/sub pattern)
4. Receive messages selectively
5. Run a function on a remote node and wait for the result

Source: RabbitMQ tutorial http://bit.ly/2zPPMJO

IBM MQ

https://www.ibm.com/products/mq

The first enterprise messaging technology, from 1993
Basic concepts:

– Application-specific messages are put into and removed from queues
– Queues reside under the regime of a queue manager (QM)
– Processes can put messages only in local queues, or through an RPC mechanism

Message transfer

– Messages are transferred between queues
– Message transfer between process queues requires a channel
– At each endpoint of a channel is a message channel agent (MCA)

MCAs are responsible for:

– Setting up channels
– Sending/receiving messages
– Also message encryption

IBM MQ (2)

Principles of operation:

– Channels are inherently unidirectional
– MCAs are automatically started when messages arrive
– Any network of queue managers can be created
– Routes are set up manually (system administration)
– Routing: by using logical names, in combination with name resolution to local queues, it is possible to route a message to a remote queue

Amazon Simple Queue Service (SQS)

Cloud-based message queue service based on a polling model

– Goal: to decouple the components of cloud applications
– Message queues are hosted within the AWS infrastructure
– Messages are stored in queues for a limited period of time

Application components using SQS can run independently and asynchronously and be developed with different technologies

Provides timeout-based delivery

– Messages are only deleted from a message queue if they have been received properly
– A received message is locked during processing (visibility timeout); if processing fails, the lock expires and the message is available again

Can be combined with Amazon SNS

– To push a message to multiple SQS queues in parallel

Amazon SQS: API

CreateQueue, ListQueues, DeleteQueue

– Create, list, delete queues

SendMessage, ReceiveMessage

– Add/receive messages to/from a specified queue (message size up to 256 KB)

DeleteMessage

– Remove a received message from a specified queue (the component must delete the message after receiving and processing it)

ChangeMessageVisibility

– Change the visibility timeout of a specified message in a queue (when received, the message remains in the queue until it is deleted explicitly by the receiver)

SetQueueAttributes, GetQueueAttributes

– Control queue settings, get information about a queue


Amazon SQS: example

Example of application using SQS: online photo processing service

http://bit.ly/2gwJFBw

Apache Kafka

General-purpose, distributed pub/sub system
Originally developed in 2010 by LinkedIn
Written in Scala (and Java)
Horizontally scalable
High throughput

– Billions of messages

Fault-tolerant

Delivery guarantees

– At least once: guarantees no loss, but allows duplicates, possibly out of order
– From 2017, exactly once: guarantees no loss and no duplicates, but requires expensive end-to-end 2PC

Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, NetDB’11.

Kafka at a glance

Kafka maintains feeds of messages in categories called topics

Producers: publish messages to a Kafka topic
Consumers: subscribe to topics and process the feed of published messages

Kafka cluster: distributed log of data over servers known as brokers

– Brokers rely on Apache ZooKeeper for coordination

Kafka: topics

Topic: a category to which the message is published
For each topic, the Kafka cluster maintains a partitioned log

– Log (data structure!): append-only, totally-ordered sequence of records ordered by time

Topics are split into a pre-defined number of partitions

– Partition: unit of parallelism of the topic

Each partition is replicated with some replication factor

CLI command to create a topic with a single partition and one replica

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Kafka: partitions

Each partition is an ordered, numbered, immutable sequence of records that is continually appended to

– Like a commit log

Each record is associated with a sequential ID number called offset

Partitions are distributed across brokers
Each partition is replicated for fault tolerance

Each partition is replicated across a configurable number of brokers

Each partition has one leader broker and 0 or more followers

The leader handles read and write requests

– Read from leader
– Write to leader

A follower replicates the leader and acts as a backup
Each broker is a leader for some of its partitions and a follower for others, so as to balance the load

ZooKeeper is used to keep the brokers consistent


Kafka: producers

Publish data to topics of their choice
Also responsible for choosing which record to assign to which partition within the topic

– Round-robin or partitioned by keys

Producers = data sources

Run the producer

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

Kafka: consumers

Consumer Group: set of consumers sharing a common group ID

– A Consumer Group maps to a logical subscriber
– Each group consists of multiple consumers for scalability and fault tolerance

Consumers use the offset to track which messages have been consumed

– Messages can be replayed using the offset
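How offsets decouple reading from storage can be shown with a toy model of a single partition. This does not use any Kafka client library: `Partition`, `Consumer`, `poll` and `seek` are hypothetical names for this sketch, and a record's offset is simply its index in a Python list.

```python
class Partition:
    """Append-only log; a record's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1          # offset assigned to the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own position; reading never modifies the log."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.partition.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset                    # rewind to replay, or skip ahead


log = Partition()
offsets = [log.append(m) for m in ["m0", "m1", "m2"]]
c = Consumer(log)
first_batch = c.poll(max_records=2)
second_batch = c.poll()
c.seek(1)                                       # replay from offset 1
replayed = c.poll()
```

Because consuming only advances a counter, several consumers (or groups) can read the same partition at different positions without interfering.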

Run the consumer

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Kafka: ZooKeeper

ZooKeeper: hierarchical, distributed key-value store

– Widely used coordination and synchronization service for large distributed systems
– Often used for leader election (we’ll study Paxos as consensus algorithm)
– Used in Kafka, Mesos, Storm, …

Kafka uses ZooKeeper to coordinate between the producers, consumers and brokers

ZooKeeper stores Kafka metadata

– List of brokers
– List of consumers and their offsets
– List of producers

Kafka: ordering guarantees

Messages sent by a producer to a particular topic partition will be appended in the order they are sent

Consumer sees records in the order they are stored in the log

Strong guarantees about ordering only within a partition

– Total order over messages within a partition, but Kafka cannot preserve order between different partitions in a topic

Per-partition ordering combined with the ability to partition data by key is sufficient for most applications
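Why partitioning by key is usually enough can be seen in a small simulation: all records with the same key land in the same partition, so their relative order is preserved. The hash used here (`zlib.crc32`) is only a deterministic stand-in chosen for the sketch; real client libraries use their own partitioner, and the keys and values are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; crc32 is a stand-in hash chosen for determinism."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)          # partition index -> ordered list of (key, value)
events = [("alice", 1), ("bob", 1), ("alice", 2), ("carol", 1), ("bob", 2), ("alice", 3)]
for key, value in events:
    partitions[partition_for(key)].append((key, value))


def values_seen_for(key):
    """What a consumer of the key's partition observes for that key, in log order."""
    return [v for k, v in partitions[partition_for(key)] if k == key]
```

Records of different keys may interleave differently across partitions, but an application that only needs per-key order (e.g., all events of one user) is unaffected.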


Kafka: fault tolerance

Replicates partitions for fault tolerance
Kafka makes a message available for consumption only after all the in-sync followers acknowledge to the leader a successful write

– Implies that a message may not be immediately available for consumption

Kafka retains messages for a configured period of time

– Messages can be “replayed” in case a consumer fails

Kafka: APIs

Four core APIs

– Producer API: allows an app to publish streams of records to one or more Kafka topics
– Consumer API: allows an app to subscribe to one or more topics and process the stream of records produced to them
– Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems, so as to move large collections of data into and out of Kafka
– Streams API: allows an app to act as a stream processor, transforming an input stream from one or more topics to an output stream to one or more output topics
  Can use Kafka Streams to process data in pipelines consisting of multiple stages

Kafka: client library

JVM internal client
Plus rich ecosystem of clients, among which:

– Sarama: Go library for Kafka
  https://shopify.github.io/sarama/
– Python library for Kafka
  https://github.com/confluentinc/confluent-kafka-python/
– NodeJS client
  https://github.com/Blizzard/node-rdkafka

Apache Kafka within Monasca

Monasca is a monitoring-as-a-service solution integrated with OpenStack

– OpenStack: a set of software tools for building and managing Cloud platforms for public and private clouds

Monasca uses Kafka as its message queue system

Protocols for MOM

Not only systems but also open standard protocols for message queues

– AMQP (Advanced Message Queuing Protocol)
  https://www.amqp.org
  Binary protocol
– MQTT (Message Queue Telemetry Transport)
  http://mqtt.org
  Binary protocol
– STOMP (Simple (or Streaming) Text Oriented Messaging Protocol)
  http://stomp.github.io
  Text-based protocol

Goals:

– Platform- and vendor-agnostic
– Provide interoperability between different MOMs

Messaging protocols and IoT

Often used in Internet of Things (IoT) projects

– Use a message queueing protocol to send data from sensors to services that will process those data
– Exploit all the MOM advantages seen so far:
  Decoupling
  Resiliency: a MOM provides temporary message storage
  Traffic spike handling: data will be persisted in the MOM and processed eventually

AMQP: characteristics

Open-standard protocol for MOM, supported by industry

– Current version: 1.0
  http://docs.oasis-open.org/amqp/core/v1.0/amqp-core-complete-v1.0.pdf
– Approved in 2014 as ISO and IEC International Standard

Binary, application-level protocol

– Based on TCP, with additional reliability mechanisms (at-most-once, at-least-once, exactly-once delivery)

Programmable protocol

– Several entities and routing schemes are primarily defined by apps

Implementations

– Apache ActiveMQ, RabbitMQ, Apache Qpid, …

AMQP: model

The AMQP architecture involves three main actors:

– Publishers, subscribers, and brokers

AMQP entities (within the broker): queues, exchanges and bindings

– Messages are published to exchanges (like post offices or mailboxes)
– Exchanges then distribute message copies to queues using rules called bindings
– Then AMQP brokers either deliver messages to consumers subscribed to queues, or consumers fetch/pull messages from queues on demand

https://bit.ly/2oP683F

AMQP: routing

Exchange types:

– Direct exchange: delivers messages to queues based on the message routing key
– Fanout exchange: delivers messages to all of the queues that are bound to it
– Topic exchange: delivers messages to one or many queues based on topic matching
  Often used to implement various publish/subscribe pattern variations
  Commonly used for the multicast routing of messages
  Example use: distributing data relevant to a specific geographic location (e.g., points of sale)
– Headers exchange: delivers messages based on multiple attributes expressed as headers
  To route on multiple attributes that are more easily expressed as message headers than a routing key
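The direct, fanout and topic exchange types can be mimicked in a few lines. The sketch below is a toy model, not a client for any broker. The wildcard rules are an assumption taken from the RabbitMQ / AMQP 0-9-1 convention rather than from these slides: `*` matches exactly one dot-separated word, `#` matches zero or more words. Queues are plain Python lists and the routing keys are invented.

```python
def topic_matches(pattern, routing_key):
    """'*' matches exactly one dot-separated word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' absorbs zero words, or one word and then tries again
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return p[0] in ("*", k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))


class Exchange:
    """Toy exchange: `kind` is 'direct', 'fanout' or 'topic'; bindings are (binding_key, queue)."""

    def __init__(self, kind):
        self.kind = kind
        self.bindings = []

    def bind(self, queue, binding_key=""):
        self.bindings.append((binding_key, queue))

    def publish(self, routing_key, message):
        for binding_key, queue in self.bindings:
            if (self.kind == "fanout"
                    or (self.kind == "direct" and binding_key == routing_key)
                    or (self.kind == "topic" and topic_matches(binding_key, routing_key))):
                queue.append(message)          # each matching queue gets its own copy


rome, all_sales, audit = [], [], []
sales = Exchange("topic")
sales.bind(rome, "sales.italy.rome")
sales.bind(all_sales, "sales.#")
sales.bind(audit, "*.italy.*")
sales.publish("sales.italy.rome", "receipt-1")
sales.publish("sales.france.paris", "receipt-2")
```

Changing `Exchange("topic")` to `Exchange("fanout")` would copy every message to all three queues regardless of the routing key.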

AMQP: messages

The AMQP protocol defines two types of messages:

– Bare messages, which are supplied by the sender
– Annotated messages, which are seen at the receiver and contain annotations added by intermediaries during transit

The header conveys the delivery parameters

– Including durability requirements, priority, time to live

[Figure: annotated message]

Multicast communication

Multicast communication: communication scheme in which data are sent to multiple receivers

– Broadcast communication: special case of multicast, in which data are sent to all the receivers connected to the network
– Examples of one-to-many multicast applications: audio/video distribution, file distribution
– Examples of many-to-many multicast applications: conferencing services, multiplayer games, interactive distributed simulations

Traditional one-to-one communication does not scale

[Figure: unicast of a video to 1000 users vs. multicast of a video to 1000 users]

Types of multicast

How to realize multicast?

– Network-level multicast
– Application-level multicast

Network-level multicast

Packet replication and routing are managed by routers
IP-level multicast (IPMC) based on groups

– Generalizes UDP with one-to-many transmission
– Group: set of hosts interested in the same multicast application, identified by the same IP address
  An IP address from 224.0.0.0 to 239.255.255.255 is assigned to the group
– IGMP (Internet Group Management Protocol) is used to join the group

Limited use because of:

– Lack of large-scale support (only ~5% of autonomous systems, ASs)
– The problem of keeping track of group membership
– E.g., it is disabled in all Cloud platforms because of the broadcast storm problem (exponential increase of network traffic, possibly leading to saturation)

Application-level multicast

Packet replication and routing are managed by end hosts

Basic idea:

– Organize nodes into an overlay network
– Use the overlay network to disseminate information

Application-level multicast:

– Structured
  Explicit communication paths are created in the overlay network
– Unstructured
  Based on flooding
  Based on gossiping

Structured application-level multicast

How to build the overlay network in a structured way?

– Tree
  A single path between each pair of nodes
– Mesh
  Many paths between each pair of nodes

Structured application-level multicast: tree

Example: building an application-level multicast tree in Scribe

– Scribe: pub/sub system with a decentralized architecture, based on the Pastry DHT

1. The node that starts the multicast session generates the multicast group identifier (mid)
2. It looks up (through Pastry) the node responsible for mid
3. That node becomes the root of the multicast tree
4. If node P wants to join the multicast tree identified by mid, it sends a join request
5. When the join request reaches node Q:
   – Q has never received a join request for mid ⇒ Q becomes a forwarder, P becomes a child of Q, and Q forwards the join request towards the root
   – Q is already a forwarder for mid ⇒ P becomes a child of Q; there is no need to forward the join request to the root

M. Castro et al., “Scribe: A large-scale and decentralised application-level multicast infrastructure”, IEEE JSAC, 2002.
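The join procedure in steps 4 and 5 can be sketched as follows. Pastry routing is not implemented: it is replaced by a hypothetical `next_hop` table that gives, for each node, the next overlay node on the way to the root. Node names and the topology are invented.

```python
def join(node, next_hop, root, forwarders, children):
    """Route a join request from `node` towards `root`, grafting it onto the tree."""
    child = node
    while child != root:
        q = next_hop[child]                     # stand-in for one Pastry routing step
        children.setdefault(q, set()).add(child)
        if q == root or q in forwarders:
            return                              # q is already in the tree: stop here
        forwarders.add(q)                       # q becomes a forwarder and joins in turn
        child = q


# Hypothetical overlay routes towards the node responsible for mid ("root")
next_hop = {"P1": "A", "P2": "A", "P3": "B", "A": "B", "B": "root"}
forwarders, children = set(), {}
for peer in ["P1", "P2", "P3"]:
    join(peer, next_hop, "root", forwarders, children)
```

Only the first join (P1) travels all the way to the root; the joins of P2 and P3 stop at the first forwarder they meet, which keeps the load on the root low.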

[Figure: multicast tree over three successive join() requests; labels: root, join(), forwarder]

Cost metrics of application-level multicast

Link stress: how many times does an application-level multicast message cross the same physical link?

– Example: the message from A to D crosses <Ra,Rb> twice

Stretch: ratio between the transfer time in the overlay network and the transfer time in the underlying network

– Example: messages from B to C follow a path with cost 71 at the application level, but 47 at the network level ⇒ stretch = 71/47 ≈ 1.51
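Both metrics are simple to compute once the mapping from overlay hops to physical links is known. The mapping below is hypothetical (it is not the topology of the slide's figure, which is not reproduced here); only the costs 71 and 47 come from the example above.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical mapping: each overlay hop -> the physical links it traverses
physical_path = {
    ("A", "B"): [("A", "Ra"), ("Ra", "Rb"), ("Rb", "B")],
    ("B", "D"): [("B", "Rb"), ("Rb", "Ra"), ("Ra", "Rd"), ("Rd", "D")],
}


def link_stress(overlay_route):
    """Count how many times each physical link (in either direction) carries the message."""
    stress = Counter()
    for hop in overlay_route:
        for u, v in physical_path[hop]:
            stress[frozenset((u, v))] += 1
    return stress


def stretch(overlay_cost, network_cost):
    """Ratio between the cost along the overlay path and along the network path."""
    return Fraction(overlay_cost, network_cost)


stress = link_stress([("A", "B"), ("B", "D")])   # A reaches D through the overlay node B
s = stretch(71, 47)                               # the B-to-C example
```

A stretch of 1 and a link stress of 1 on every link would mean the overlay is as efficient as network-level multicast; application-level trees normally do worse on both.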


Unstructured application-level multicast

How to realize unstructured application-level multicast?

– Flooding: already examined
  A node P sends the multicast message to all its neighbors
  In turn, each neighbor (if it has not already seen the message) forwards it to all its neighbors (except P)
– Gossiping

Gossip-based protocols

Probabilistic protocols, also known as gossiping or epidemic protocols

– They are based on the theory of how gossip spreads in social networks or how epidemics spread

They allow information to spread rapidly in very large-scale networks through the random choice of the next receivers among those known to the sender

– Each node sends the message to a randomly chosen subset of nodes in the network
– Each node that receives it forwards a copy to another subset, also chosen at random, and so on

Origins

Gossiping protocols were defined in 1987 by Demers et al. in a work on guaranteeing consistency in databases replicated on hundreds of servers

Basic idea: assuming there are no write conflicts (i.e., updates are independent)

– Update operations are initially performed on one or a few replicas
– A replica communicates its updated state to a limited number of neighbors
– Update propagation is lazy (not immediate)
– Eventually, every update should reach all the replicas

A. Demers et al., “Epidemic Algorithms for Replicated Database Maintenance”, Proc. 6th Symp. on Principles of Distributed Computing, 1987.

Why gossiping in large-scale DSs?

Several attractive properties of gossip-based information dissemination for large-scale distributed systems

– Simplicity of gossiping algorithms
– Lack of centralized control and bottlenecks
– Scalability: each peer sends only a limited number of messages, independently of the overall size of the system
– Reliability and robustness: thanks to message redundancy

Where is gossiping used today?

Some examples:

– “Amazon uses a gossip protocol to quickly spread information throughout the S3 system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things.”
  http://amzn.to/1MgDVsl
– Amazon’s Dynamo uses a gossip-based failure detection service
– The basic information exchange in BitTorrent is based on gossip

Propagation models

Let us consider two propagation models

– Pure gossiping and anti-entropy

Pure gossiping (rumor spreading): a peer that has just been updated (infected) contacts another randomly chosen peer and sends it its update (infecting it in turn)

Anti-entropy: periodically, each peer randomly chooses another peer and the two exchange updates, eventually reaching a similar state on both

Pure gossiping

A peer P that has just been updated contacts a randomly chosen peer Q

If Q has already received the update (it is already infected), P loses interest in spreading the gossip and, with probability 1/k, stops contacting other peers

If s is the fraction of peers not yet updated, it can be shown that $s = e^{-(k+1)(1-s)}$

As k grows, the probability that the update spreads increases

To guarantee that a large number of peers is updated, pure gossiping must be combined with anti-entropy
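The equation for s has no closed-form solution, but it is easy to solve numerically. The sketch below uses plain fixed-point iteration starting from s = 0, which is one possible method chosen for this illustration; it converges to the smallest solution (s = 1 always satisfies the equation too, but corresponds to the update never spreading).

```python
import math


def residual_fraction(k, iterations=200):
    """Smallest solution of s = exp(-(k+1)(1-s)), by fixed-point iteration from s = 0."""
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1.0 - s))
    return s


table = {k: residual_fraction(k) for k in (1, 2, 3, 4, 5)}
```

For k = 1 roughly 20% of the peers never receive the update, while for k = 4 the fraction drops below 0.7%. It never reaches zero, which is why pure gossiping alone cannot guarantee full coverage.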

Anti-entropy

Goal: to increase the similarity among peers, thus increasing “order” (hence the name!)

A peer P randomly chooses another peer Q in the system; how does it update it?

Three update strategies:

– push: P only sends its updates to Q
– pull: P only retrieves updates from Q
– push-pull: P and Q exchange their updates with each other (after which they hold the same information)

[Figure: three panels, each labeled “choice” and “data”]

Anti-entropy: performance

Push-pull

– It is the fastest strategy
– It takes O(log N) rounds to propagate an update to the N peers of the system
  Gossip round (or cycle): time interval in which every peer has taken the initiative to exchange updates at least once
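The logarithmic growth can be observed in a small simulation. The model is a simplification assumed for this sketch: rounds are synchronous, every peer contacts one uniformly chosen other peer per round, there is a single update starting at peer 0, and exchanges within a round see the state at the start of that round.

```python
import math
import random


def push_pull_rounds(n, rng):
    """Rounds until all n peers hold an update that starts at peer 0 (push-pull)."""
    updated = [False] * n
    updated[0] = True
    rounds = 0
    while not all(updated):
        rounds += 1
        snapshot = updated[:]                    # state at the start of the round
        for p in range(n):
            q = rng.randrange(n - 1)
            if q >= p:
                q += 1                           # uniform choice among the other n-1 peers
            if snapshot[p] or snapshot[q]:       # push-pull: afterwards both hold the update
                updated[p] = updated[q] = True
    return rounds


rng = random.Random(1)
results = {n: push_pull_rounds(n, rng) for n in (64, 256, 1024)}
```

Going from 64 to 1024 peers multiplies the system size by 16, yet the number of rounds in `results` should grow only by a few units.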


General scheme of a gossiping protocol

Two peers P and Q, where P has chosen Q for the data exchange; P's active thread runs once per round (every Δ time units)

Active thread (peer P):                  Passive thread (peer Q):
(1) selectPeer(&Q);                      (1)
(2) selectToSend(&bufs);                 (2)
(3) sendTo(Q, bufs);          ----->     (3) receiveFromAny(&P, &bufr);
(4)                                      (4) selectToSend(&bufs);
(5) receiveFrom(Q, &bufr);    <-----     (5) sendTo(P, bufs);
(6) selectToKeep(cache, bufr);           (6) selectToKeep(cache, bufr);
(7) processData(cache);                  (7) processData(cache);

What are the crucial aspects?

– Peer selection
– Selection of the data exchanged
– Processing of the data received
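The scheme above can be sketched in a single process, with Q's passive thread inlined into P's active step. Everything specific here is an assumption made for illustration: the cache maps an item to a timestamp, `select_to_send` sends the whole cache, and `select_to_keep` keeps the newest entries up to a fixed capacity. Real protocols differ precisely in how they fill in these three hooks.

```python
import random

class Peer:
    def __init__(self, name, cache=None, capacity=8):
        self.name = name
        self.cache = dict(cache or {})   # item -> timestamp (larger means newer)
        self.capacity = capacity

    def select_to_send(self):
        # Assumption: send the whole cache; real protocols often send a subset.
        return dict(self.cache)

    def select_to_keep(self, received):
        # Merge, keeping the newest timestamp per item, then trim to capacity.
        for item, ts in received.items():
            if ts > self.cache.get(item, float("-inf")):
                self.cache[item] = ts
        newest = sorted(self.cache.items(), key=lambda kv: kv[1], reverse=True)
        self.cache = dict(newest[: self.capacity])

    def process_data(self):
        pass  # application-specific; left empty in this sketch


def gossip_step(p, peers, rng):
    """Active thread of P for one round, with Q's passive thread inlined."""
    q = rng.choice([x for x in peers if x is not p])   # (1) selectPeer
    bufs_p = p.select_to_send()                        # (2) P selects data to send
    bufs_q = q.select_to_send()                        # (4) Q selects its reply
    q.select_to_keep(bufs_p)                           # (3) + (6) Q receives and merges
    p.select_to_keep(bufs_q)                           # (5) + (6) P receives and merges
    p.process_data()                                   # (7)
    q.process_data()                                   # (7)
    return q

a = Peer("a", {"x": 1})
b = Peer("b", {"x": 2, "y": 5})
gossip_step(a, [a, b], random.Random(0))
print(a.cache, b.cache)
```

As in the scheme, Q selects its reply before merging what it received from P, so each side only sends data it held before the exchange.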

Reference: A.-M. Kermarrec, M. van Steen, “Gossiping in Distributed Systems”, ACM SIGOPS Operating Systems Review 41(5), Oct. 2007.


Implementing a gossiping protocol

Which specific problems must be addressed when implementing a gossiping protocol?

Membership: how peers get to know each other and how many acquaintances each should have

Network awareness: how to make the links between peers reflect the network topology, so as to achieve satisfactory performance

Buffer management: which information to discard when a peer's memory is full

Message filtering: how to take into account the peers' interest in a message and reduce the probability that they receive information they are not interested in


Gossiping and flooding compared

Information dissemination is the classic and most popular application of gossiping in distributed systems

– A valid alternative to flooding

With flooding

– Every peer that receives the message forwards it to all its neighbors (it can be seen as a degenerate case of gossiping)
– The message is discarded when its TTL reaches zero

[Figure: flooding over Round 1, Round 2 and Round 3. Messages sent: 18; peers reached: 8 of 9]
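Flooding with a TTL can be sketched as a round-by-round traversal of the overlay graph. The sketch makes two assumptions that the slide does not state: a peer does not send the message back to the neighbor it received it from, and a peer that receives a duplicate drops it instead of forwarding it again. The function name `flood` and the example topology are hypothetical.

```python
def flood(graph: dict, source, ttl: int):
    """Flood from source; return (peers reached, messages sent)."""
    reached = {source}
    frontier = [(source, None)]            # (peer, neighbor it received from)
    sent = 0
    while frontier and ttl > 0:
        next_frontier = []
        for peer, sender in frontier:
            for neighbor in graph[peer]:
                if neighbor == sender:
                    continue               # do not send back to the sender
                sent += 1                  # every copy costs one message
                if neighbor not in reached:
                    reached.add(neighbor)  # first copy: forward in the next round
                    next_frontier.append((neighbor, peer))
        frontier = next_frontier
        ttl -= 1                           # one hop consumed per round
    return reached, sent

# Hypothetical 4-peer ring: 0 - 1 - 2 - 3 - 0
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(flood(ring, source=0, ttl=1))
print(flood(ring, source=0, ttl=10))
```

Even on this tiny ring, some of the messages sent are redundant copies delivered to peers that already hold the message, which is the cost that gossiping tries to reduce.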


Gossiping and flooding compared (2)

With simple gossiping

– The message is forwarded with a gossiping probability p

for each msg m
    if random(0,1) < p then send m

[Figure: simple gossiping over Round 1, Round 2 and Round 3, each link labeled with probability p. Messages sent: 11; peers reached: 7 of 9]
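The probabilistic forwarding rule can be dropped into a round-based dissemination loop. This is a minimal sketch, assuming for simplicity that a peer considers every neighbor (including the one it received the message from) and forwards only the first copy it receives; `gossip_flood` and the ring topology are hypothetical names and data.

```python
import random

def gossip_flood(graph: dict, source, p: float, ttl: int, rng: random.Random):
    """Like flooding, but each copy is sent only with probability p."""
    reached, frontier, sent = {source}, [source], 0
    while frontier and ttl > 0:
        next_frontier = []
        for peer in frontier:
            for neighbor in graph[peer]:
                if rng.random() < p:               # the probabilistic test
                    sent += 1
                    if neighbor not in reached:
                        reached.add(neighbor)
                        next_frontier.append(neighbor)
        frontier = next_frontier
        ttl -= 1
    return reached, sent

ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = random.Random(7)
for p in (0.25, 0.5, 1.0):
    reached, sent = gossip_flood(ring, 0, p, ttl=5, rng=rng)
    print(f"p = {p}: reached {len(reached)} of {len(ring)}, sent {sent}")
```

With p = 1 the loop degenerates into flooding; lowering p reduces the number of messages but may leave some peers unreached.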


Gossiping vs flooding

Gossiping features

– Probabilistic
– Takes a localized decision but results in a global state
– Lightweight
– Fault tolerant

Flooding has advantages

– Universal coverage and minimal state information
– … but it floods the network with redundant messages

Gossiping goals

– Reduce the number of redundant transmissions that occur with flooding while trying to retain its advantages
– … but due to its probabilistic nature, gossiping cannot guarantee that all the peers are reached and it requires more time to complete than flooding


Other applications of gossiping in distributed systems

Besides information dissemination…

Peer sampling

– To provide each peer with a list of peers to contact

Resource monitoring in large-scale distributed systems

Distributed computations for data aggregation, in particular in sensor networks

– Computation of aggregate values (e.g., sum, average, maximum, quantiles)
– E.g., to compute the average:
  Let $x_{0,i}$ and $x_{0,j}$ be the values held by nodes i and j at time t=0
  After gossiping between i and j using the push-pull strategy:
  $x_{1,i}, x_{1,j} \leftarrow (x_{0,i} + x_{0,j})/2$
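The averaging rule can be tried out with a short simulation. This is a sketch under assumptions: at least two nodes, uniform random peer selection, and exchanges applied one at a time; `gossip_average` is a hypothetical name. Each exchange preserves the sum of the two values involved, so the global sum never changes and all values drift towards the global mean.

```python
import random

def gossip_average(values, rounds: int, rng: random.Random):
    """Push-pull averaging: each exchange replaces both values by their mean."""
    x = list(values)
    n = len(x)                                # assumes n >= 2
    for _ in range(rounds):
        for i in range(n):                    # every node initiates once per round
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1                        # pick a node other than i
            x[i] = x[j] = (x[i] + x[j]) / 2   # the push-pull update rule
    return x

readings = [10.0, 20.0, 30.0, 40.0]           # hypothetical sensor readings, mean 25
print(gossip_average(readings, rounds=10, rng=random.Random(3)))
```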


Two gossiping protocols

We now examine two examples of gossiping protocols

– Blind counter rumor mongering
– Bimodal multicast


Blind counter rumor mongering

Why that name for this gossiping protocol?

– Rumor mongering (def: “the act of spreading rumours”, also known as gossip): a node with a “hot rumor” will periodically infect other nodes
– Blind: loses interest regardless of the recipient (why)
– Counter: loses interest after F contacts (when)

A node n initiates a broadcast by sending the message m to B of its neighbors, chosen at random.

When node p receives a message m from node q:
    If p has received m no more than F times,
        p sends m to B uniformly randomly chosen neighbors that p knows have not yet seen m.

– Note that p knows if its neighbor r has already seen the message m only if p has sent it to r previously, or if p received the message from r
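The rules above can be turned into a small event-driven simulation. This is only a sketch: messages are delivered in FIFO order from a single queue, which is an assumption made for illustration and not part of the protocol, and the names `rumor_mongering`, `received` and `known` are hypothetical.

```python
import random
from collections import deque

def rumor_mongering(graph, source, B, F, rng):
    """Simulate blind counter rumor mongering; return (reached, messages sent)."""
    received = {node: 0 for node in graph}      # how many times each node got m
    known = {node: set() for node in graph}     # neighbors known to have seen m
    queue = deque()                             # in-flight messages (sender, receiver)
    sent = 0

    def forward(p):
        nonlocal sent
        candidates = [r for r in graph[p] if r not in known[p]]
        for r in rng.sample(candidates, min(B, len(candidates))):
            known[p].add(r)                     # p has now sent m to r
            queue.append((p, r))
            sent += 1

    forward(source)                             # n initiates the broadcast
    while queue:
        q, p = queue.popleft()
        received[p] += 1
        known[p].add(q)                         # p received m from q
        if received[p] <= F:                    # "no more than F times"
            forward(p)
    reached = {node for node, count in received.items() if count > 0} | {source}
    return reached, sent

# Hypothetical overlay: a line of three nodes, 0 - 1 - 2
line = {0: [1], 1: [0, 2], 2: [1]}
print(rumor_mongering(line, source=0, B=2, F=2, rng=random.Random(0)))
```

Because a node never sends m twice to the same neighbor, the number of messages is bounded by the number of directed links, whatever the values of B and F.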


Analysis of blind counter rumor mongering

Difficult to obtain analytical expressions to describe the behavior of a gossiping protocol, even for relatively simple topologies; hence simulation analysis

Assume Barabási network topology:

– 1000 nodes with an average node degree of 6
– Rumor mongering vs flooding scalability (F=2, B=2)

Source: “The cost of application-level broadcast in a fully decentralized peer-to-peer network”


Bimodal multicast

Also called pbcast (probabilistic broadcast)

Composed of two phases:

1. Message distribution phase: a process sends a multicast with no particular reliability guarantees
   – IP multicast if available, otherwise some application-level multicast (e.g., Scribe trees)

2. Gossip repair phase: after a process receives a message, it begins to gossip about the message to a set of peers (called fanout)
   – Gossip occurs at regular intervals and offers the processes a chance to compare their states and fill any gaps in the message sequence

Source: K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Trans. Comput. Syst. 17, 2 (May 1999), 41-88.


Bimodal multicast: message distribution

Start by using unreliable multicast to rapidly distribute the message

But some messages may not get through, and some processes may be faulty

So the initial state involves partial distribution of multicast(s)

[Figure: timeline of processes p1 to p6 in the “Send messages” step; some messages are marked as failed]


Bimodal multicast: gossip repair

Periodically (e.g., every 100 ms) each process sends a digest describing its state to some randomly selected process

The digest identifies messages: it does not include them

[Figure: processes p1 to p6 in the “Send digests” step]


Bimodal multicast: gossip repair (2)

Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

[Figure: processes p1 to p6 in the “Solicit message copies” step]


Bimodal multicast: gossip repair (3)

Processes respond to solicitations received during a round of gossip by retransmitting the requested message

Various optimizations (not examined)

[Figure: processes p1 to p6 in the “Send message copies” step]
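The three steps of gossip repair (send digest, solicit missing copies, retransmit) can be condensed into a small sketch. It assumes, for illustration only, that each process keeps its history as a dictionary from message id to payload; `gossip_repair` is a hypothetical name and the optimizations mentioned above are ignored.

```python
def gossip_repair(sender_log: dict, recipient_log: dict) -> list:
    """One digest exchange; both logs map message id -> payload."""
    digest = set(sender_log)                          # ids only, no payloads
    missing = sorted(digest - set(recipient_log))     # recipient solicits these ids
    for msg_id in missing:
        recipient_log[msg_id] = sender_log[msg_id]    # sender retransmits a copy
    return missing

p1 = {1: "a", 2: "b", 3: "c"}     # p1 received every multicast
p2 = {1: "a", 3: "c"}             # p2 missed message 2 in the first phase
print(gossip_repair(p1, p2), p2)
```

The exchange is one-way: it fills the recipient's gaps from the sender's history, while the sender's own gaps are repaired only when it receives a digest from another process.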


Bimodal multicast: why “bimodal”?

Is it because there are two phases?
No: the name describes the dual “modes” of the result

[Figure: pbcast bimodal delivery distribution. x-axis: number of processes to deliver pbcast (5 to 50); y-axis: p{#processes=k}, log scale from 1.E-30 to 1.E+00. Annotations: “Either sender fails…” and “… or data gets through with high probability”]

1. pbcast is almost always delivered to most or to few processes and almost never to some processes
   – Atomicity = almost all or almost none

2. A second bimodal characteristic is due to delivery latencies, with one distribution of very low latencies (messages that arrive without loss in the first phase) and a second distribution with higher latencies (messages that had to be repaired in the second phase)