Invited Talk:
Epidemic Protocols for Extreme-scale Computing
- Dr. Giuseppe Di Fatta
G.DiFatta@reading.ac.uk
Wednesday, September 24, 2014
Global Knowledge without Global Communication
Outline

Extreme-scale Computing
Extreme-scale systems:
– exascale supercomputing (HPC)
– Tianhe-2 (MilkyWay-2): National Supercomputer Center, Sun Yat-sen University, Guangzhou, China, Top500 N.1 since June 2013, 34/55 Pflop/s, 3.12M cores
– Ubiquitous Computing, Crowd Sensing, P2P Overlay Networks
– Internet of Things (50 to 100 trillion objects)
– Decentralised Online Social Networks
– Large-scale Wireless Sensor Networks
– Mobile ad-hoc Networks (MANET)
– Vehicular Ad-Hoc Networks (VANET)

Scale in terms of:
– number of data objects
– dimensionality of data objects
– number of processing elements

Requirements:
– Scalability of the communication cost
– Decentralisation
– Robustness and fault-tolerance
– Adaptiveness: ability to cope with dynamic environments
Communication-bound
– high scalability
– probabilistic guarantees on convergence speed and accuracy
– robustness, fault-tolerance, high stability under disruption
Figure from: "A preliminary estimation of the reproduction ratio for new influenza A(H1N1) from the outbreak in Mexico, March-April 2009", P. Y. Boëlle, P. Bernillon, J. C. Desenclos, Eurosurveillance, Volume 14, Issue 19, 14 May 2009
An outbreak is an epidemic when cases exceed a "normal" expectation of propagation (a contained propagation).
– The disease spreads person-to-person: the affected individuals become independent reservoirs leading to further exposures.
– In uncontrolled outbreaks there is an exponential growth of the infected cases.
Figure from: “Controlling infectious disease outbreaks: Lessons from mathematical modelling”, T Déirdre Hollingsworth, Journal of Public Health Policy 30, 328-341, Sept. 2009
Disease outbreak → Epidemic communication for extreme-scale computing
Peers are selected uniformly at random, which allows the epidemic protocol to function in a practical way.
Active thread (cycle-based):
– wait some T
– choose a random peer
– send local state

Passive thread (event-based):
– receive remote state
– if remote state == infected, then local state = infected
Time to complete "infection": O(log N)
[Chart: expected # protocol cycles vs. # peers]
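The O(log N) behaviour above can be illustrated with a small simulation of the SI push protocol (a hypothetical sketch, not the speaker's code; `si_push_rounds` and its parameters are assumed names):

```python
import random

def si_push_rounds(n, seed=0):
    """Simulate the SI push protocol on n peers: in each cycle every
    infected peer sends its state to one peer chosen uniformly at
    random; return the number of cycles until all peers are infected."""
    rng = random.Random(seed)
    infected = {0}                      # a single initially infected peer
    cycles = 0
    while len(infected) < n:
        # each infected peer picks a uniform random target
        targets = {rng.randrange(n) for _ in range(len(infected))}
        infected |= targets
        cycles += 1
    return cycles
```

Since the infected set can at most double per cycle, at least log2(N) cycles are needed; the simulation typically completes in a few tens of cycles for N = 1024, consistent with O(log N).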
Several epidemic protocols and their applications in communication networks and distributed systems have been studied, but some practical aspects have received limited attention.
Common simplifying assumptions:
– simplistic synchronous communication model with a static/reliable network
– unrealistic global knowledge of the networked system
– the initial overlay topology is a random graph
– unlimited or "enough" protocol rounds to reach convergence
These assumptions may have serious implications for convergence speed and, even worse, for the convergence guarantee itself.
Global properties of the system typically are not known locally.
fault-tolerant services, such as:
– information dissemination (broadcast, multicast)
– data aggregation: values of aggregate functions are more important than individual data (sum, average, sampling, percentiles, etc.)
– DB replica synchronisation and maintenance
– Network management and monitoring
– Failure detection
– HPC algorithms and services, e.g., QR factorization and power-capping
– Epidemic Knowledge Discovery and Data Mining
K-Means with All-Reduce (data are intrinsically distributed across processes P0 … P3):
– initialisation: generate centroids for the first iteration (Broadcast)
– each process computes local clusters: partial sums
– All-Reduce combines the partial sums into the centroids for the next iteration
– repeat until convergence
Global communication is not a feasible approach for extreme-scale systems
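The iteration described above can be sketched with 1-D data and a plain Python sum standing in for the All-Reduce step (a hypothetical sketch; `kmeans_allreduce` and its parameters are assumed names):

```python
def kmeans_allreduce(partitions, centroids, iters=10):
    """Parallel K-Means sketch: each process computes partial sums of
    its local 1-D data per cluster; a global reduction (a plain sum
    here, standing in for All-Reduce) yields the next centroids."""
    k = len(centroids)
    for _ in range(iters):
        partials = []
        for data in partitions:          # one iteration per "process"
            sums, counts = [0.0] * k, [0] * k
            for x in data:
                c = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
                sums[c] += x
                counts[c] += 1
            partials.append((sums, counts))
        # All-Reduce: combine partial sums and counts from all processes
        tot_s = [sum(p[0][c] for p in partials) for c in range(k)]
        tot_n = [sum(p[1][c] for p in partials) for c in range(k)]
        centroids = [tot_s[c] / tot_n[c] if tot_n[c] else centroids[c]
                     for c in range(k)]
    return centroids
```

Only the small per-cluster partial sums and counts cross process boundaries, never the raw data; the cost of the global reduction itself is what the epidemic variant later removes.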
Epidemic K-Means (data are intrinsically distributed across processes P0 … P3):
– initialisation: every process generates the centroids for the first iteration (Epidemic broadcast)
– each process computes local clusters: partial sums
– Epidemic Aggregation of sums, counts and errors yields the centroids for the next iteration
– repeat until convergence
(or static list of seeds for multiple executions)
dependent allocation (d).
[Charts: clustering accuracy and clustering error (average and standard deviation) vs. cluster distribution (Jain index), under skewed and uniform data distributions, comparing the epidemic, random P2P and local P2P approaches.]
– sum, average, max, min, random samples, quantiles, etc.
A sequential computation of a global aggregate, e.g. a chain of additions, must be serialized: O(N).
binary tree: the number of communication steps is reduced from O(N) to O(log(N)).
The global average: x̄ = (1/N) Σ_{i=1..N} x_i
[Diagram: binary-tree reduction over the values x0 x1 x2 x3 x4 x5 x6 x7.]
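The tree reduction can be sketched in a few lines (a hypothetical helper; `tree_reduce` is an assumed name):

```python
def tree_reduce(xs, op=lambda a, b: a + b):
    """Binary-tree reduction: pairs of values are combined level by
    level, so N values need only O(log N) communication steps instead
    of O(N) serialized additions."""
    level = list(xs)
    steps = 0
    while len(level) > 1:
        # combine adjacent pairs; an odd element is carried up unchanged
        level = [op(level[i], level[i + 1]) if i + 1 < len(level)
                 else level[i]
                 for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps
```

For the eight leaves x0 … x7 of the diagram, the sum is obtained in 3 = log2(8) steps.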
Any global function which can be approximated well using linear combinations.
Node failure
– Node failures must be tolerated.
(membership protocol)
Active thread (cycle-based):
– wait some T
– choose a random peer
– send local state

Passive thread (event-based):
– receive remote state
– [reply with local state]
– merge remote and local state
– Worst case analysis: peak distribution, i.e. information originated at one node
[Chart: node estimates starting from a very high value, a higher value and a lower value all converging to the target value (0.01% error).]
Push: each peer sends its state to another member.
Pull: each peer requests the state from another member; the expected # rounds is the same.
Push-Pull: push and pull in one exchange; reduces the # rounds at an increased communication cost.
Asymmetric Gossiping vs. Symmetric Gossiping
The number x of messages received by a node in a round follows a binomial distribution.
– Initialisation: node i sends the pair <x_i, w_{0,i}> to itself.
[Diagram: in cycle t, node i receives <½s_{t,j}, ½w_{t,j}> from node j and <½s_{t,z}, ½w_{t,z}> from node z, and keeps its own <½s_{t,i}, ½w_{t,i}>.]

s_{t+1,i} = ½s_{t,j} + ½s_{t,i} + ½s_{t,z}
w_{t+1,i} = ½w_{t,j} + ½w_{t,i} + ½w_{t,z}

(variance reduction step)
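A minimal synchronous simulation of the Push-Sum update above (a sketch with assumed function and parameter names; real deployments are asynchronous):

```python
import random

def push_sum(values, cycles=80, seed=1):
    """Synchronous Push-Sum sketch: each node keeps half of its <s, w>
    pair and pushes the other half to a peer chosen uniformly at
    random; the local estimate s/w converges to the global average."""
    rng = random.Random(seed)
    n = len(values)
    s = list(map(float, values))
    w = [1.0] * n
    for _ in range(cycles):
        nxt = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            hs, hw = s[i] / 2, w[i] / 2
            nxt[i][0] += hs          # keep one half locally
            nxt[i][1] += hw
            j = rng.randrange(n)     # uniform random peer
            nxt[j][0] += hs          # push the other half
            nxt[j][1] += hw
        s = [p[0] for p in nxt]
        w = [p[1] for p in nxt]
    return [s[i] / w[i] for i in range(n)]
```

Mass conservation holds by construction: every half of every pair lands in exactly one node's next state, so the sums of s and of w over the network never change.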
– The number of protocol iterations such that the value at a node is diffused through the network, i.e., a peak distribution is transformed into a uniform distribution.
– The diffusion speed is typically given as the complexity of the number of iteration steps as a function of the network size, the maximum error and the maximum probability that the approximation at a node is larger than the maximum error.
With probability at least 1 − δ, the approximation of the global aggregate is within ε in at most O(log(N) + log(1/ε) + log(1/δ)) cycles, where ε and δ are arbitrarily small positive constants.
The local values are exchanged and averaged:
– Node i selects a random node j to exchange their local values.
– Each node computes the average and updates the local pair.
– The exchange must be performed atomically: if not, the conservation of mass in the system is not guaranteed and the protocol does not converge to the true global aggregate.
v_{t+1,i} = ½(v_{t,j} + v_{t,i})
v_{t+1,j} = ½(v_{t,j} + v_{t,i})

(variance reduction step)
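The atomic pairwise averaging step can be sketched as a sequence of random exchanges (a hypothetical sketch; `push_pull_average` and its parameters are assumed names):

```python
import random

def push_pull_average(values, cycles=50, seed=2):
    """Gossip averaging: in each atomic exchange a random pair of nodes
    replaces both local values with their average (variance reduction).
    The sum of all values (mass) is conserved by construction."""
    rng = random.Random(seed)
    v = list(map(float, values))
    n = len(v)
    for _ in range(cycles * n):          # n exchanges per "cycle"
        i, j = rng.randrange(n), rng.randrange(n)
        v[i] = v[j] = (v[i] + v[j]) / 2  # atomic pairwise average
    return v
```

Every exchange lowers (or preserves) the variance of the local values while keeping their sum fixed, so all nodes converge to the global average.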
– no atomic operation is required.
[Diagram: symmetric exchange - node i sends <½s_{t,i}, ½w_{t,i}> to node j and node j sends <½s_{t,j}, ½w_{t,j}> to node i; each node keeps the other half of its own pair.]
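A sketch of the symmetric exchange (SPSP), using the <s, w> pair representation so that mass is conserved even though neither node performs an atomic read-modify-write of the other's state (function names are assumed):

```python
import random

def spsp_cycle(s, w, rng):
    """One SPSP cycle sketch: each node i, in random order, keeps half
    of its <s, w> pair and exchanges the other half with a random peer
    j, which symmetrically replies with half of its own pair."""
    n = len(s)
    order = list(range(n))
    rng.shuffle(order)
    for i in order:
        j = rng.randrange(n)
        if j == i:
            continue
        half_i = (s[i] / 2, w[i] / 2)
        half_j = (s[j] / 2, w[j] / 2)
        # each node = its own kept half + the half received from the peer
        s[i], w[i] = half_i[0] + half_j[0], half_i[1] + half_j[1]
        s[j], w[j] = half_j[0] + half_i[0], half_j[1] + half_i[1]

def spsp(values, cycles=60, seed=3):
    rng = random.Random(seed)
    s = list(map(float, values))
    w = [1.0] * len(values)
    for _ in range(cycles):
        spsp_cycle(s, w, rng)
    return [si / wi for si, wi in zip(s, w)]   # local estimates s/w
```

Because each half of each pair is accounted for exactly once, the totals of s and w are preserved regardless of how the exchanges interleave.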
– Percentage of operations with atomicity violation (AVP): 0.3% and 90%
– Internet-like topologies, 5000 nodes
– PPG and SPSP convergence speed is similar w.r.t. AVP
[Chart: convergence speed of PPG, PSP and SPSP.]
– different AVP levels (from 0.3% to 90%)
– averages over 100 different simulations: Internet-like and mesh topologies, 1000-5000 nodes, different data distributions, asynchronous communication
– Only PSP and SPSP converge to the true global aggregate value
[Chart: PPG, PSP and SPSP.]
[Diagram: epidemic protocol stack - Epidemic application(s) / Aggregation Protocol(s) / Membership Protocol (Uniform Gossiping, overlay topology) / Network and Transport Protocols (physical topology).]
Aggregation protocols:
– exchange information with other nodes to achieve some application goals (e.g., information dissemination, data aggregation)

Membership protocol:
– provides the random peer sampling service for the protocols above and is itself based on an epidemic protocol
[Diagram: at node i, the Epidemic Protocol (1) requests a random node from the Membership Protocol, (2) receives a random node j in response, and (3) sends a push message to node j's Epidemic Protocol.]
– Sparse (out-degree): e.g., a fully connected graph is not scalable (global knowledge)
– Robust: no single points of failure; a star topology has optimal propagation time, but it is not scalable and is not robust
– Load balancing (in-degree): there should not be bottlenecks
– Connectivity: a single connected component
– If the overlay topology is partitioned into multiple connected components, it will not heal (*) and the application-layer epidemic protocol will not converge.
– Good propagation/diffusion: random graphs, expanders
– Partial view of the global system: a local cache of (max size) peer IDs is maintained and used to draw a random entry when requested.
– After an exchange, the merged cache is randomly trimmed to max size.
– This is equivalent to multiple random walks: the cache entries quickly converge to a random sample of the peers with uniform distribution.
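One push-pull cache exchange with random trimming can be sketched as follows (`push_pull_shuffle` and `max_size` are assumed names; real membership protocols also track entry ages and handle failures):

```python
import random

def push_pull_shuffle(cache_i, cache_j, i_id, j_id, max_size, rng):
    """One membership push-pull step: each node merges the two caches
    (learning the other's ID), drops its own ID, and randomly trims
    the merged view back to max_size entries."""
    merged_i = (cache_i | cache_j | {j_id}) - {i_id}
    merged_j = (cache_i | cache_j | {i_id}) - {j_id}
    trim = lambda c: set(rng.sample(sorted(c), min(max_size, len(c))))
    return trim(merged_i), trim(merged_j)
```

Repeating this step between random pairs keeps mixing the partial views, which is what makes the local cache behave like a uniform random sample of the network.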
[Diagram: membership protocol push-pull exchange between node i and node j; caches contain peer IDs a, b, j, c, d, e, f, g.]
[Diagram: node i and node j caches after the push-pull exchange and random trimming.]
– The node caches define an overlay network topology: a transient random overlay network.
– The membership protocol keeps changing the overlay topology over time.
– Aim: the random node sampling from the local partial view results in a uniform distribution over the global system (Uniform Gossiping).
Some membership protocols:
– take an input graph and generate a random output graph with similar properties (assuming a simplified synchronous network model)
– generate graphs with equal in-degree (or with low variance): the in-degree can be used as a measure of robustness
[Diagram: possible overlay states starting from the initial condition - robust and strongly connected digraphs; weakly connected digraphs; multiple (2+) connected components.]
Random graphs with good propagation properties are expander graphs, aka 'expanders'.
– The strong connectivity can be quantified by an index of expansion quality.
The membership protocol should maximise the expansion quality of the overlay topology.
– quasi-random peer selection: random search of a push-pull peer that minimizes cache overlap.
The minimum vertex expansion can be evaluated for different sample sizes (typically 0 < s < ½|V|):
[Diagram: the set of network nodes V, a sample S and its boundary nodes ∂(S).]

V: the set of network nodes
S: a sample of nodes, S ⊆ V, |S| = s
∂(S): the boundary of S, i.e. the set of nodes not in S and 1-hop distant from at least one node in S

The minimum vertex expansion index:
h_min(V, s) = min_{S ⊆ V, |S| = s} |∂(S)| / s
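The index above can be computed exhaustively on small graphs (a sketch with assumed names; for real overlays it would be estimated by sampling random sets S rather than enumerating all of them):

```python
import itertools

def boundary(adj, S):
    """∂(S): the nodes not in S that are 1-hop distant from S."""
    S = set(S)
    return {v for u in S for v in adj[u]} - S

def h_min(adj, s):
    """Minimum vertex expansion index h_min(V, s) = min |∂(S)| / s
    over all samples S with |S| = s (exhaustive enumeration)."""
    return min(len(boundary(adj, S)) / s
               for S in itertools.combinations(adj, s))
```

For example, on a 6-node ring the worst size-2 sample is a pair of adjacent nodes, whose boundary has exactly two nodes.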
– Q_x is the local cache of node x and Q_y is the local cache of node y.
– In each iteration node x sends a push message to node y.
– If |Q_x ∩ Q_y| ≤ T_max, then y accepts the push message and replies with a pull message.
[Diagram: (1) node x sends a push message to node y, (2) y accepts, (3) y replies with a pull message.]
– In each iteration node x sends a push message to a randomly selected node y.
– If |Q_x ∩ Q_y| > T_max, then y forwards the push message to another randomly selected node from Q_y, and this step is repeated until the message is accepted.
[Diagram: node y rejects and forwards the push message to node z, which accepts and replies with a pull message.]
– In order to prevent excessive communication overhead and delay, the forwarding procedure is repeated up to H_max times; the message then returns to the visited node with the lowest similarity, which is forced to accept it.
[Diagram: after H_max rejections the push message is force-accepted by the visited node with the lowest similarity.]
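The quasi-random peer selection with the T_max and H_max thresholds described above can be sketched as (a hypothetical sketch; `select_push_target` and its parameter names are assumptions):

```python
import random

def select_push_target(x, caches, t_max, h_max, rng):
    """Quasi-random peer selection sketch: forward the push message
    while the cache overlap |Qx ∩ Qy| exceeds t_max; after h_max
    rejections, force-accept at the visited node with lowest overlap."""
    y = rng.choice(sorted(caches[x]))
    visited = []
    for _ in range(h_max):
        overlap = len(caches[x] & caches[y])
        if overlap <= t_max:
            return y                               # accepted
        visited.append((overlap, y))
        y = rng.choice(sorted(caches[y] - {x}))    # rejected: forward
    visited.append((len(caches[x] & caches[y]), y))
    return min(visited)[1]                         # force-accept
```

Preferring peers with small cache overlap steers the exchanges toward "distant" regions of the overlay, which is what improves its expansion quality.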
Loss of connectivity may occur when:
– the initial condition is particularly poor (e.g., a ring of communities) and
– interleaving is present in push-pull operations.
A recovery mechanism enables recovery from multiple connected components back to a single one.
– Work in progress: only limited experimental verification
– Interleaving causes unwanted duplication of cache entries.
– Some selected entries, rather than being dropped, are stored in a secondary cache.
– When local duplication is detected, entries from the secondary cache are recovered.
[Chart legend: Protocol, Expander Protocol, Ideal Random.]
Initial topologies: circular regular graph; ring of communities.
– init: circular regular graph
– chart: minimum vertex expansion index h_min(G, 5%|V|)
– init: ring of communities
– chart: minimum vertex expansion index h_min(G, 5%|V|)
– init: circular regular graph
– chart: convergence speed
– init: ring of communities
– chart: max number (over several trials) of connected components vs. # cycle
– 10K nodes, peak distribution, 5 aggregation protocols, init: random graph
– Chart: number of nodes (%) locally converged to the global aggregate within a tolerance error, for different accuracy thresholds (stddev)
– local variance is used to detect convergence

Global synchronisation without global communication
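The local-variance criterion can be sketched as follows (a hypothetical sketch; the window length and threshold are assumed parameters):

```python
def locally_converged(history, window=5, eps=1e-6):
    """A node declares local convergence when the variance of its last
    `window` aggregate estimates falls below the threshold eps."""
    if len(history) < window:
        return False
    recent = history[-window:]
    mean = sum(recent) / window
    return sum((x - mean) ** 2 for x in recent) / window < eps
```

Each node applies this test to its own recent estimates only, so no global communication is needed to decide when to stop or to switch protocol phase.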
[Diagram: protocol stack - Membership Protocol; Aggregation Protocols #1, #2, …, #k; Global Synchronisation.]
application requirements.
available in simulations.
transition period.
– Chart: number of cycles at which nodes have detected global convergence, for different methods.
– fully decentralised
– fault-tolerant
– suitable for extreme-scale networked systems
– suitable for asynchronous and dynamic networks
– Symmetric Push-Sum Protocol (SPSP), an aggregation protocol
– the Expander Membership Protocol
– methods of global convergence detection (synchronisation)
– Epidemic K-Means, the first epidemic data mining algorithm
– Refining and extending the Expander Membership Protocol: incorporating a connectivity recovery mechanism
convergence detection and synchronisation
tree induction, recommender systems, etc.
connectivity
strategy
1. "Symmetric Push-Sum Protocol for Decentralised Aggregation", The International Conference on Advances in P2P Systems (AP2PS), Lisbon, Portugal, 2011.
2. "Epidemic K-Means Clustering", IEEE ICDM Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud), Vancouver, Canada, 11 Dec. 2011.
3. "Fault tolerant decentralised k-Means clustering for asynchronous large-scale networks", Issue 3, March 2013, Pages 317-329.
4. "Convergence Detection in Epidemic Aggregation", Proc. of Euro-Par 2013 Workshops, Aachen, Germany, Aug. 26-30, 2013, Springer LNCS.
5. "Expansion Quality of Epidemic Protocols", International Symposium on Intelligent Distributed Computing (IDC), Madrid, Spain, Sept. 3-5, 2014, Studies in Computational Intelligence, Springer, Vol. 570, 2015, pp. 291-300.
– Nicholas C. Grassly & Christophe Fraser, "Mathematical models of infectious disease transmission", Nature Reviews Microbiology 6, 477-487 (June 2008).
– maintenance, in: Proceedings of the sixth annual ACM Symposium on Principles of distributed computing, PODC '87, ACM, 1987, pp. 1-12.
– 2000, pp. 565-.
– Eugster, P.T.; Guerraoui, R.; Kermarrec, A.-M.; Massoulie, L., "Epidemic information dissemination in distributed systems", Computer, vol. 37, no. 5, pp. 60-67, May 2004.
– Foundations of Computer Science, 2003, pp. 482-491.
– 2005, pp. 219-252.
– 2006, pp. 2508-2530.
– Advances in P2P Systems (AP2PS), Lisbon, Portugal, Nov. 20-25, 2011.
– Samir Khuller, Yoo-Ah Kim, and Yung-Chun Wan, "On generalized gossiping and broadcasting", Journal of Algorithms, 59, 2, May 2006, 81-106.
– P. Jesus, C. Baquero, and P. Almeida, "Dependability in aggregation by averaging", 1st Symposium on Informatics (INForum 2009), Sept. 2009.
– Rafik Makhloufi, Gregory Bonnet, Guillaume Doyen, and Dominique Gaiti, "Decentralized Aggregation Protocols in Peer-to-Peer Networks: A Survey", The 4th IEEE International Workshop on Modelling Autonomic Communications Environments (MACE), 2009.
– 2009,
– Philip Soltero, Patrick Bridges, Dorian Arnold, and Michael Lang, "A gossip-based approach to exascale system services", Proc. of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers (ROSS '13), ACM, 2013.