Towards Load Balanced Distributed Transactional Memory
Gokarna Sharma and Costas Busch
Louisiana State University
Euro-Par'12, August 31, 2012
Distributed Transactional Memory (DTM)
- Transactions run on network nodes
- They ask for shared objects distributed over the network, for either read or write
- The reads and writes on shared objects are supported through three operations: Publish, Lookup, and Move
- Suppose the object ξ is at a predecessor node and another node is the requesting node
- Data-flow model: transactions are immobile and the objects are mobile
Lookup operation
- Replicates the object to the requesting node (or nodes): each requester receives a read-only copy while the main copy remains at its node
Move operation
- Relocates the object explicitly to the requesting node
- Invalidates also the read-only copies (if available)
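The effect of the three operations on the copies of ξ can be sketched as a small bookkeeping class. This is only an illustration of the data-flow semantics described above, not the protocol itself: the class name `SharedObject` and its fields are hypothetical, and routing is ignored entirely.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SharedObject:
    """Toy bookkeeping for one shared object (hypothetical names)."""
    owner: Optional[str] = None                      # node holding the main copy
    readers: Set[str] = field(default_factory=set)   # nodes holding read-only copies

    def publish(self, creator: str) -> None:
        # The creator announces the object and holds the main copy.
        self.owner = creator
        self.readers.clear()

    def lookup(self, requester: str) -> None:
        # Replicate: the requester gets a read-only copy, the main copy stays put.
        if requester != self.owner:
            self.readers.add(requester)

    def move(self, requester: str) -> None:
        # Relocate the main copy and invalidate all read-only copies.
        self.owner = requester
        self.readers.clear()
```

For example, after `publish("a")`, `lookup("b")` and `lookup("c")`, node a still owns the main copy and b, c hold read-only copies; a subsequent `move("d")` makes d the owner and leaves no read-only copies.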
[Figure: source-destination pairs (u1, v1), (u2, v2), (u3, v3)]
General routing: choose paths from sources to destinations
Routing in DTM: source node of the predecessor request in the total order is the destination of a successor request
Edge congestion C_edge: maximum number of paths that use any edge
Node congestion C_node: maximum number of paths that use any node
Stretch = (length of chosen path) / (length of shortest path)
Example for a pair u, v: chosen path of length 12, shortest path of length 8, so stretch = 12/8 = 1.5
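The three measures above can be computed directly from a set of chosen paths. The following is a minimal sketch; the function names are hypothetical, each path is assumed to be a list of node identifiers, and edges are treated as undirected.

```python
from collections import Counter


def edge_congestion(paths):
    """Maximum number of paths that use any single (undirected) edge."""
    load = Counter()
    for path in paths:
        # A path contributes at most once to each edge it uses.
        for edge in {frozenset(e) for e in zip(path, path[1:])}:
            load[edge] += 1
    return max(load.values(), default=0)


def node_congestion(paths):
    """Maximum number of paths that use any single node."""
    load = Counter()
    for path in paths:
        for node in set(path):
            load[node] += 1
    return max(load.values(), default=0)


def stretch(chosen_length, shortest_length):
    """Length of the chosen path divided by the length of the shortest path."""
    return chosen_length / shortest_length
```

With the paths a-b-c, d-b-c and a-b-e, two paths share edge a-b (and two share b-c), while all three pass through node b.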
Oblivious Routing
- Each request path choice is independent of other request path choices
Problem Statement
- Given a d-dimensional mesh and a finite set of operations R = {r_0, r_1, …, r_l} on an object ξ
- Design a DTM algorithm that:
– Minimizes congestion C = max_e |{i : p_i ∋ e}| on any edge e
– Minimizes total communication cost A(R) = Σ_{i=1}^{l} |p_i| for all the operations
Related Work
Protocol | Stretch | Network Kind | Runs on
Arrow [DISC'98] | O(S_ST) = O(D) | General | Spanning tree
Relay [OPODIS'09] | O(S_ST) = O(D) | General | Spanning tree
Combine [SSS'10] | O(S_OT) = O(D) | General | Overlay tree
Ballistic [DISC'05] | O(log D) | Constant-doubling dimension | Hierarchical directory with independent sets
Spiral [IPDPS'12] | O(log^2 n log D) | General | Hierarchical directory with sparse covers
➢ D is the diameter of the network
➢ S_* is the stretch of the tree used (ST: spanning tree, OT: overlay tree)
Limitations and Motivations
- These protocols only minimize stretch and they cannot control congestion
- Congestion can also be a major bottleneck
– may affect the overall performance of the algorithm
- A natural question is whether stretch and congestion can be controlled simultaneously
- Congestion and stretch cannot be minimized simultaneously in arbitrary networks
Our Contributions
- MultiBend DTM algorithm for mesh networks
- For 2-dimensional mesh, MultiBend has both stretch and (edge) congestion O(log n)
- For d-dimensional mesh, MultiBend has stretch O(d log n) and (edge) congestion O(d^2 log n)
- For fixed d,
– stretch is within an O(log log n) factor of optimal, and
– congestion is within an O(1) factor of optimal
In the Remaining…
- Model
- General Approach
- Analogy to a Distributed Queue
- Hierarchical decomposition for MultiBend
- MultiBend Analysis
– Stretch
– Congestion
- Discussion
Model
- Mesh network G = (V,E) of n reliable nodes
- One shared object
- Nodes receive-compute-send atomically
- Nodes are uniquely identified
- Node u can send to node v if it knows v
- One node executes one request at a time
General Approach
- Hierarchical clustering of the network graph
- Alternative representation as a hierarchy tree with leader nodes
- At the lowest level (level 0), every node is a cluster
- Directories at each level cluster; a directory holds a downward pointer if the object location is known
A Publish operation
➢ Assume that a node is the creator of ξ and invokes the Publish operation
➢ Nodes know their parent in the hierarchy
- Send request to the leader
- Continue up phase; set a downward pointer while going up
- Root node found, stop up phase
- After a successful Publish operation, the creator holds ξ and acts as the predecessor node for the next requesting node
Supporting a Move operation
➢ Initially, nodes point downward to the object owner (predecessor node) due to the Publish operation
➢ Nodes know their parent in the hierarchy
- Send request to the leader node of the cluster upward in the hierarchy
- Continue up phase until a downward pointer is found; set the downward path while going up
- Downward pointer found, start down phase
- Continue down phase; discard the old path while going down
- Predecessor reached; the object is moved from the predecessor node to the requesting node
- Lookup is similar, without change in the directory structure, and only a read-only copy of the object is sent
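The up and down phases of Publish and Move can be traced on a small hierarchy tree. The sketch below is a simplification under stated assumptions: the hierarchy is given as a plain child-to-parent map (leader election and real network paths are left out), execution is sequential, and the names `Directory`, `parent`, and `down` are hypothetical.

```python
class Directory:
    """Toy directory over a hierarchy tree (hypothetical names)."""

    def __init__(self, parent):
        self.parent = parent   # child -> parent; the root has no entry
        self.down = {}         # hierarchy node -> child on the path to the owner

    def publish(self, creator):
        # Up phase to the root, setting a downward pointer at every level.
        child = creator
        while child in self.parent:
            node = self.parent[child]
            self.down[node] = child
            child = node

    def move(self, requester):
        # Up phase: set the new downward path until an old pointer is found.
        # (Raises KeyError at the root if nothing was ever published.)
        child = requester
        while True:
            node = self.parent[child]
            old = self.down.get(node)
            self.down[node] = child
            if old is not None:
                break
            child = node
        # Down phase: follow the old path and discard it along the way.
        cur = old
        while cur in self.down:
            cur = self.down.pop(cur)
        return cur  # predecessor node; the object moves from it to the requester
```

Usage: with a root over clusters A = {a1, a2} and B = {b1}, `publish("a1")` followed by `move("a2")` finds the predecessor a1 already at A, while a later `move("b1")` has to climb to the root before descending to a2.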
Distributed Queue Analogy
[Figure sequence: requests from nodes u, v, and w form a distributed queue between head and tail. Queue contents per frame: u; u, v; u, v, w; v, w; w]
Results on Mesh Networks
Type-1 Mesh Decomposition
[Figure: type-1 decomposition of a 2-dimensional mesh]
Type-2 Mesh Decomposition
[Figure: type-2 decomposition; decomposition for a 2^3 × 2^3 2-dimensional mesh]
Hierarchy levels (as labeled in the figure): (i+1,2), (i+1,1), (i,2), (i,1)
MultiBend Hierarchy
- Find a predecessor node via multi-bend paths for each leaf node u
– by visiting leaders of all the clusters that contain u from level 0 to the root level
[Figure: multi-bend paths p(u) and p(v) from leaf nodes u and v toward the root]
MultiBend Hierarchy (2)
- The hierarchy guarantees:
(1) For any two nodes u, v, their multi-bend paths p(u) and p(v) meet at level min{h, log(dist(u,v)) + 2}
(2) length(p_i(u)) is at most 2^{i+3}
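The two guarantees can be combined numerically. The sketch below assumes the logarithm is base 2 and is rounded up, neither of which the slide states; the function names are hypothetical.

```python
import math


def meeting_level(dist, h):
    """Level at which p(u) and p(v) meet, for dist(u, v) = dist >= 1.

    Assumption: log is base 2 and rounded up to an integer level.
    """
    return min(h, math.ceil(math.log2(dist)) + 2)


def path_length_bound(i):
    """Upper bound 2^(i+3) on the length of the multi-bend path up to level i."""
    return 2 ** (i + 3)
```

Under these assumptions, two nodes at distance 4 meet at level 4 (when h is at least 4), where the path length bound is 2^7 = 128, a constant multiple (32) of the distance.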
(Canonical) downward Paths
[Figure: multi-bend paths p(u) and p(v) between the nodes u, v and the root]
- p(v) is a (canonical) downward path
Load Balancing
- Through a leader election procedure
– Every time we access the leader of a sub-mesh, we replace it with another leader chosen uniformly at random among its nodes
- The directory is updated appropriately by updating parent and child leaders
– Locking may be needed in concurrent executions
- The update cost is low in comparison to the cost of serving requests
- This step is necessary to control congestion
– With a fixed leader, edge congestion can be O(l), where l is the number of requests
- If the congestion requirement can be relaxed by a factor of ρ, the leader change is needed only after every ρ requests
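The leader replacement rule, including the relaxed variant with parameter ρ, can be sketched as follows. This is an illustration only: the class `SubMesh` and its fields are hypothetical, and the directory update of parent and child leaders (and any locking) is omitted.

```python
import random


class SubMesh:
    """Toy sub-mesh that re-elects its leader every rho accesses (hypothetical)."""

    def __init__(self, nodes, rho=1, rng=None):
        self.nodes = list(nodes)
        self.rho = rho                        # rho = 1: re-elect after every access
        self.rng = rng or random.Random()
        self.leader = self.rng.choice(self.nodes)
        self.served = 0

    def access(self):
        """Serve one request at the current leader, then re-elect if due."""
        served_by = self.leader
        self.served += 1
        if self.served % self.rho == 0:
            # New leader chosen uniformly at random among the sub-mesh nodes.
            self.leader = self.rng.choice(self.nodes)
        return served_by
```

With `rho=3`, three consecutive requests are served by the same leader before a new one is drawn, which trades some congestion for fewer directory updates.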
Analysis of MultiBend
Analysis on (move) Stretch
- Assume a sequential execution R of l+1 Move requests, where r0 is an initial Publish request
[Figure: requests r0, r1, r2, …, rl at nodes u, v, w, x, y, shown per level 1, 2, …, k, …, h]
- A*(R) ≥ max_{1≤k≤h} (S_k - 1) 2^{k-1}
- A(R) ≤ Σ_{k=1}^{h} (S_k - 1) 2^{k+3}
- Thus, since 2^{k+3} = 16 · 2^{k-1} and a sum of h terms is at most h times its largest term:
A(R)/A*(R) ≤ Σ_{k=1}^{h} (S_k - 1) 2^{k+3} / max_{1≤k≤h} (S_k - 1) 2^{k-1}
≤ 16 h · max_{1≤k≤h} (S_k - 1) 2^{k-1} / max_{1≤k≤h} (S_k - 1) 2^{k-1} = 16 h = O(log n)
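The ratio of the two bounds can be checked numerically for any choice of the S_k values. In this sketch the list `S` is a hypothetical input holding S_1, …, S_h as they appear in the bounds above; at least one S_k must exceed 1 so that the lower bound is positive.

```python
def stretch_bound(S):
    """Return (upper, lower, ratio) for the bounds on A(R) and A*(R).

    S[k-1] holds S_k for levels k = 1..h (hypothetical input).
    """
    h = len(S)
    upper = sum((S[k - 1] - 1) * 2 ** (k + 3) for k in range(1, h + 1))
    lower = max((S[k - 1] - 1) * 2 ** (k - 1) for k in range(1, h + 1))
    return upper, lower, upper / lower
```

For S = [5, 3, 2] (h = 3), every level contributes the same amount: the upper bound is 64 + 64 + 64 = 192, the lower bound is 4, and the ratio 48 equals 16h, so the factor 16h is attained in this instance.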
Analysis on (Edge) Congestion
- Assume M1 is a type-1 submesh
- A sub-path uses edge e with probability 2/m_l
- P': set of paths from M1 to M2 or vice-versa
- C'(e): congestion caused by P' on e
- E[C'(e)] ≤ 2|P'|/m_l
- B ≥ |P'|/out(M1)
- out(M1) ≤ 4 m_l
- C* ≥ B
==> E[C'(e)] ≤ 2|P'|/m_l ≤ 2 B · out(M1)/m_l ≤ 8B ≤ 8C*
[Figure: submeshes M1 and M2, edge e, and m_l]
Analysis on (Edge) Congestion (2)
- M1 at level (i,2) is always completely contained in M2 at level (i+1,2)
- log n + 2 levels
- E[C(e)] ≤ 8C*(log n + 2)
- Considering type-2 submeshes
– exactly one type-2 submesh between every two type-1 submeshes
– the type-2 submeshes may not be proper subsets of type-1 submeshes
– 4 possible type-2 submesh choices (path may bend at most 2 times)
– Increases the load by a factor of 4
- Thus, using a standard Chernoff bound, C = O(C* log n), w.h.p.
d-Dimensional Mesh
Extensions to d-dimensional mesh
- The 2-dimensional decomposition can be directly generalized to a d-dimensional mesh
– Problem: the congestion becomes O(2^d log n)
- Another decomposition is used to control congestion in O(C* d^2 log n) and stretch O(d log n)
- We set an appropriate λ and shift the type-1 submeshes by (j-1)λ nodes in each dimension to get type-j submeshes
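The shifting rule can be illustrated by computing which type-j submesh a node falls into. This is a sketch under assumptions the slide does not state: submeshes at a level are axis-aligned cubes of a common side length `side`, the type-1 grid starts at the origin, the shift is applied in the positive direction, and boundary submeshes may be truncated. The function name and parameters are hypothetical.

```python
def type_j_cell(coords, side, j, lam):
    """Index of the type-j submesh containing the node at `coords`.

    The type-1 grid (j = 1) is cut at multiples of `side`; the type-j grid is
    the same grid shifted by (j - 1) * lam nodes in every dimension.
    """
    shift = (j - 1) * lam
    # Floor division also handles the (truncated) cells before the first cut.
    return tuple((x - shift) // side for x in coords)
```

In one dimension with `side=8` and `lam=2`, the neighbouring nodes 7 and 8 lie in different type-1 submeshes but in the same type-2 submesh, so in this sketch the shifted types cover pairs of nearby nodes that a type-1 boundary separates.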
Extensions to d-dimensional mesh (2)
3-dimensional mesh decomposition. Only 2 of the 3 dimensions are shown
O(d)-types of submeshes on each level
O(d) sub-levels in every level with type-1 submesh first and type-O(d) submesh last in the order
Summary
- First load balanced distributed directory protocol with both stretch and congestion O(log n) in 2-dimensional mesh
- In d-dimensional mesh, stretch O(d log n) and (edge) congestion O(d^2 log n)
- MultiBend is starvation-free and provides linearizability.
- Future work:
– extend the results to dynamic networks
– make it fault-tolerant