

SLIDE 1

Allocating memory in a lock-free manner

Anders Gidenstam, Marina Papatriantafilou and Philippas Tsigas

Distributed Computing and Systems group, Department of Computer Science and Engineering,

Chalmers University of Technology

SLIDE 2

2005 Anders Gidenstam, Distributed Computing and Systems, Chalmers

Outline

  • Introduction
  • Lock-free synchronization
  • Memory allocators
  • NBmalloc
  • Architecture
  • Data structures
  • Experiments
  • Conclusions

SLIDE 3

Synchronization on a shared object

Lock-free and wait-free synchronization

Concurrent operations without enforcing mutual exclusion

Avoids:

  • blocking and priority inversion

Lock-free

  • At least one operation always makes progress

Wait-free

  • All operations finish in a bounded number of their own steps

Synchronization primitives

Built into CPU and memory system

  • Atomic read-modify-write (i.e. a critical section of one instruction)
  • Examples: Test-and-set, Compare-and-Swap, Load-Linked / Store-Conditional
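
A minimal sketch of how Compare-and-Swap is commonly used, assuming C++11 `std::atomic`. The helper name `fetch_max` is hypothetical and does not come from the talk. The retry loop is lock-free in the sense defined above: a CAS only fails because some other operation changed the value, that is, made progress.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically raise `target` to at least `value`.
// Returns the value that was observed before any update.
inline int fetch_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    // On failure compare_exchange_weak reloads `seen`, so each retry
    // works with the freshest value. No thread ever waits for a lock.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
    return seen;
}

// Usage sketch (single-threaded, so the outcome is deterministic).
inline void fetch_max_example() {
    std::atomic<int> high_water{3};
    assert(fetch_max(high_water, 7) == 3);  // raised from 3 to 7
    assert(fetch_max(high_water, 5) == 7);  // 5 is not larger: unchanged
    assert(high_water.load() == 7);
}
```

The same read, compute, CAS, retry pattern is the usual building block for larger lock-free data structures.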

SLIDE 4

Synchronization on a shared object

Desired semantics of a shared data object

Linearizability [Herlihy & Wing, 1990]

  • For each operation invocation there must be one single time instant during its duration where the operation appears to take effect.

[Figure: timeline of the operations O1, O2 and O3]

SLIDE 5

Memory management and lock-free synchronization

Concurrent memory management

Concurrent applications

  • Memory is a shared resource
  • Concurrent memory requests
  • Potential problems: contention, blocking, etc

Why lock-free?

  • Scalability/fault-tolerance potential
  • Prevents a delayed thread from blocking other threads (delays caused by scheduler decisions, page faults, etc.)
  • Many non-blocking algorithms use dynamic memory allocation, so a non-blocking memory allocator is needed

SLIDE 6

Memory Allocators

Provide dynamic memory to the application

  • Allocate / Deallocate interface
  • Maintains a pool of memory (a.k.a. heap)

Online problem: requests are handled in order

Performance

  • Fragmentation
  • Runtime overhead

[Figure: heap layout along the memory address axis]

SLIDE 7

Concurrent Memory Allocators

Goals

  • Scalability

Avoiding

  • False-sharing: different threads use separate data that lie in the same cache-line
  • Heap blowup: memory freed on one CPU is not made available to the others
  • Fragmentation
  • Runtime overhead

[Figure: CPUs accessing data in the same cache line]
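
A small sketch of the false-sharing issue, under the assumption of a 64-byte cache line (common, but not universal). All type and function names are illustrative only. Two counters packed side by side can land in one cache line, so writes by different CPUs invalidate each other's copy. Aligning each counter to its own line avoids that, at the cost of memory.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes.
constexpr std::size_t kCacheLine = 64;

// Two counters packed together: they fit in, and will usually share,
// a single cache line.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// One counter per cache line: the alignment pads the struct to 64 bytes.
struct PaddedCounter {
    alignas(kCacheLine) std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};

// Index of the cache line that contains address `p`.
inline std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

inline bool same_line(const void* p, const void* q) {
    return line_of(p) == line_of(q);
}
```

An allocator faces the same trade-off: handing blocks from one cache line to threads on different CPUs saves space but invites false sharing.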

SLIDE 8

The Hoard architecture [Berger et al, 2000]

[Figure: processor heaps, one row of superblocks (SB) per size class in each heap; every superblock begins with an SB header]

Per-processor heaps

  • Threads running on different CPUs allocate from different places
  • Avoids false-sharing and limits contention

Fixed set of size classes/allocatable sizes

  • Handled separately
  • Pros: Simple
  • Cons: Increases internal fragmentation

Superblocks

  • Each contains blocks of one size class
  • Pros: Easy to transfer and reuse memory, prevents heap blowup
  • Cons: External fragmentation
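
The size-class and superblock trade-offs above can be made concrete with a small sketch. The power-of-two classes, the 64 KiB superblock size and all names below are assumptions chosen for simplicity. They are not the values used by Hoard or NBmalloc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size classes: powers of two from 8 to 4096 bytes.
constexpr std::size_t kMinBlock = 8;
constexpr std::size_t kMaxBlock = 4096;
// Assumed superblock size.
constexpr std::size_t kSuperblockBytes = 64 * 1024;

// Smallest size class that fits `request`, or 0 if the request is too
// large for any class (such requests would need a separate path).
inline std::size_t size_class_for(std::size_t request) {
    if (request > kMaxBlock) return 0;
    std::size_t block = kMinBlock;
    while (block < request) block *= 2;
    return block;
}

// Internal fragmentation: bytes handed out but not asked for.
inline std::size_t internal_waste(std::size_t request) {
    return size_class_for(request) - request;
}

// A superblock holds blocks of exactly one size class.
inline std::size_t blocks_per_superblock(std::size_t block_size) {
    return kSuperblockBytes / block_size;
}
```

For example, a 24-byte request is served from the 32-byte class and wastes 8 bytes, which is the internal fragmentation listed as a con of fixed size classes.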

SLIDE 9

The lock-free challenges

1. The superblock internal freelist
2. Moving and finding superblocks within a per-processor heap
3. Returning superblocks to the global heap for reuse

  • Lock-free stack (a.k.a. IBM freelist [IBM, 1983])
  • New lock-free data structure, the flat-set, which can find an item in a set and move an item between sets atomically
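
A sketch of a lock-free stack used as a freelist of block indices, assuming C++11 and a lock-free 64-bit `std::atomic`. This is a generic textbook-style construction with a version tag next to the head index. It is not taken from the NBmalloc source, and the class and member names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "no block" marker

class IndexFreelist {
public:
    // Starts with blocks 0..count-1 all free.
    explicit IndexFreelist(std::uint32_t count)
        : next_(count), head_(pack(kNil, 0)) {
        for (std::uint32_t i = 0; i < count; ++i) push(i);
    }

    // Return a block to the freelist.
    void push(std::uint32_t block) {
        std::uint64_t old_head = head_.load();
        std::uint64_t new_head;
        do {
            next_[block].store(index_of(old_head));
            new_head = pack(block, tag_of(old_head) + 1);
        } while (!head_.compare_exchange_weak(old_head, new_head));
    }

    // Take a free block, or kNil if none is left.
    std::uint32_t pop() {
        std::uint64_t old_head = head_.load();
        std::uint64_t new_head;
        do {
            std::uint32_t top = index_of(old_head);
            if (top == kNil) return kNil;
            // The tag changes on every update, so a stale `old_head`
            // fails the CAS even if the same index is on top again.
            new_head = pack(next_[top].load(), tag_of(old_head) + 1);
        } while (!head_.compare_exchange_weak(old_head, new_head));
        return index_of(old_head);
    }

private:
    static std::uint64_t pack(std::uint32_t index, std::uint32_t tag) {
        return (static_cast<std::uint64_t>(tag) << 32) | index;
    }
    static std::uint32_t index_of(std::uint64_t word) {
        return static_cast<std::uint32_t>(word);
    }
    static std::uint32_t tag_of(std::uint64_t word) {
        return static_cast<std::uint32_t>(word >> 32);
    }

    std::vector<std::atomic<std::uint32_t>> next_;  // per-block link
    std::atomic<std::uint64_t> head_;               // {tag, top index}
};
```

Using indices into a fixed array instead of raw pointers means the links stay readable even after a block has been popped, which keeps `pop` safe without a separate reclamation scheme.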

SLIDE 10

Lock-free flat-sets

Lock-free container data structure

Properties

  • Items can be moved from one set to another atomically
  • An item can only be in one “set” at a time

Operations

  • Insert: atomically removes the item from its old location
  • Get_any

Unless “Remove + Insert” appears atomic an item may get stuck in “limbo”.

[Figure: an item taken from one lock-free set by Remove and placed in another by Insert]

[Figure: a flat-set with a Current position, its entries referring to superblocks (SB header)]
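
A sketch of the “limbo” problem, not of the flat-set itself. `SlotSet` is a simplified stand-in (a fixed array of pointer slots) and every name here is hypothetical. `get_any` is assumed to return an item without removing it. The demo performs a naive move as two separate atomic steps and checks what an observer would see in between.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

struct Superblock { int id; };  // stand-in for a real superblock header

// Simplified stand-in for a flat-set: a fixed array of pointer slots.
template <std::size_t N>
class SlotSet {
public:
    SlotSet() {
        for (auto& s : slots_) s.store(nullptr);
    }

    // Claims the first empty slot with a CAS.
    // Returns the slot used, or nullptr if the set is full.
    std::atomic<Superblock*>* insert(Superblock* sb) {
        for (auto& s : slots_) {
            Superblock* expected = nullptr;
            if (s.compare_exchange_strong(expected, sb)) return &s;
        }
        return nullptr;
    }

    // Returns some item currently in the set, or nullptr if empty.
    Superblock* get_any() const {
        for (const auto& s : slots_) {
            if (Superblock* sb = s.load()) return sb;
        }
        return nullptr;
    }

private:
    std::array<std::atomic<Superblock*>, N> slots_;
};

// Naive move = Remove, then Insert. Returns true if the item was
// visible in neither set between the two steps and ended up in `b`.
inline bool naive_move_leaves_limbo() {
    SlotSet<4> a, b;
    Superblock sb{1};
    std::atomic<Superblock*>* slot = a.insert(&sb);
    Superblock* taken = slot->exchange(nullptr);  // step 1: Remove
    bool in_limbo = a.get_any() == nullptr && b.get_any() == nullptr;
    b.insert(taken);                              // step 2: Insert
    return in_limbo && b.get_any() == &sb;
}
```

If the moving thread is delayed or fails between the two steps, no other thread can find the superblock. That is the situation the atomic move of the flat-set is meant to rule out.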

SLIDE 11

Moving a shared pointer

Goal: Move a pointer value between two shared pointer locations

Requirements

  • The pointer target must stay accessible
  • The same number of shared pointers to the target after the move as before
  • Lock-free behaviour

Issues

  • One atomic CAS is not enough! We’ll need several steps.
  • Interfering threads need to help unfinished operations

SLIDE 12

Moving a shared pointer

[Figure: steps of moving a pointer from the shared location From to the shared location To, with the labels New_pos and Old_pos]

Note that some extra details are needed to prevent ABA problems.
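
A deterministic sketch of the ABA problem and of one common remedy, a version counter stored next to the value. The talk does not say which technique NBmalloc uses, so this is only an illustration. All names are hypothetical, and a lock-free 64-bit `std::atomic` is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Node { int value; };

// Plain-pointer CAS cannot tell "still A" from "A, then B, then A".
// Returns true if the stale CAS succeeded.
inline bool plain_cas_fooled() {
    Node a{1}, b{2};
    std::atomic<Node*> shared{&a};
    Node* snapshot = shared.load();  // slow thread reads A
    shared.store(&b);                // meanwhile: A -> B
    shared.store(&a);                //            B -> A
    return shared.compare_exchange_strong(snapshot, &b);
}

// Versioned word: low 32 bits hold an index, high 32 bits a counter
// that is incremented on every write.
inline std::uint64_t make_word(std::uint32_t index, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | index;
}
inline std::uint32_t version_of(std::uint64_t word) {
    return static_cast<std::uint32_t>(word >> 32);
}

inline void versioned_store(std::atomic<std::uint64_t>& w,
                            std::uint32_t index) {
    std::uint64_t old_word = w.load();
    while (!w.compare_exchange_weak(
        old_word, make_word(index, version_of(old_word) + 1))) {
    }
}

// Same A -> B -> A history, but the version differs, so the stale
// CAS fails. Returns true if the stale CAS succeeded.
inline bool versioned_cas_fooled() {
    std::atomic<std::uint64_t> shared{make_word(0, 0)};
    std::uint64_t snapshot = shared.load();  // slow thread reads {0, v0}
    versioned_store(shared, 1);              // index 0 -> 1
    versioned_store(shared, 0);              // index 1 -> 0, version is now 2
    return shared.compare_exchange_strong(
        snapshot, make_word(1, version_of(snapshot) + 1));
}
```

Here `plain_cas_fooled()` returns true and `versioned_cas_fooled()` returns false: the counter makes the intermediate updates visible to the late CAS.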

SLIDE 13

Experimental results

Benchmark applications

Larson

  • Scalability
  • False-sharing

Active-false/Passive-false

  • Active false-sharing
  • Passive false-sharing

SLIDE 14

Experimental results

Larson benchmark. Sun 4xUltraSPARC III

[Figure: two plots, Speed-up and Memory usage]

SLIDE 15

Experimental results

Larson benchmark. SGI Origin 3800 32(/128)xMIPS

[Figure: two plots, Speed-up and Memory usage]

SLIDE 16

Conclusions

Lock-free memory allocator

  • Scalable
  • Behaves well on both UMA and NUMA architectures

Lock-free flat-sets

  • New lock-free data structure
  • Allows lock-free inter-object operations

Implementation

  • Freely available (GPL)

SLIDE 17

Future Work

  • Further development of the memory allocator
  • Reclaiming superblocks for reuse in a different size class
  • Improve search strategies for flat-sets
  • Evaluate the memory allocator with real applications
  • How to make lock-free composite objects from “smaller” lock-free objects

SLIDE 18

Questions?

Contact Information:

Address: Anders Gidenstam, Computer Science & Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden

Email: andersg @ cs.chalmers.se

Web: http://www.cs.chalmers.se/~dcs and http://www.cs.chalmers.se/~andersg

Implementation: http://www.cs.chalmers.se/~dcs/nbmalloc.html

SLIDE 19

Concurrent applications

[Figure: application classes placed by #CPUs and #Threads: traditional desktop applications, traditional multi-threaded desktop applications, multi-threaded applications on new multicore CPU(s), and high performance multi-threaded applications on multiprocessors]