
Slides for Lecture 25

ENCM 501: Principles of Computer Architecture, Winter 2014 Term

Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

10 April, 2014

ENCM 501 W14 Slides for Lecture 25

slide 2/17

Previous Lecture

◮ MSI protocol for cache concurrency
◮ snooping to support MSI and similar protocols
◮ race conditions at ILP and TLP levels
◮ use of locks to manage TLP races


Today’s Lecture

◮ introduction to ISA support for locks
◮ comments on Pthreads programming
◮ some general remarks about computer architecture and software development

Related reading in Hennessy & Patterson: Section 5.5


Using locks to manage access to shared data

Thread A:
    while ( condition ) {
        do some work
        acquire lock
        counter++;
        release lock
    }

Thread B:
    while ( condition ) {
        do some work
        acquire lock
        counter++;
        release lock
    }

Correct updates to counter will be done by both threads, because only one thread can have the lock at any given time. If A has the lock and B tries to acquire the lock, B will have to wait.


Example of locking with a Pthreads mutex

This C code sets up a global variable of type pthread_mutex_t, and initializes it to the unlocked state . . .

pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;

Functions pthread_mutex_lock and pthread_mutex_unlock can be used to acquire and release the lock, as seen on the next slide. Note the abstraction here: The interface is designed to hide ISA and microarchitecture details from the programmer.

pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;

void * foo(void * arg)
{
    while (/* some condition */) {
        /* Do some work. */

        // Get the lock.
        // If another thread has the lock, wait.
        pthread_mutex_lock(&the_lock);

        /* Access shared memory. */

        // Release the lock.
        pthread_mutex_unlock(&the_lock);
    }
    /* ... */
}



How NOT to set up a lock

int my_lock = 0;  // 0: unlocked; 1: locked.

void get_lock(void)
{
    // Keep trying until my_lock == 0.
    while (my_lock != 0)
        ;
    // Aha! The lock is available!
    my_lock = 1;
}

Why is this approach totally useless?


ISA and microarchitecture support for concurrency

Instruction sets have to provide special atomic instructions to allow software implementation of synchronization facilities such as mutexes (locks) and semaphores. An atomic RMW (read-modify-write) instruction (or a sequence of instructions that is intended to provide the same kind of behaviour, such as MIPS LL/SC) typically works like this:

◮ memory read is attempted at some location;
◮ some kind of write data is generated;
◮ memory write to the same location is attempted.


The key aspects of atomic RMW instructions are:

◮ the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;
◮ if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.


MIPS LL and SC instructions

LL (load linked): This is like a normal LW instruction, but it also gets the processor ready for an upcoming SC instruction.

SC (store conditional): The assembler syntax is

    SC GPR1, offset(GPR2)

If SC succeeds, it works like SW, but also writes 1 into GPR1. If SC fails, there is no memory write, and GPR1 gets a value of 0.

The hardware ensures that if two or more threads attempt LL/SC sequences that overlap in time, SC will succeed in only one thread.


Use of LL and SC to lock a mutex

Suppose R9 points to a memory word used to hold the state of a mutex: 0 for unlocked, 1 for locked. Here is code for MIPS, with delayed branch instructions:

L1: LL   R8, (R9)
    BNE  R8, R0, L1
    ORI  R8, R0, 1
    SC   R8, (R9)
    BEQ  R8, R0, L1
    NOP

Let's add some comments to explain how this works. What would the code be to unlock the mutex?


Spinlocks

The example on the last slide demonstrates spinning in a loop to acquire a lock. Suppose Thread A is spinning, waiting to acquire a lock. Then Thread A is occupying a core, using energy, and not really doing any work. That’s fine if the lock will soon be released. However, if the lock may be held for a long time, a more sophisticated algorithm is better:

◮ Thread spins, but gives up after some fixed number of iterations.
◮ Thread makes system call to OS kernel, asking to sleep until the lock is available.


SC and similar instructions have long latencies

In a multicore system, SC, and instructions intended for similar purposes in other ISAs (see, for example, CMPXCHG in x86 and x86-64), will necessarily have to inspect some shared global state to determine success or failure. There is no way to make a safe decision about SC by looking only within one core's private cache! Execution of SC by a core must therefore cause a many-cycle stall in that core. It's only one instruction, but that doesn't mean it's cheap in terms of time!


Locks aren’t free, but are often necessary

It should be clear that any kind of variable or data structure that could be written to by two or more threads must be protected by a lock.

Consider a program in which only one thread writes to a variable or data structure, but many threads read that variable or data structure from time to time.

Why might a lock be necessary in this one-writer, many-reader case? What kind of modification to the lock design might improve program efficiency?


Comments on Pthreads programming

There’s not much time left, so I won’t try to provide any new details beyond what appeared in assignments and a tutorial. Some very general comments:

◮ Pthreads was chosen for thread examples in ENCM 501 because the basics can be understood if you have reasonable understandings of C and the concept of virtual address spaces for processes.
◮ To be effective with Pthreads, you need to learn a lot more than just the very basics given in ENCM 501.
◮ There are a lot of alternatives and a lot of innovation currently underway; if your application needs threads, do research on multiple languages and libraries!


Useful quotes

This course has mostly been about hardware designs to enhance instruction throughput, and a little about tradeoffs with energy use, chip area, and so on. But that does not mean that speed is important in every part of every system . . .

"The best performance improvement is the transition from the nonworking state to the working state." —John Ousterhout

"Premature optimization is the root of all evil (or at least most of it) in programming." —Donald Knuth

A little Web search regarding the Knuth quote is very worthwhile: what does it mean, exactly, and what does it not say?


Some of the things I hope you take away from this course

Better appreciation for the intense design efforts that have given us today's astonishingly powerful, complex, and cheap processor chips and memory systems.

A starting point for research on ways to get maximum performance out of current and future hardware, when maximum performance is needed.

A starting point for research, if you are trying to choose the best processor and memory system for a high-performance embedded application.