SLIDE 1

CADSL

Simultaneous Multi-Threaded Design

Virendra Singh

Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

EE-739: Processor Design

Lecture 34 (09 April 2013)

SLIDE 2

Program vs Process

• A program is a passive entity that specifies the logic of data manipulation and I/O actions
• A process is an active entity that performs the actions specified in a program
• Multiple executions of a program lead to concurrent processes


09 Apr 2013 EE-739@IITB 2

SLIDE 3

Process

• A process is a program in execution that can be in a number of states
  – running, waiting, ready, terminated
• Process creation
  – fork() and exec() system calls
• Inter-process communication
  – shared memory and message passing
• Client–server communication
  – sockets, RPC, RMI


SLIDE 4

Threads

• A thread (a lightweight process) is a basic unit of CPU utilization.
• A thread has a single sequential flow of control.
• A thread comprises a thread ID, a program counter, a register set, and a stack.
• A process is the execution environment in which threads run.
  – (Recall the earlier definition: a process is a program in execution.)
• The process holds the code section, data section, and OS resources (e.g. open files and signals).
• Traditional processes have a single thread of control.
• Multi-threaded processes have multiple threads of control.
  – The threads share the address space and resources of their process.


SLIDE 5

Single and Multithreaded Processes

• Threads encapsulate concurrency: the "active" component.
• Address spaces encapsulate protection: the "passive" part, keeping a buggy program from trashing the system.


SLIDE 6

Processes vs. Threads

Which of the following belong to the process and which to the thread?

Program code: Process
Local or temporary data: Thread
Global data: Process
Allocated resources: Process
Execution stack: Thread
Memory-management info: Process
Program counter: Thread
Parent identification: Process
Thread state: Thread
Registers: Thread


SLIDE 7

Control Blocks

• The thread control block (TCB) contains:
  – thread state, program counter, registers
• PCB′ = everything else (e.g. process ID, open files, etc.)
• The process control block (PCB) = PCB′ ∪ TCB
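As a rough illustration of this split, a hypothetical layout (not any particular OS's actual control blocks) might look like:

```c
#include <stdint.h>

/* hypothetical TCB layout, for illustration only */
typedef struct {
    int       state;        /* running, ready, waiting, ... */
    uintptr_t pc;           /* program counter */
    uintptr_t regs[16];     /* saved register set */
    uintptr_t stack_ptr;    /* top of this thread's stack */
} tcb_t;

/* PCB' : the per-process state that is NOT per-thread */
typedef struct {
    int   pid;
    void *page_table;       /* memory-management info */
    int   open_files[16];   /* OS resources */
} pcb_prime_t;

/* PCB = PCB' U TCB(s): a multi-threaded process carries one
 * copy of the shared state plus one TCB per thread */
typedef struct {
    pcb_prime_t shared;
    tcb_t       threads[8];
    int         nthreads;
} pcb_t;
```

The point of the decomposition: a thread switch saves and restores only a tcb_t, while a process switch must also change the pcb_prime_t state (address space, open files), which is why thread switches are cheaper.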


SLIDE 8

Why use threads?

• Because threads have minimal internal state, it takes less time to create a thread than a process (about a 10x speedup in UNIX).
• It takes less time to terminate a thread.
• It takes less time to switch to a different thread.
• A multi-threaded process is much cheaper than multiple (redundant) processes.


SLIDE 9

Examples of Using Threads

• Threads are useful for any application with multiple tasks that can run under separate threads of control.
• A word processor may have separate threads for:
  – user input
  – spelling and grammar checking
  – displaying graphics
  – document layout
• A web server may spawn a thread for each client
  – it can serve clients concurrently with multiple threads
  – using multiple threads incurs less overhead than using multiple processes


SLIDE 10

Examples of multithreaded programs

• Most modern OS kernels
  – internally concurrent, because they must handle concurrent requests from multiple users
  – but no protection is needed within the kernel
• Database servers
  – access to shared data by many concurrent users
  – background utility processing must also be done
• Parallel programming (more than one physical CPU)
  – split a program into multiple threads for parallel execution


SLIDE 11

Multithreaded Matrix Multiply...

A × B = C

C[1,1] = A[1,1]·B[1,1] + A[1,2]·B[2,1] + …
In general, C[i,j] = Σ_k A[i,k]·B[k,j]: the sum of products of corresponding elements in row i of A and column j of B.

Each resultant element can be computed independently.


SLIDE 12

Multithreaded Matrix Multiply

typedef struct {
    int id;
    int size;
    int row, column;
    matrix_t *MA, *MB, *MC;
} matrix_work_order_t;

main()
{
    int size = ARRAY_SIZE, row, column;
    matrix_t MA, MB, MC;
    matrix_work_order_t *work_orderp;
    pthread_t peer[size*size];
    ...


SLIDE 13

Multithreaded Matrix Multiply

/* process matrix, by row, column */
for (row = 0; row < size; row++)
    for (column = 0; column < size; column++) {
        id = column + row * ARRAY_SIZE;
        work_orderp = malloc(sizeof(matrix_work_order_t));
        /* initialize all members of work_orderp */
        pthread_create(&peer[id], NULL, peer_mult, work_orderp);
    }

/* wait for all peers to exit */
for (i = 0; i < size*size; i++)
    pthread_join(peer[i], NULL);
}


SLIDE 14

Benefits

• Responsiveness:
  – threads allow a program to continue running even if part of it is blocked
  – for example, a web browser can accept user input while loading an image
• Resource sharing:
  – threads share the memory and resources of the process to which they belong
• Economy:
  – allocating memory and resources to a process is costly
  – threads are faster to create and faster to switch between
• Utilization of multiprocessor architectures:
  – threads can run in parallel on different processors
  – a single-threaded process can run on only one processor, no matter how many are available


SLIDE 15


Thread Level Parallelism (TLP)

• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
• Goal: use multiple instruction streams to improve
  1. throughput of computers that run many programs
  2. execution time of multi-threaded programs
• TLP could be more cost-effective to exploit than ILP

SLIDE 16


New Approach: Multithreaded Execution

• Multithreading: multiple threads share the functional units of one processor via overlapping
  – the processor must duplicate the independent state of each thread: e.g. a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  – memory is shared through the virtual-memory mechanisms, which already support multiple processes
  – HW support for fast thread switching; much faster than a full process switch, which takes 100s to 1000s of clocks
• When to switch?
  – alternate instructions per thread (fine grain)
  – when a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

SLIDE 17


Fine-Grained Multithreading

• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
• Usually done in a round-robin fashion, skipping any stalled threads
• The CPU must be able to switch threads every clock
• Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
• Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads
• Used in Sun's Niagara
SLIDE 18


Coarse-Grained Multithreading

• Switches threads only on costly stalls, such as L2 cache misses
• Advantages
  – relieves the need for very fast thread switching
  – does not slow down a thread, since instructions from other threads are issued only when that thread encounters a costly stall

SLIDE 19


Coarse-Grained Multithreading

• Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  – since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen
  – the new thread must fill the pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• Used in the IBM AS/400
SLIDE 20


For most apps, most execution units lie idle

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.

For an 8-way superscalar.

SLIDE 21


Do both ILP and TLP?

• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented toward ILP exploit TLP?
  – functional units are often idle in a datapath designed for ILP, because of either stalls or dependences in the code
• Could TLP be used as a source of independent instructions that might keep the processor busy during stalls?
• Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

SLIDE 22


Simultaneous Multi-threading ...

[Figure: issue slots for cycles 1–9 across 8 functional units, first with one thread (many slots idle), then with two threads filling more of the slots]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

SLIDE 23


Simultaneous Multithreading (SMT)

• Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  – a large set of virtual registers that can hold the register sets of independent threads
  – register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  – out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• Just add a per-thread renaming table and keep separate PCs
  – independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"

SLIDE 24


Multithreaded Categories

[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; each slot is filled by Thread 1–Thread 5 or left idle]

SLIDE 25


Design Challenges in SMT

• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  – would a preferred-thread approach sacrifice neither throughput nor single-thread performance?
  – unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• Not affecting the clock cycle time, especially in:
  – instruction issue: more candidate instructions need to be considered
  – instruction completion: choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance