SLIDE 1

CADSL

Simultaneous Multi-Threaded Design

Virendra Singh

Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

EE-739: Processor Design

Lecture 34 (09 April 2013)

SLIDE 2

Program vs Process

• A program is a passive entity that specifies the logic of data manipulation and I/O actions
• A process is an active entity that performs the actions specified in a program
• Multiple executions of a program lead to concurrent processes


09 Apr 2013 EE-739@IITB 2

SLIDE 3

Process

• A process is a program in execution that can be in a number of states
  – running, waiting, ready, terminated
• Process creation
  – fork() and exec() system calls
• Inter-process communication
  – shared memory and message passing
• Client–server communication
  – sockets, RPC, RMI


SLIDE 4

Threads

• A thread (a lightweight process) is a basic unit of CPU utilization.
• A thread has a single sequential flow of control.
• A thread comprises a thread ID, a program counter, a register set, and a stack.
• A process is the execution environment in which threads run.
  – (Recall the earlier definition: a process is a program in execution.)
• The process holds the code section, data section, and OS resources (e.g. open files and signals).
• Traditional processes have a single thread of control.
• Multi-threaded processes have multiple threads of control.
  – The threads share the address space and resources of their process.


SLIDE 5

Single and Multithreaded Processes

• Threads encapsulate concurrency: the "active" component.
• Address spaces encapsulate protection: the "passive" part, keeping a buggy program from trashing the system.


SLIDE 6

Processes vs. Threads

Which of the following belong to the process and which to the thread?

Program code: Process
Local or temporary data: Thread
Global data: Process
Allocated resources: Process
Execution stack: Thread
Memory-management info: Process
Program counter: Thread
Parent identification: Process
Thread state: Thread
Registers: Thread


SLIDE 7

Control Blocks

• The thread control block (TCB) contains:
  – thread state, program counter, registers
• PCB′ = everything else (e.g. process ID, open files, etc.)
• The process control block (PCB) = PCB′ ∪ TCB
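As a rough illustration of this split, a hypothetical layout (not any particular OS's actual control blocks) might look like:

```c
#include <stdint.h>

/* hypothetical TCB layout, for illustration only */
typedef struct {
    int       state;        /* running, ready, waiting, ... */
    uintptr_t pc;           /* program counter */
    uintptr_t regs[16];     /* saved register set */
    uintptr_t stack_ptr;    /* top of this thread's stack */
} tcb_t;

/* PCB' : the per-process state that is NOT per-thread */
typedef struct {
    int   pid;
    void *page_table;       /* memory-management info */
    int   open_files[16];   /* OS resources */
} pcb_prime_t;

/* PCB = PCB' U TCB(s): a multi-threaded process carries one
 * copy of the shared state plus one TCB per thread */
typedef struct {
    pcb_prime_t shared;
    tcb_t       threads[8];
    int         nthreads;
} pcb_t;
```

The point of the decomposition: a thread switch saves and restores only a tcb_t, while a process switch must also change the pcb_prime_t state (address space, open files), which is why thread switches are cheaper.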


SLIDE 8

Why use threads?

• Because threads have minimal internal state, it takes less time to create a thread than a process (about a 10x speedup in UNIX).
• It takes less time to terminate a thread.
• It takes less time to switch to a different thread.
• A multi-threaded process is much cheaper than multiple (redundant) processes.


SLIDE 9

Examples of Using Threads

• Threads are useful for any application with multiple tasks that can run under separate threads of control.
• A word processor may have separate threads for:
  – user input
  – spelling and grammar checking
  – displaying graphics
  – document layout
• A web server may spawn a thread for each client
  – it can serve clients concurrently with multiple threads
  – using multiple threads incurs less overhead than using multiple processes


SLIDE 10

Examples of multithreaded programs

• Most modern OS kernels
  – internally concurrent, because they must handle concurrent requests from multiple users
  – but no protection is needed within the kernel
• Database servers
  – access to shared data by many concurrent users
  – background utility processing must also be done
• Parallel programming (more than one physical CPU)
  – split a program into multiple threads for parallel execution


SLIDE 11

Multithreaded Matrix Multiply...

A × B = C

C[1,1] = A[1,1]·B[1,1] + A[1,2]·B[2,1] + …
In general, C[i,j] = Σ_k A[i,k]·B[k,j]: the sum of products of corresponding elements in row i of A and column j of B.

Each resultant element can be computed independently.


SLIDE 12

Multithreaded Matrix Multiply

typedef struct {
    int id;
    int size;
    int row, column;
    matrix_t *MA, *MB, *MC;
} matrix_work_order_t;

main()
{
    int size = ARRAY_SIZE, row, column;
    matrix_t MA, MB, MC;
    matrix_work_order_t *work_orderp;
    pthread_t peer[size*size];
    ...


SLIDE 13

Multithreaded Matrix Multiply

/* process matrix, by row, column */
for (row = 0; row < size; row++)
    for (column = 0; column < size; column++) {
        id = column + row * ARRAY_SIZE;
        work_orderp = malloc(sizeof(matrix_work_order_t));
        /* initialize all members of work_orderp */
        pthread_create(&peer[id], NULL, peer_mult, work_orderp);
    }

/* wait for all peers to exit */
for (i = 0; i < size*size; i++)
    pthread_join(peer[i], NULL);
}


SLIDE 14

Benefits

• Responsiveness:
  – threads allow a program to continue running even if part of it is blocked
  – for example, a web browser can accept user input while loading an image
• Resource sharing:
  – threads share the memory and resources of the process to which they belong
• Economy:
  – allocating memory and resources to a process is costly
  – threads are faster to create and faster to switch between
• Utilization of multiprocessor architectures:
  – threads can run in parallel on different processors
  – a single-threaded process can run on only one processor, no matter how many are available


SLIDE 15


Thread Level Parallelism (TLP)

• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
• Goal: use multiple instruction streams to improve
  1. throughput of computers that run many programs
  2. execution time of multi-threaded programs
• TLP could be more cost-effective to exploit than ILP

SLIDE 16


New Approach: Multithreaded Execution

• Multithreading: multiple threads share the functional units of one processor via overlapping
  – the processor must duplicate the independent state of each thread: e.g. a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  – memory is shared through the virtual-memory mechanisms, which already support multiple processes
  – HW support for fast thread switching; much faster than a full process switch, which takes 100s to 1000s of clocks
• When to switch?
  – alternate instructions per thread (fine grain)
  – when a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

SLIDE 17


Fine-Grained Multithreading

• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
• Usually done in a round-robin fashion, skipping any stalled threads
• The CPU must be able to switch threads every clock
• Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
• Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads
• Used in Sun's Niagara
SLIDE 18


Coarse-Grained Multithreading

• Switches threads only on costly stalls, such as L2 cache misses
• Advantages
  – relieves the need for very fast thread switching
  – does not slow down a thread, since instructions from other threads are issued only when that thread encounters a costly stall

SLIDE 19


Coarse-Grained Multithreading

• Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  – since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen
  – the new thread must fill the pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• Used in the IBM AS/400
SLIDE 20


For most apps, most execution units lie idle

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.

For an 8-way superscalar.

SLIDE 21


Do both ILP and TLP?

• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented toward ILP exploit TLP?
  – functional units are often idle in a datapath designed for ILP, because of either stalls or dependences in the code
• Could TLP be used as a source of independent instructions that might keep the processor busy during stalls?
• Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

SLIDE 22


Simultaneous Multi-threading ...

[Figure: issue slots for cycles 1–9 across 8 functional units, first with one thread (many slots idle), then with two threads filling more of the slots]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

SLIDE 23


Simultaneous Multithreading (SMT)

• Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  – a large set of virtual registers that can hold the register sets of independent threads
  – register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  – out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• Just add a per-thread renaming table and keep separate PCs
  – independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"

SLIDE 24


Multithreaded Categories

[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; each slot is filled by Thread 1–Thread 5 or left idle]

SLIDE 25


Design Challenges in SMT

• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  – would a preferred-thread approach sacrifice neither throughput nor single-thread performance?
  – unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• Not affecting the clock cycle time, especially in:
  – instruction issue: more candidate instructions need to be considered
  – instruction completion: choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance