
SLIDE 1

CS 61A/CS 98-52

Mehrdad Niknami

University of California, Berkeley


SLIDE 2

Preliminaries

Today, we’re going to learn how to add & multiply. Exciting!

Let’s add two positive n-bit integers (n = 8 here):

    Carry:   1 111111
    Augend:   10110111
    Addend: + 10011101
    Sum:     101010100

This is called ripple-carry addition. Some questions:

1. How big can the sum be (at most)? What is the worst case?
2. How long does summation take in the worst case? Why?

...we’ll come back to this!
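
A minimal Python sketch of ripple-carry addition on little-endian bit lists (the function name and representation are illustrative, not from the slides):

    def ripple_carry_add(a_bits, b_bits):
        """Add two equal-length bit lists, least-significant bit first."""
        result, carry = [], 0
        for a, b in zip(a_bits, b_bits):
            total = a + b + carry         # 0, 1, 2, or 3
            result.append(total % 2)      # sum bit for this position
            carry = total // 2            # carry ripples into the next position
        result.append(carry)              # possible (n+1)-th bit
        return result

    a = [1, 1, 1, 0, 1, 1, 0, 1]          # 10110111, reversed
    b = [1, 0, 1, 1, 1, 0, 0, 1]          # 10011101, reversed
    print(ripple_carry_add(a, b))         # [0, 0, 1, 0, 1, 0, 1, 0, 1] -> 101010100

Each carry depends on the one before it, which is exactly why the worst case (e.g. adding 1 to 11111111) walks through all n positions.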

SLIDE 3

History

• First computer design (difference engine) in 1822 (!!) and later the analytical engine, by Charles Babbage (1791-1871)
• First description of “MIMD” parallelism in 1842 (!!!) in Sketch of The Analytical Engine Invented by Charles Babbage, by Luigi F. Menabrea
• First theory of computation by Alan Turing in 1936
• First electronic analog computer created in 1942 for bombing in WWII
• First electronic digital computer created in 1943 ⇒ Electronic Numerical Integrator and Computer (ENIAC)
• First description of parallel programs in 1958 (Stanley Gill)
• First multiprocessor system (Multics) in 1969
• Lots of parallel computing research starting in the 1970s... then faded away
• Multi-core systems reinvigorated parallel computing around 2001

SLIDE 4

History

Long story short...

• Parallel computing goes back longer than you think
• Lots of useful research from the 1900s is finding life again since processors stopped getting faster

SLIDE 5

Terminology

Some basic terminology:

• Process: A running program
  Processes cannot access each other’s memory by default
• Thread: A unit of program flow (N threads = N independent executions of code)
  Threads maintain their own execution contexts within a given process
• Thread context: All the information a thread needs to run code
  This includes the location of the code currently being executed, as well as the current stack frame (local variables, etc.)
• Concurrency: Overlapping operations (X begins before Y ends)
• Parallelism: Simultaneously-occurring operations (multiple operations happening at the same time)

SLIDE 6

Terminology

• Parallel operations are always concurrent by definition
• Concurrent operations need not be in parallel (open door, open window, close door, close window)
• Parallelism gives you a speed boost (multiple operations at the same time), but requires N processors for an N× speedup
• Concurrency allows you to avoid stopping one thing before starting another, and can occur on a single processor

SLIDE 7

Concepts

Distributed computation (running on multiple machines) is more difficult:

• Needs fault-tolerance (more machines = higher failure probability)
• Lack of shared memory
• More limited communication bandwidth (network slower than RAM)
• Time becomes problematic to handle
• Rich literature, e.g. actor-based models of computation (MoC) such as discrete-event, synchronous-reactive, synchronous dataflow, etc., for analyzing/designing systems with guaranteed performance or reliability

SLIDE 8

Threading

Threading example:

    import threading

    t = threading.Thread(target=print, args=('a',))
    t.start()
    print('b')  # may print 'b' before or after 'a'
    t.join()    # wait for t to finish

SLIDE 9

Threading

Race condition: When a thread attempts to access something being modified by another thread. Race conditions are generally bad.

Example:

    import threading

    lst = [0]

    def f():
        lst[0] += 1  # write 1 might occur after read 2

    t = threading.Thread(target=f)
    t.start()
    f()
    t.join()
    assert lst[0] in [1, 2]  # could be any of these!

SLIDE 10

Concurrency Control

Mutex (Lock in Python): Object that can prevent concurrent access (mutual exclusion).

Example:

    import threading

    lock = threading.Lock()
    lst = [0]

    def f():
        lock.acquire()  # waits for mutex to be available
        lst[0] += 1     # only one thread may run this code
        lock.release()  # makes mutex available to others

    t = threading.Thread(target=f)
    t.start()
    f()
    t.join()
    assert lst[0] in [2]  # will always succeed

SLIDE 11

Concurrency Control

Sadly, in CPython, multithreaded operations cannot occur in parallel, because there is a “global interpreter lock” (GIL). Therefore, Python code cannot be sped up with threads in CPython.¹

To obtain parallelism in CPython, you can use multiprocessing: running another copy of the program and communicating with it.

Jython, IronPython, etc. can run Python in parallel, and most other languages support parallelism as well.

¹ However, Python code can release the GIL when calling non-Python code.

SLIDE 12

Inter-Thread and Inter-Process Communication (IPC)

Threads/processes need to communicate. Common techniques:

• Shared memory: mutating shared objects (if all on 1 machine)
  Pros: Reduces copying of data (faster/less memory)
  Cons: Must block execution until lock is acquired (slow)
• Message-passing: sending data through thread-safe queues
  Pros: Queue can buffer & work asynchronously (faster)
  Cons: Increases need to copy data (slower/more memory)
• Pipes: synchronous version of message-passing (“rendezvous”)
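
Pipes are listed above without an example; a minimal sketch using multiprocessing.Pipe (the worker function square is made up for illustration):

    from multiprocessing import Process, Pipe

    def square(conn):
        x = conn.recv()           # blocks until the other end sends a value
        conn.send(x ** 2)         # send the result back over the same pipe
        conn.close()

    if __name__ == '__main__':
        parent_end, child_end = Pipe()
        p = Process(target=square, args=(child_end,))
        p.start()
        parent_end.send(7)
        print(parent_end.recv())  # 49
        p.join()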

SLIDE 13

Inter-Thread and Inter-Process Communication (IPC)

Message-passing example for parallelizing f(x) = x²:

    from multiprocessing import Process, Queue

    def f(q_in, q_out):
        while True:
            x = q_in.get()
            if x is None:
                break
            q_out.put(x ** 2)  # real work

    if __name__ == '__main__':  # only on main thread
        qs = (Queue(), Queue())
        procs = [Process(target=f, args=qs) for _ in range(4)]
        for proc in procs: proc.start()
        for i in range(10): qs[0].put(i)        # send inputs
        for i in range(10): print(qs[1].get())  # receive outputs
        for proc in procs: qs[0].put(None)      # notify finished
        for proc in procs: proc.join()

SLIDE 14

Addition

Common parallelism technique: divide-and-conquer

1. Divide problem into separate subproblems
2. Solve subproblems in parallel
3. Merge sub-results into main result

XOR (and AND, and OR) are easy to parallelize:

1. Split each n-bit number into p pieces
2. XOR each n/p-bit pair of numbers independently
3. Put the bits back together

Can we do something similar with addition?
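
A minimal Python sketch of this recipe applied to XOR, splitting each number into pieces and combining them with a process pool (parallel_xor, xor_piece, and the sizes are illustrative):

    from multiprocessing import Pool

    def xor_piece(pair):
        a_piece, b_piece = pair
        return a_piece ^ b_piece                      # no carries, so pieces are independent

    def parallel_xor(a, b, n=16, p=4):
        w = n // p                                    # bits per piece
        mask = (1 << w) - 1
        pieces = [((a >> (w * i)) & mask, (b >> (w * i)) & mask) for i in range(p)]
        with Pool(p) as pool:
            results = pool.map(xor_piece, pieces)     # solve subproblems in parallel
        out = 0
        for i, r in enumerate(results):               # put the bits back together
            out |= r << (w * i)
        return out

    if __name__ == '__main__':
        a, b = 0b1011011110110111, 0b1001110110011101
        print(parallel_xor(a, b) == a ^ b)            # True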

SLIDE 15

Addition

Let’s go back to addition. We have two n-bit numbers to add. What if we take the same approach for + as for XOR?

1. Split each n-bit number into p pieces
2. Add each n/p-bit pair of numbers independently
3. Put the bits back together
4. ...
5. Profit? No? What’s wrong?

We need to propagate carries! How long does it take? Θ(n) time. (How) can we do better?
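
To see the problem concretely, a tiny sketch that splits two numbers into pieces, adds the pieces independently, and drops the carry between pieces (the helper name is made up):

    def add_pieces_naively(a, b, n=8, p=2):
        w = n // p
        mask = (1 << w) - 1
        out = 0
        for i in range(p):
            piece_sum = ((a >> (w * i)) & mask) + ((b >> (w * i)) & mask)
            out |= (piece_sum & mask) << (w * i)   # the carry out of each piece is lost!
        return out

    a, b = 0b10110111, 0b10011101
    print(bin(add_pieces_naively(a, b)))   # 0b1000100   -- wrong
    print(bin(a + b))                      # 0b101010100 -- right, carries propagated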

SLIDE 16

Addition

Key idea #1: A carry can be either 0 or 1... and we add different pieces in parallel... and then select the correct one based on the carry!
⇒ This is called a carry-select adder.

Key idea #2: We can do this recursively.
⇒ This is called a conditional-sum adder.

How fast is a conditional-sum adder?

• Running time is proportional to the maximum propagation depth
• We solve two problems of half the size simultaneously
• We combine solutions with constant extra work
• Therefore, parallel running time is Θ(log n)
• However, we do more work: T(n) = 2T(n/2) + cn = Θ(n log n)
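
A rough sequential Python sketch of one level of the carry-select idea (in hardware the three half-width additions happen in parallel, and the conditional-sum adder applies the same trick recursively inside each half; the function name is illustrative):

    def carry_select_add(a, b, n=8):
        w = n // 2
        mask = (1 << w) - 1
        a_lo, a_hi = a & mask, a >> w
        b_lo, b_hi = b & mask, b >> w
        lo = a_lo + b_lo                 # low half (its carry-in is 0)
        hi0 = a_hi + b_hi                # high half assuming carry-in 0
        hi1 = a_hi + b_hi + 1            # high half assuming carry-in 1
        carry = lo >> w                  # did the low half overflow?
        hi = hi1 if carry else hi0       # select the precomputed result
        return (hi << w) | (lo & mask)

    a, b = 0b10110111, 0b10011101
    print(carry_select_add(a, b) == a + b)   # True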

SLIDE 17

Addition

Other algorithms also exist with different trade-offs:

• Carry-skip adder
• Carry-lookahead adder (CLA)
• Kogge–Stone adder (“parallel-prefix” CLA; widely used)
• Brent–Kung adder
• Han–Carlson adder
• Lynch–Swartzlander spanning tree adder (fastest?)

...I don’t know them. But Θ(log n) is already asymptotically optimal. :-)

Some algorithms are better suited for hardware due to lower “fan-out”: e.g. 1 bit is too “weak” to drive 16 bits all by itself.

SLIDE 18

Multiplication

How do we multiply?

    Multiplicand:          10110111
    Multiplier:          * 10011101

                           10110111
                        + 00000000
                       + 10110111
                      + 10110111
                     + 10110111
                    + 00000000
                   + 00000000
                  + 10110111

    Product:        111000000111011
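
A minimal Python sketch of the shift-and-add scheme above (the helper name is made up):

    def shift_and_add_multiply(a, b, n=8):
        partials = []
        for i in range(n):
            bit = (b >> i) & 1
            partials.append((a if bit else 0) << i)   # multiply by one bit, then shift
        total = 0
        for p in partials:                            # n additions of partial products
            total += p
        return total

    print(bin(shift_and_add_multiply(0b10110111, 0b10011101)))   # 0b111000000111011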

SLIDE 19

Multiplication

For two n-bit numbers, how long does it take in parallel?

• Multiplication by 1 is a copy, taking Θ(1) depth
• There are n additions
• Divide-and-conquer therefore takes Θ(log n) additions
• Each addition takes Θ(log n) depth
• Total depth is therefore Θ((log n)²)

...can we do better? :-) How?
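
A small sketch of where the Θ(log n) additions come from: the n partial products are paired up into a balanced tree, and each round’s additions are independent of one another (run sequentially here; names are illustrative):

    def tree_sum(values):
        values = list(values)
        while len(values) > 1:
            if len(values) % 2:
                values.append(0)                  # pad to an even length
            values = [values[i] + values[i + 1]   # one round of independent additions
                      for i in range(0, len(values), 2)]
        return values[0]

    partials = [(0b10110111 if (0b10011101 >> i) & 1 else 0) << i for i in range(8)]
    print(bin(tree_sum(partials)))   # 0b111000000111011, after only 3 rounds of additions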

SLIDE 20

Multiplication

Carry-save addition: reduce every a + b + c into r + s in parallel:

• Compute all carry bits r independently ⇒ This is just OR, so Θ(1) depth
• Compute all sums-excluding-carries s independently ⇒ This is just XOR, so Θ(1) depth
• Recurse on the new r₁ + s₁ + r₂ + s₂ + ... until a final r + s is obtained ⇒ This takes Θ(log n) levels of recursion
• Compute the final sum in an additional Θ(log n) depth
• Total depth is therefore Θ(log n)!²

² Simplified; detailed analysis is a little tedious. See here.
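
A minimal sketch of one carry-save reduction step; per bit, the carry is computed here with the standard full-adder majority formula rather than the slide’s simplified OR (the footnote flags that simplification). Names are illustrative:

    def carry_save(a, b, c):
        s = a ^ b ^ c                              # per-bit sums, ignoring carries
        r = ((a & b) | (a & c) | (b & c)) << 1     # per-bit carries, shifted into place
        return r, s                                # invariant: r + s == a + b + c

    a, b, c = 0b10110111, 0b10011101, 0b01100110
    r, s = carry_save(a, b, c)
    print(r + s == a + b + c)                      # True: one ordinary add finishes the job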

SLIDE 21

Parallel Prefix

There isn’t too much that is special about addition among basic arithmetic: often the same tricks apply to any binary operator that is associative! Parallel addition can be generalized this way, called “parallel prefix”:

• Say we want to compute the cumulative sum of 1, 2, 3, ...
• First, group into a binary tree: (((1 2) (3 4)) ((5 6) (7 8))) ...
• Then, evaluate sums for all nodes recursively toward the root
• Finally, propagate sums back down from the root to right-hand children

This is a very flexible operation, useful as a basic parallel building block. (More notes can be found on MIT’s website.)
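
A sequential Python sketch of this recursion (each list comprehension corresponds to one parallel round; the function name is illustrative):

    def prefix_sums(xs):
        """Cumulative sums of xs, computed with the tree recursion described above."""
        if len(xs) <= 1:
            return list(xs)
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]   # combine neighbours
        pair_prefix = prefix_sums(pairs)              # recurse on a half-size problem
        out = []
        for i, x in enumerate(xs):                    # propagate back down the tree
            if i == 0:
                out.append(x)
            elif i % 2:                               # odd index: a whole pair prefix
                out.append(pair_prefix[i // 2])
            else:                                     # even index: previous pair prefix + x
                out.append(pair_prefix[i // 2 - 1] + x)
        return out

    print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]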

SLIDE 22

MapReduce

A common pattern for parallel data processing is:

    from functools import reduce

    outputs = map(lambda x: ..., inputs)
    result = reduce(lambda r, x: ..., outputs, initial)

• map you have already seen: it transforms elements
• reduce is anything like +, × to summarize elements
• Transformations are assumed to ignore order (to allow parallelism)
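
A tiny made-up instance of the pattern, counting words across documents (documents, counts, and total_words are illustrative names):

    from functools import reduce

    documents = ["the quick brown fox", "the lazy dog", "the end"]

    # map: transform each document independently (the parallelizable part)
    counts = map(lambda doc: len(doc.split()), documents)

    # reduce: summarize the results with an associative operator (+)
    total_words = reduce(lambda r, x: r + x, counts, 0)
    print(total_words)   # 9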

SLIDE 23

MapReduce

Google recognized this and built a fast framework called MapReduce for automatically parallelizing & distributing such code across a cluster:

• MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat (2004)
• System and method for efficient large-scale data processing, U.S. Patent 7,650,331
• Fault-tolerance is handled automatically (why is this possible?)
• Apache Hadoop was later developed as an open-source implementation
• “MapReduce” became a general programming model for distributed data processing
• Spark (Matei Zaharia, UCB AMPLab, now at Databricks) was developed as a faster implementation that processes data in RAM

SLIDE 24

MapReduce

Parallel map is easy in Python!

    >>> import math
    >>> from multiprocessing.pool import Pool
    >>> pool = Pool()
    >>> pool.map(math.sqrt, [1, 2, 3, 4])
    [1.0, 1.4142135623730951, 1.7320508075688772, 2.0]

This is a higher-level threading construct that makes your life simpler.

SLIDE 25

MapReduce

Not everything fits into a MapReduce model:

• Inputs may be generated on the fly
• Mappers might depend on many inputs
• Mappers may need lots of communication
• Computation may not be nicely “layered” at all
• ...

Parallel & distributed computation is still an open research problem.