Efficient Multi-Ported Memories for FPGAs

SLIDE 1

Efficient Multi-Ported Memories for FPGAs

Eric LaForest, Greg Steffan
University of Toronto
Computer Engineering Research Group
February 22, 2010

SLIDE 2

Parallelism in FPGAs

• Larger SoCs on FPGAs → Parallel systems
• Parallel systems on FPGAs will need:
  − Queueing
  − Data sharing
  − Communication
  − Synchronization
• Boils down to:
  − FIFOs
  − Register files

We can do all these with multi-ported memories

SLIDE 3

Multi-Ported Memory

Existing workarounds are ad-hoc, “roll-your-own”, and have limited parallelism.

SLIDE 4

Conventional Approaches

SLIDE 5

2W/2R Multi-Ported Memory

• Doesn't exist on FPGAs
• Altera used to have one (Mercury)

SLIDE 6

Stratix III Building Blocks

Adaptive Logic Modules (ALMs): flexible, but slow
  − Registers
  − LUTs
  − Adders

Block RAMs: fast, but inflexible
  − M9K (e.g. 32 x 256)
  − M144K (e.g. 32 x 4096)

SLIDE 7

2W/2R Pure-ALM

Scales very poorly with memory depth

SLIDE 8

1W/nR Replication

• Multiple read ports
• Only one write port
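A minimal behavioral sketch of replication in Python, to make the trade-off concrete. The class and method names are hypothetical and each Python list merely stands in for one block RAM; this is not the authors' hardware description.

```python
class ReplicatedMemory:
    """1W/nR memory modeled as n identical copies of a 1W/1R memory."""

    def __init__(self, depth, num_read_ports):
        # One replica per read port; each list stands in for one block RAM.
        self.replicas = [[0] * depth for _ in range(num_read_ports)]

    def write(self, addr, value):
        # The single write port updates every replica so they stay identical.
        for replica in self.replicas:
            replica[addr] = value

    def read(self, port, addr):
        # Each read port only ever touches its own replica.
        return self.replicas[port][addr]


mem = ReplicatedMemory(depth=8, num_read_ports=2)
mem.write(3, 42)
print(mem.read(0, 3), mem.read(1, 3))
```

Adding a read port costs one more full copy of the memory, and there is still only one place where a write can enter.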

SLIDE 9

mW/nR Banking

• Multiple write ports
• Fragmented data
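The fragmentation problem can be illustrated with a small Python model. It assumes one bank per write port, with each write port able to reach only its own bank; the names are hypothetical and the lists stand in for block RAMs.

```python
class BankedMemory:
    """mW banked memory: write port i can only write into bank i."""

    def __init__(self, depth, num_banks):
        self.banks = [[0] * depth for _ in range(num_banks)]

    def write(self, write_port, addr, value):
        # Assumption: a fixed one-to-one mapping from write port to bank.
        self.banks[write_port][addr] = value

    def read(self, bank, addr):
        # The reader must already know which bank holds the data it wants.
        return self.banks[bank][addr]


mem = BankedMemory(depth=4, num_banks=2)
mem.write(0, 1, 42)   # write port 0 stores 42 at address 1
mem.write(1, 1, 23)   # write port 1 stores 23 at the same address
print(mem.read(0, 1), mem.read(1, 1))
```

Both writes complete in parallel, but address 1 now holds two different values depending on the bank, which is the fragmentation noted on the slide.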

SLIDE 10

mW/nR “Multipumping”

• Multiple read/write ports
• No fragmentation
• Divides clock speed
• Read/write ordering
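A rough Python sketch of multipumping, under stated assumptions: one underlying memory is accessed `pump_factor` times per external clock cycle, and all reads in a cycle are performed before any writes. That ordering is one possible policy chosen for illustration; the slides only say that ordering must be considered. All names are hypothetical.

```python
class MultipumpedMemory:
    """One memory time-multiplexed over several internal cycles per external cycle."""

    def __init__(self, depth, pump_factor):
        self.mem = [0] * depth
        self.pump_factor = pump_factor  # internal cycles per external cycle

    def external_cycle(self, read_addrs, writes):
        """read_addrs: list of addresses; writes: list of (addr, value) pairs."""
        if len(read_addrs) > self.pump_factor or len(writes) > self.pump_factor:
            raise ValueError("more ports requested than internal cycles available")
        # Assumed ordering: every read sees the state from before this cycle's writes.
        read_data = [self.mem[addr] for addr in read_addrs]
        for addr, value in writes:
            self.mem[addr] = value
        return read_data

    def external_fmax_mhz(self, internal_fmax_mhz):
        # The external clock is the internal clock divided by the pump factor.
        return internal_fmax_mhz / self.pump_factor


mem = MultipumpedMemory(depth=4, pump_factor=2)
mem.external_cycle([], [(0, 5), (1, 6)])
print(mem.external_cycle([0, 1], [(0, 9)]))
print(mem.external_fmax_mhz(400))
```

All ports see one consistent memory, so there is no fragmentation, but the usable external clock rate falls in proportion to the pump factor.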

SLIDE 11

Block RAMs: Simple Dual Port

Ports: Read, Write (one dedicated read port and one dedicated write port)

SLIDE 12

Block RAMs: True Dual Port

Ports: R / W, R / W (each of the two ports can read or write)

SLIDE 13

“Pure Multipumping”

Read as banked memory (multiple reads)

SLIDE 14

“Pure Multipumping”

Write as replicated memory (avoids fragmentation)

SLIDE 15

Methodology

• Generate design variations over the design space
  − Vary # of ports, depth, type of memories
    • 1W/2R to 8W/16R
    • 2 to 256 elements deep
    • Pure-ALM, M9K, MLAB, Multipumped
  − Wrap in testbench for timing and correctness
• Quartus 9.0, targeting Stratix III
  − No synthesis optimizations for speed or area
  − Standard P&R effort (speed, avg. over 10 runs)
• Measure area as Total Equivalent Area
  − Expresses area in a single unit (ALMs)
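The sweep described above could be enumerated along these lines. This sketch assumes power-of-two port counts and depths and a fixed two-reads-per-write ratio, extrapolated from the 1W/2R, 2W/4R, 4W/8R and 8W/16R configurations named in the slides; the exact points the authors generated are not listed, so treat the grid as illustrative only.

```python
from itertools import product

# Assumption: write-port counts double from 1 to 8, with reads = 2 x writes.
write_ports = [1, 2, 4, 8]
# Assumption: depths are the powers of two from 2 to 256 elements.
depths = [2 ** k for k in range(1, 9)]
# Memory types named on the slide.
memory_types = ["Pure-ALM", "M9K", "MLAB", "Multipumped"]

configs = [
    {"writes": w, "reads": 2 * w, "depth": d, "type": t}
    for w, d, t in product(write_ports, depths, memory_types)
]

for cfg in configs[:3]:
    print(cfg)
print(len(configs), "design points")
```

Each design point would then be wrapped in a testbench and pushed through synthesis and place-and-route, which this sketch does not attempt to model.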

SLIDE 16

Conventional Multi-Porting Performance

SLIDE 17

1W/2R Pure-ALM Area vs. Speed

[Plot: area vs. speed; arrows mark the "Smaller" and "Faster" directions]

Reference point: NiosII/f (290 MHz, 500 ALMs)

Too big and slow!

SLIDE 18

1W/2R Replicated vs. Pure-ALM

SLIDE 19

1W/2R “Pure Multipumping”

SLIDE 20

LVT-Based Multi-Ported Memories

SLIDE 21

LVT-Based Memory

SLIDE 22

LVT-Based Memory

Begin with one block RAM

SLIDE 23

LVT-Based Memory

Replicate for two read ports

SLIDE 24

LVT-Based Memory

Bank for two write ports

SLIDE 25

LVT-Based Memory

Select bank to read from

SLIDE 26

LVT-Based Memory

Add bank lookup table

SLIDE 27

LVT-Based Memory

SLIDE 28

Live Value Table Operation

SLIDE 29

LVT Operation

2W/2R, 4-deep

SLIDE 30

LVT Operation

[Diagram: write ports W0 and W1 feed write addresses, and read ports R0 and R1 feed read addresses, into the Live Value Table; table locations labeled 1, 2, 3]

SLIDE 31

LVT Operation: Write

[Diagram: W0 and W1 write 42 @ 1 and 23 @ 3; the Live Value Table entry for each written location is updated]

Records which write port last updated a location

SLIDE 32

LVT Operation: Read

[Diagram: R0 and R1 read @ 1 and @ 3; the Live Value Table entry for each address selects the bank to read from]

Steers read port to correct memory bank
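The write and read steps can be sketched as a small Python behavioral model of a 2W/2R, 4-deep LVT-based memory. The class and field names are hypothetical, the lists stand in for block RAMs, and the assignment of 42 @ 1 to W0 and 23 @ 3 to W1 is assumed from the order in which the values appear on the slide.

```python
class LVTMemory:
    """mW/nR memory: one bank per write port, each bank replicated per read port,
    plus a Live Value Table (LVT) naming the last writer of every address."""

    def __init__(self, depth, num_write_ports, num_read_ports):
        # banks[w][r] models the block RAM written by port w and read by port r.
        self.banks = [[[0] * depth for _ in range(num_read_ports)]
                      for _ in range(num_write_ports)]
        # lvt[addr] holds the index of the write port that last wrote addr.
        self.lvt = [0] * depth

    def write(self, write_port, addr, value):
        # Update every replica in this write port's bank, then record the writer.
        for replica in self.banks[write_port]:
            replica[addr] = value
        self.lvt[addr] = write_port

    def read(self, read_port, addr):
        # The LVT steers the read port to the bank holding the live value.
        live_bank = self.lvt[addr]
        return self.banks[live_bank][read_port][addr]


mem = LVTMemory(depth=4, num_write_ports=2, num_read_ports=2)
mem.write(0, 1, 42)   # assumed: W0 writes 42 @ 1
mem.write(1, 3, 23)   # assumed: W1 writes 23 @ 3
print(mem.lvt)
print(mem.read(0, 1), mem.read(1, 3))
```

In this model a stale value may remain in the other bank after a write, but it is never read because the LVT always points at the most recent writer.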

SLIDE 33

LVT Implementation

LVT remains practical because it is very narrow

SLIDE 34

LVT Operation

Small Pure-ALM memory controlling larger block RAMs

SLIDE 35

Advantages of LVTs

• LVTs add a layer of indirection
  − Everything operates in parallel
  − Makes banked memory behave as consistent unit
• LVTs are narrow
  − Word width = log2(# of write ports), typically < 4 bits
  − Pure-ALM, but practical size and speed
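To put numbers on how narrow the LVT is, here is a small Python calculation. It rounds log2 up for port counts that are not powers of two, and the 32-bit data word used for comparison is an assumption borrowed from the block RAM examples earlier in the deck; the helper names are hypothetical.

```python
def lvt_word_width(num_write_ports):
    """Bits needed to name one write port: ceil(log2(num_write_ports))."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (num_write_ports - 1).bit_length()


def lvt_total_bits(depth, num_write_ports):
    """Total storage in the LVT: one narrow word per memory location."""
    return depth * lvt_word_width(num_write_ports)


for ports in (2, 4, 8):
    print(ports, "write ports ->", lvt_word_width(ports), "bit(s) per entry")

depth, data_width = 256, 32   # data_width is an assumed example value
print("LVT bits:", lvt_total_bits(depth, 8))
print("Data bits in one bank copy:", depth * data_width)
```

Even at 8 write ports the LVT stores only a few bits per location, which is why building it from ALMs stays practical while the wide data words remain in block RAM.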

SLIDE 36

LVT Performance

SLIDE 37

2W/4R Pure-ALM

SLIDE 38

2W/4R LVT-based vs. Pure-ALM

84% smaller
43% faster

412 MHz to 375 MHz

SLIDE 39

2W/4R Multipumping

Must be careful about read/write ordering!

SLIDE 40

Multipumping Performance

SLIDE 41

2W/4R Multipumping

SLIDE 42

2W/4R Multipumping

Pure Multipumping (279 MHz)

SLIDE 43

4W/8R Multipumping

Worsens as # of ports increases

SLIDE 44

2W/4R Multipumping

28% smaller on average

193 MHz to 174 MHz

54% slower on average

SLIDE 45

Conclusions

• Pure multipumped memories are better for memories with few ports or low speed.
• LVT-based memories are faster and smaller than Pure-ALM memories.
• LVT-based memories are faster than pure multipumping, but at a cost in area.

SLIDE 46

Future Work

• Pure multipumping for LVT-based memories
  − Build banks with 2W/4R pure multipumping blocks
  − Possible further area improvement
• Relaxing the read/write order for multipumping
  − Allows multiplexing the write ports
  − Leaves designer to watch for WAR violations

SLIDE 47

Thank You