Non-volatile memory & Datapath component (3), Prof. Usagi



slide-1
SLIDE 1

Non-volatile memory & Datapath component (3)

  • Prof. Usagi
slide-2
SLIDE 2

Recap: Memory “hierarchy” in modern processor architectures

2

[Figure: memory hierarchy, fastest at the top and larger toward the bottom]

Level                                    Capacity          Latency
Processor Core: Registers                32 or 64 words    < 1ns
SRAM $ (L1 $, L2 $, L3 $)                KBs ~ MBs         a few ns
DRAM                                     GBs               tens of ns
Storage                                  TBs

slide-3
SLIDE 3

[Figure: a register built from D flip-flops that share one Clk; flip-flop i takes Input i on D and drives Output i on Q]

  • Register: a sequential component that can store multiple bits
  • A basic register can be built simply by using multiple D-FFs

3

Recap: Registers

slide-4
SLIDE 4

Recap: A Classical 6-T SRAM Cell

4

[Figure: 6-T SRAM cell with wordline, bitline, bitline’, storage nodes Q and Q’, and a sense amplifier]

slide-5
SLIDE 5

MUX

Recap: SRAM array

5

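[Figure: SRAM array. The upper bits of the address drive a decoder (rows 1, 2, ..., n-1), each row holds words wd0, wd1, wd2, ..., wd(m-1), every column ends in a sense amp, and the lower bits of the address drive the MUX.]

  • We can only work on cells sharing the same word line simultaneously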

slide-6
SLIDE 6
  • 1 transistor (rather than 6)
  • Relies on a large capacitor to store a bit
  • Write: the transistor conducts, and the data voltage level gets stored on the top plate of the capacitor
  • Read: look at the value of d
  • Problem: the capacitor discharges over time
  • Must “refresh” regularly, by reading d and then writing it right back

6

Recap: DRAM cell

wordline data

slide-7
SLIDE 7

Recap: DRAM array

7

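[Figure: DRAM array. The upper bits of the address drive the row decoder (rows 1, 2, ..., n-1), the selected row is copied into the row buffer, and the lower bits of the address drive the MUX.]

  • Row buffer: usually 4K, the page size of your OS!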
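
The upper-bits/lower-bits split in the DRAM array can be sketched in C. This is only an illustration, assuming a byte-addressed array with the 4K row buffer quoted on the slide; the names `COLUMN_BITS`, `dram_row`, `dram_column`, and `same_row` are hypothetical and do not come from the lecture.

```c
#include <stdint.h>

/* Assumption: 4 KB row buffer, so the low 12 address bits pick a byte in the row. */
#define ROW_BUFFER_BYTES 4096u
#define COLUMN_BITS 12u

/* Upper address bits go to the row decoder. */
uint32_t dram_row(uint32_t addr) {
    return addr >> COLUMN_BITS;
}

/* Lower address bits drive the MUX that selects data from the row buffer. */
uint32_t dram_column(uint32_t addr) {
    return addr & (ROW_BUFFER_BYTES - 1u);
}

/* Two accesses can share an open row buffer when their row indices match. */
int same_row(uint32_t a, uint32_t b) {
    return dram_row(a) == dram_row(b);
}
```

Under this assumption, addresses 0x1000 and 0x1FFF fall in the same row, while 0x0FFF and 0x1000 do not.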

slide-8
SLIDE 8

Recap: Latency of volatile memory

8

            Size (transistors per bit)    Latency (ns)
Register    18T                           ~ 0.1 ns
SRAM        6T                            ~ 0.5 ns
DRAM        1T                            50-100 ns

slide-9
SLIDE 9
  • Which side is faster in executing the for-loop?
  • A. Left
  • B. Right
  • C. About the same

9

Recap: Thinking about programming

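Left:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    struct student_record {
        int id;
        double homework;
        double midterm;
        double final;
    };

    /* init() fills the records; its definition is not shown on the slide. */
    void init(int number_of_records, struct student_record *records);

    int main(int argc, char **argv) {
        int i, j;
        double midterm_average = 0.0;
        int number_of_records = 10000000;
        struct timeval time_start, time_end;
        struct student_record *records;
        records = (struct student_record*)malloc(sizeof(struct student_record)*number_of_records);
        init(number_of_records, records);
        /* Each midterm read also drags id, homework, and final through the memory system. */
        for (j = 0; j < 100; j++)
            for (i = 0; i < number_of_records; i++)
                midterm_average += records[i].midterm;
        printf("average: %lf\n", midterm_average / number_of_records);
        free(records);
        return 0;
    }

Right:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* One array per field; the slide uses these names without declaring them. */
    int *id;
    double *midterm, *final, *homework;

    /* init() fills the arrays; its definition is not shown on the slide. */
    void init(int number_of_records);

    int main(int argc, char **argv) {
        int i, j;
        double midterm_average = 0.0;
        int number_of_records = 10000000;
        struct timeval time_start, time_end;
        id = (int*)malloc(sizeof(int)*number_of_records);
        midterm = (double*)malloc(sizeof(double)*number_of_records);
        final = (double*)malloc(sizeof(double)*number_of_records);
        homework = (double*)malloc(sizeof(double)*number_of_records);
        init(number_of_records);
        /* Midterm values are contiguous, so every byte fetched is useful. */
        for (j = 0; j < 100; j++)
            for (i = 0; i < number_of_records; i++)
                midterm_average += midterm[i];
        free(id);
        free(midterm);
        free(final);
        free(homework);
        return 0;
    }

B. Right is faster: more row buffer hits in the DRAM, more SRAM hits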
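
A rough way to see why the packed layout gets more SRAM hits is to count how many midterm values arrive per cache line. The sketch below assumes 64-byte cache lines, which the slides do not state; `CACHE_LINE_BYTES` and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines (a common size, not given on the slide). */
#define CACHE_LINE_BYTES 64u

struct student_record {
    int id;
    double homework;
    double midterm;
    double final;
};

/* Useful values delivered by one cache line when consecutive
   values sit stride_bytes apart in memory. */
size_t useful_values_per_line(size_t stride_bytes) {
    if (stride_bytes >= CACHE_LINE_BYTES) return 1;
    return CACHE_LINE_BYTES / stride_bytes;
}

/* Array of structs: midterm values are one whole record apart. */
size_t aos_midterms_per_line(void) {
    return useful_values_per_line(sizeof(struct student_record));
}

/* Separate arrays: midterm values are packed back to back. */
size_t soa_midterms_per_line(void) {
    return useful_values_per_line(sizeof(double));
}
```

On a typical 64-bit platform the struct is padded to 32 bytes, so a line carries 2 midterm values in the struct layout and 8 in the packed layout. The exact struct size depends on the compiler and ABI.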

slide-10
SLIDE 10

Recap: Memory “hierarchy” in modern processor architectures

10

[Figure: memory hierarchy, fastest at the top and larger toward the bottom]

Level                                    Capacity          Latency       Volatility
Processor Core: Registers                32 or 64 words    < 1ns         Volatile
SRAM $ (L1 $, L2 $, L3 $)                KBs ~ MBs         a few ns      Volatile
DRAM                                     GBs               tens of ns    Volatile
Storage                                  TBs                             Non-Volatile

slide-11
SLIDE 11
  • A floating gate made of polycrystalline silicon traps electrons
  • The voltage level within the floating gate determines the value of the cell
  • The floating gates will wear out eventually

11

Recap: Flash memory

slide-12
SLIDE 12

Recap: Types of Flash Chips

12

Single-Level Cell (SLC)    2 voltage levels     1-bit
Multi-Level Cell (MLC)     4 voltage levels     2-bit
Triple-Level Cell (TLC)    8 voltage levels     3-bit
Quad-Level Cell (QLC)      16 voltage levels    4-bit
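
The pattern across the four cell types is that storing k bits needs 2^k distinguishable voltage levels. A minimal sketch, with hypothetical helper names:

```c
/* Voltage levels one cell must distinguish to store bits_per_cell bits: 2^bits. */
unsigned voltage_levels(unsigned bits_per_cell) {
    return 1u << bits_per_cell;
}

/* Cells needed to hold total_bits of data, rounding up. */
unsigned cells_needed(unsigned total_bits, unsigned bits_per_cell) {
    return (total_bits + bits_per_cell - 1u) / bits_per_cell;
}
```

For example, a 64-bit value occupies 64 SLC cells but only 32 MLC cells, at the price of packing 4 levels into the same voltage range.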

slide-13
SLIDE 13
  • Non-volatile memory — case study: flash memory
  • Sequential Datapath Components

13

Outline

slide-14
SLIDE 14

Programming in MLC

14

Multi-Level Cell (MLC): 4 voltage levels, 2-bit (levels 11, 10, 01, 00)

1st page: 3.1400000000000001243449787580 = 0x40091EB851EB851F
= 01000000 00001001 00011110 10111000 01010001 11101011 10000101 00011111

[Figure: cells placed among the levels 11, 10, 01, 00 during phase #1]

1 phase to finish programming the first page!
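
The page contents on this slide are simply the bit pattern of the double 3.14. The sketch below shows one way to obtain that pattern, assuming the platform uses 64-bit IEEE 754 doubles; `double_bits` and `page_bit` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64 raw bits.
   Assumption: double is a 64-bit IEEE 754 value on this platform. */
uint64_t double_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Bit i of the page, counting from the most significant bit (i = 0),
   which matches the left-to-right order printed on the slide. */
unsigned page_bit(uint64_t page, unsigned i) {
    return (unsigned)((page >> (63u - i)) & 1u);
}
```

With that assumption, `double_bits(3.14)` yields 0x40091EB851EB851F, and the first two page bits are 0 and 1, matching the leading `01000000` byte.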

slide-15
SLIDE 15

Programming the 2nd page in MLC

15

Multi-Level Cell (MLC): 4 voltage levels, 2-bit (levels 11, 10, 01, 00)

1st page: 3.1400000000000001243449787580 = 0x40091EB851EB851F
= 01000000 00001001 00011110 10111000 01010001 11101011 10000101 00011111

2nd page:
= 01000000 00001001 00011110 10111000 01010001 11101011 10000101 00011111

[Figure: cells placed among the levels 11, 10, 01, 00 during phase #1 (1st page) and then phase #2 (2nd page)]

2 phases to finish programming the second page!

slide-16
SLIDE 16

Optimizing 1st Page Programming in MLC

16

Multi-Level Cell (MLC): 4 voltage levels, 2-bit (levels 11, 10, 01, 00)

1st page: 3.1400000000000001243449787580 = 0x40091EB851EB851F
= 01000000 00001001 00011110 10111000 01010001 11101011 10000101 00011111

[Figure: cell voltage levels during phase #1]

1 phase to finish programming the first page, and the phase is shorter now

slide-17
SLIDE 17

2nd Page Programming in MLC

17

Multi-Level Cell (MLC): 4 voltage levels, 2-bit (levels 11, 10, 01, 00)

1st page: 3.1400000000000001243449787580 = 0x40091EB851EB851F
= 01000000 00001001 00011110 10111000 01010001 11101011 10000101 00011111

2nd page:
= 01000000 00001001 00011110 10111000 01010001 11101011 10000101 00011111

[Figure: cells placed among the levels 11, 10, 01, 00 during phase #1 (1st page) and then phase #2 (2nd page)]

2 phases to finish programming the second page!
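
How two pages end up sharing one row of MLC cells can be sketched as follows. This is an illustration under loud assumptions: the first page supplies the high bit of each cell and the second page the low bit, and levels are ordered 11, 10, 01, 00 as on the slide. Real devices may use a different bit-to-level mapping. All names are hypothetical.

```c
#include <stdint.h>

/* One cell per bit position of the 64-bit example pages. */
#define CELLS_PER_WORDLINE 64u

/* Assumption: first-page bit is the high bit, second-page bit is the low bit. */
unsigned mlc_cell_value(unsigned first_page_bit, unsigned second_page_bit) {
    return ((first_page_bit & 1u) << 1) | (second_page_bit & 1u);
}

/* Position of a 2-bit value in the slide's level order 11, 10, 01, 00. */
unsigned mlc_level_index(unsigned cell_value) {
    return 3u - (cell_value & 3u);
}

/* Combine two 64-bit pages into 64 cell values, most significant bit first. */
void mlc_program_pages(uint64_t first_page, uint64_t second_page,
                       unsigned char cells[CELLS_PER_WORDLINE]) {
    unsigned i;
    for (i = 0; i < CELLS_PER_WORDLINE; i++) {
        unsigned shift = 63u - i;
        unsigned b1 = (unsigned)((first_page >> shift) & 1u);
        unsigned b2 = (unsigned)((second_page >> shift) & 1u);
        cells[i] = (unsigned char)mlc_cell_value(b1, b2);
    }
}
```

Because both example pages hold the same bit pattern, every cell in this sketch ends at either 00 or 11. Two different pages would use all four levels.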

slide-18
SLIDE 18
  • Regarding the following flash memory characteristics, please identify how many of the following statements are correct

    ① Flash memory cells can only be programmed a limited number of times
    ② The read latency of flash memory cells can differ greatly from the programming latency
    ③ The latency of programming different flash memory pages can be different
    ④ A programmed cell cannot be reprogrammed unless its charge level is refilled to the top level

  • A. 0
  • B. 1
  • C. 2
  • D. 3
  • E. 4

18

Flash memory characteristics


slide-19
SLIDE 19

Program-erase cycles: SLC v.s. MLC v.s. TLC v.s. QLC

19

slide-20
SLIDE 20

Flash performance

20

  • Reads: less than 150us
  • Program/write: less than 2ms
  • Erase: less than 3.6ms

Laura M. Grupp, Adrian M. Caulfield, Joel Coburn, Steven Swanson, Eitan Yaakobi, Paul H. Siegel, and Jack K. Wolf. Characterizing flash memory: anomalies, observations, and applications. In MICRO 2009.

[Figure: three bar charts of read time (µs, axis 35 to 140), program time (µs, axis 500 to 2,000), and erase time (µs, axis 1000 to 4000) for the chips A-SLC2, A-SLC4, A-SLC8, B-SLC2 50nm, B-SLC4 72nm, E-SLC8, B-MLC8 72nm, B-MLC32 50nm, C-MLC64 43nm, D-MLC32, and E-MLC8, grouped into SLC and MLC]

Similar relative performance for reads, writes and erases

Not a good practice

slide-21
SLIDE 21
  • Regarding the following flash memory characteristics, please identify how many of the following statements are correct

    ① Flash memory cells can only be programmed a limited number of times
    ② The read latency of flash memory cells can differ greatly from the programming latency
    ③ The latency of programming different flash memory pages can be different
    ④ A programmed cell cannot be reprogrammed unless its charge level is refilled to the top level

  • A. 0
  • B. 1
  • C. 2
  • D. 3
  • E. 4

21

Flash memory characteristics

slide-22
SLIDE 22
  • Flash pages must be erased in “blocks”

22

Basic flash operations

[Figure: a flash array of blocks #0, #1, #2, ..., #n-2, #n-1, each holding pages 0, 1, 2, ..., n-1. Legend: free page, programmed page. Operations shown: program, read, erase.]
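
The rule that pages are programmed individually but erased only as a whole block can be modeled in a few lines. This is a toy sketch, not a device driver: `PAGES_PER_BLOCK` is an arbitrary small number chosen for illustration, and every name here is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical block size; real blocks hold many more pages. */
#define PAGES_PER_BLOCK 8u

enum page_state { PAGE_FREE, PAGE_PROGRAMMED };

struct flash_block {
    enum page_state pages[PAGES_PER_BLOCK];
    unsigned erase_count;   /* each erase consumes part of the block's lifetime */
};

/* Programming succeeds only on a free page. */
bool flash_program(struct flash_block *b, unsigned page) {
    if (page >= PAGES_PER_BLOCK || b->pages[page] != PAGE_FREE) {
        return false;
    }
    b->pages[page] = PAGE_PROGRAMMED;
    return true;
}

/* Erase works on the whole block: every page becomes free again. */
void flash_erase(struct flash_block *b) {
    unsigned i;
    for (i = 0; i < PAGES_PER_BLOCK; i++) {
        b->pages[i] = PAGE_FREE;
    }
    b->erase_count++;
}
```

In this model a second `flash_program` on the same page fails until `flash_erase` has been called, which also frees every other page in the block.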

slide-23
SLIDE 23
  • The bit is stored in the crystal structure of a tiny speck of metal
  • To write, melt the metal (650 °C) and let it cool quickly or slowly to set the value
  • Crystalline and amorphous states have different resistance

23

Phase change memory

slide-24
SLIDE 24
  • Bits stored as magnetic orientation of a thin film
  • Change the state using polarized electrons (!)
  • Depending on polarization, resistance differs
  • More complex cell structure
  • Great promise — potential DRAM replacement
  • Roughly the same speed, power, and bandwidth.
  • But it’s durable!

24

Spin-torque transfer

slide-25
SLIDE 25

Non-volatile memory technologies

25

             H.D.D          Flash                 Optane                STT-MRAM
Latency      ~ 10-15 ms     ~ 100 us (read)       7 us (read)           35 ns
                            ~ 1 ms (write)        18 us (write)
Bandwidth    ~200 MB/sec    3.5 GB/sec (read)     1.35 GB/sec (read)
                            2.1 GB/sec (write)    290 MB/sec (write)
Dollar/GB    0.0295         0.583                 2.18

Flash is still the most convincing technology for now

slide-26
SLIDE 26
  • Software designers should be aware of the characteristics of underlying hardware components

26

If programmer doesn’t know flash “features”

slide-27
SLIDE 27
  • Size:
      • 32-bit CLA with 4-bit CLAs: requires 8 4-bit CLAs
          • Each requires 116 for the CLA plus 4*(4*6+8) = 128 for the A+B: 244 gates
          • 8 * 244 = 1952 transistors
      • 32-bit CRA
          • 1600 transistors
  • Delay:
      • 32-bit CLA with 8 4-bit CLAs
          • 2 gates * 8 = 16
      • 32-bit CRA
          • 64 gates

27

CLA v.s. Carry-ripple

Win! Win! Area-Delay Trade-off!
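
The size and delay figures on this slide can be checked with a few lines of arithmetic. The constants below are copied from the slide; the function names are hypothetical, and the 2-gates-per-bit reading of the CRA delay is an inference from the 64 total.

```c
/* Cost of one 4-bit CLA: 116 for the lookahead logic plus 4*(4*6+8) for the A+B part. */
unsigned cla4_cost(void) {
    return 116u + 4u * (4u * 6u + 8u);
}

/* A 32-bit CLA built from eight 4-bit CLAs. */
unsigned cla32_cost(void) {
    return 8u * cla4_cost();
}

/* Figure quoted on the slide for the 32-bit carry-ripple adder. */
unsigned cra32_cost(void) {
    return 1600u;
}

/* Carry delay: 2 gates per 4-bit CLA stage, 8 stages. */
unsigned cla32_carry_chain_delay(void) {
    return 2u * 8u;
}

/* Carry delay: 2 gates per full adder, 32 full adders (assumed breakdown of the 64). */
unsigned cra32_carry_chain_delay(void) {
    return 2u * 32u;
}
```

The CLA costs about 22% more area (1952 versus 1600) and in exchange cuts the carry delay from 64 to 16 gate delays.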

slide-28
SLIDE 28

Serial Adder

28

slide-29
SLIDE 29

The basic idea

29

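[Figure: a Mealy machine with combinational blocks C1 and C2, input x(t), output y(t), state S(t), and Clk]

[Figure: the same structure used as a serial adder, with inputs ai and bi, output si, carries ci and ci+1, and Clk]

Feed ai and bi and generate si at time i. Where are ci and ci+1?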

slide-30
SLIDE 30

The basic idea

30

[Figure: a full adder with inputs ai, bi, ci and outputs si, ci+1, clocked by Clk]

slide-31
SLIDE 31

Excitation Table of Serial Adder

31

ai  bi  ci  |  ci+1  si
0   0   0   |  0     0
0   0   1   |  0     1
0   1   0   |  0     1
0   1   1   |  1     0
1   0   0   |  0     1
1   0   1   |  1     0
1   1   0   |  1     0
1   1   1   |  1     1

slide-32
SLIDE 32

Excitation Table of Serial Adder

32

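ai  bi  ci  |  ci+1  si
0   0   0   |  0     0
0   0   1   |  0     1
0   1   0   |  0     1
0   1   1   |  1     0
1   0   0   |  0     1
1   0   1   |  1     0
1   1   0   |  1     0
1   1   1   |  1     1

[Figure: the adder logic takes ai and bi and produces si; a D flip-flop (D, Q) holds the carry between clock cycles]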
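
The serial adder can be simulated in software to see how one full adder plus one flip-flop handles a 32-bit addition over 32 clock cycles. This is a behavioral sketch, assuming bits are fed least significant first and the carry flip-flop starts at 0; the type and function names are hypothetical.

```c
#include <stdint.h>

/* State of the serial adder: the single D flip-flop that holds the carry. */
struct serial_adder {
    unsigned carry;
};

/* One clock cycle: consume bits a and b, emit the sum bit,
   and latch the carry-out for the next cycle. */
unsigned serial_adder_clock(struct serial_adder *sa, unsigned a, unsigned b) {
    unsigned c = sa->carry;
    unsigned s = a ^ b ^ c;                        /* si   */
    sa->carry = (a & b) | (a & c) | (b & c);       /* ci+1 */
    return s;
}

/* Add two 32-bit values by clocking the adder 32 times, LSB first. */
uint32_t serial_add32(uint32_t a, uint32_t b) {
    struct serial_adder sa = { 0u };
    uint32_t sum = 0;
    unsigned i;
    for (i = 0; i < 32u; i++) {
        unsigned s = serial_adder_clock(&sa, (a >> i) & 1u, (b >> i) & 1u);
        sum |= (uint32_t)s << i;
    }
    return sum;
}
```

The final carry left in the flip-flop is dropped, so the result wraps around on overflow just as a 32-bit hardware adder without a carry-out would.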

slide-33
SLIDE 33
  • Consider the following adders:

    ① 32-bit CLA made with 8 4-bit CLA adders
    ② 32-bit CRA made with 32 full adders
    ③ 32-bit serial adders made with 4-bit CLA adders
    ④ 32-bit serial adders made with 1-bit full adders

  • A. Area: (1) > (2) > (3) > (4) Delay: (1) < (2) < (3) < (4)
  • B. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (2) < (4)
  • C. Area: (1) > (3) > (4) > (2) Delay: (1) < (3) < (4) < (2)
  • D. Area: (1) > (2) > (3) > (4) Delay: (1) < (3) < (2) < (4)
  • E. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (4) < (2)

33

Area/Delay of adders


slide-34
SLIDE 34
  • Consider the following adders:

    ① 32-bit CLA made with 8 4-bit CLA adders
    ② 32-bit CRA made with 32 full adders
    ③ 32-bit serial adders made with 4-bit CLA adders
    ④ 32-bit serial adders made with 1-bit full adders

  • A. Area: (1) > (2) > (3) > (4) Delay: (1) < (2) < (3) < (4)
  • B. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (2) < (4)
  • C. Area: (1) > (3) > (4) > (2) Delay: (1) < (3) < (4) < (2)
  • D. Area: (1) > (2) > (3) > (4) Delay: (1) < (3) < (2) < (4)
  • E. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (4) < (2)

34

Area/Delay of adders

① Each CLA has a 2-gate delay: 8*2+1 ~ 17
② Each carry has a 2-gate delay: 2*32 = 64
③ Each CLA: (3-gate delay + 2-gate delay) * 8 cycles: 5*8+1 = 41
④ Each full adder: (2-gate delay + 2-gate delay) * 32 cycles: 4*32 = 128
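
The four delay totals can be reproduced directly. The per-stage gate counts are the ones in the slide annotation; the function names are hypothetical, and treating the second 2-gate term of the serial designs as the per-cycle register overhead is an assumption.

```c
/* (1) 32-bit CLA: eight 4-bit CLA stages at 2 gate delays each, plus 1. */
unsigned delay_cla32(void) {
    return 8u * 2u + 1u;
}

/* (2) 32-bit CRA: 32 full adders at 2 gate delays per carry. */
unsigned delay_cra32(void) {
    return 32u * 2u;
}

/* (3) Serial adder with a 4-bit CLA: (3 + 2) gate delays per cycle, 8 cycles, plus 1. */
unsigned delay_serial_cla4(void) {
    return (3u + 2u) * 8u + 1u;
}

/* (4) Serial adder with a 1-bit full adder: (2 + 2) gate delays per cycle, 32 cycles. */
unsigned delay_serial_fa(void) {
    return (2u + 2u) * 32u;
}
```

Sorted by total delay this gives 17 < 41 < 64 < 128, that is (1) < (3) < (2) < (4).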

slide-35
SLIDE 35
  • Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. Please rank their maximum operating frequencies

    ① 32-bit CLA made with 8 4-bit CLA adders
    ② 32-bit CRA made with 32 full adders
    ③ 32-bit serial adders made with 4-bit CLA adders
    ④ 32-bit serial adders made with 1-bit full adders

  • A. (1) > (2) > (3) > (4)
  • B. (2) > (1) > (4) > (3)
  • C. (2) > (1) > (3) > (4)
  • D. (4) > (3) > (2) > (1)
  • E. (4) > (3) > (1) > (2)

35

Frequency


slide-36
SLIDE 36
  • Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. Please rank their maximum operating frequencies

    ① 32-bit CLA made with 8 4-bit CLA adders
    ② 32-bit CRA made with 32 full adders
    ③ 32-bit serial adders made with 4-bit CLA adders
    ④ 32-bit serial adders made with 1-bit full adders

  • A. (1) > (2) > (3) > (4)
  • B. (2) > (1) > (4) > (3)
  • C. (2) > (1) > (3) > (4)
  • D. (4) > (3) > (2) > (1)
  • E. (4) > (3) > (1) > (2)

36

Frequency

① 1 / 17ns = 58.8MHz
② 1 / 64ns = 15.6MHz
③ 1 / 5ns = 200MHz
④ 1 / 4ns = 250MHz
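
The frequencies follow from taking the reciprocal of each design's critical path. A minimal sketch, with a hypothetical helper name; the path lengths of 17, 64, 5, and 4 ns are the ones used on the slide.

```c
/* Maximum clock frequency in MHz for a critical path given in nanoseconds.
   1 / (t ns) = (1000 / t) MHz. */
double max_frequency_mhz(double critical_path_ns) {
    return 1000.0 / critical_path_ns;
}
```

The serial designs clock fastest because only one short adder stage plus the register sits between clock edges, even though they need many cycles to finish a 32-bit addition.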

slide-37
SLIDE 37
  • Assignment #4 due tonight — Chapter 4.8-4.9 & 5.2-5.4
  • Lab 5 is up — due this Thursday
  • Watch the video and read the instruction BEFORE your session
  • There are links on both course webpage and iLearn lab section
  • Submit through iLearn > Labs
  • Office Hours
  • All office hours share the same meeting instance. If you have registered once, you cannot do it again.
  • Zoom does not resend registration confirmation and does not allow us to “re-approve” if you have registered
  • The only way is to dig out the e-mail from Zoom
  • Last reading quiz due next Tuesday
  • Check your grades in iLearn

37

Announcement

slide-38
SLIDE 38

To be continued

Electrical Computer Engineering Science

120A