[PPT] - IEE5008 Autumn 2012 Memory Systems 3D Stacking SRAM PowerPoint Presentation

SLIDE 1

IEE5008 – Autumn 2012 Memory Systems 3D Stacking SRAM

Anwar, Hossameldin
Department of Electronics Engineering
National Chiao Tung University
Eng_hossam123@yahoo.com

Anwar,Hossameldin 2012

SLIDE 2

Outline

Introduction
3D Technology Process
Physical Characteristics of d2d Vias
Planar SRAM Components
Planar SRAM design Techniques
3D implementations of Banked SRAM Arrays
3D implementations of Multiported SRAM arrays
Bank and Array-Stacked 3D SRAM Benefits
Multiported 3D SRAM Benefits
Conclusion
References


SLIDE 3

Introduction

The semiconductor industry faces a number of challenges:
1. Poor scaling of RC delays.
2. Power consumption.
3. Manufacturing challenges.

3D integration has the potential to address these challenges.
3D integration can also benefit from advances in traditional planar processes, such as double-gate transistors, tri-gate transistors, FinFETs, strained silicon, and metal gates.

SLIDE 4

3D fabrication involves stacking two or more die connected by a high-density, low-latency interface.
The increased density and the ability to place and route in 3D provide new opportunities for microarchitecture design.

The dense die-to-die (d2d) vias enable 3D SRAM components that are partitioned at the level of individual wordlines or bitlines.
The benefits are:
1. Reduced wire length within the SRAM arrays, which provides simultaneous latency and energy reduction.
2. Reduced area footprint, which reduces the wiring required for global routing.

SLIDE 5

3D Technology Process

There are several proposed methods for 3D integration, such as:

Multilayer buried structures (MLBS)
Die bonding

SLIDE 6

Multilayer buried structures (MLBS)

Structure

Multiple device layers are sequentially fabricated in a stacked fashion.
Layer-to-layer connections are made with interlayer vias or with direct source-drain/drain-source contacts.
Local polysilicon wires are used for the connections.

Advantage

Vertical 3D vias can potentially scale down with feature size.

SLIDE 7

Die bonding

Structure

It uses conventional planar fabrication processes and metal vias to bond the planar die vertically.
The process deposits vias on the top metal layers of each of the two die and/or etches vias through the backside of the die, then aligns the two die and bonds them together.

SLIDE 8

There are many organizations for multiple die bonding:

Face-to-Face (F2F) bonding.
Face-to-Back (F2B) bonding.
Back-to-Back (B2B) bonding.


SLIDE 9

Physical Characteristics of d2d Vias

Thinning the die reduces the distance that a d2d via must cross to connect the two die.
A d2d via is much smaller than the planar interconnect.
This reduces both resistance and capacitance.
So, the signal propagation delay between the two die is reduced.
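The link between size and delay can be sketched with the usual distributed-RC approximation, in which the 50% delay of a wire is about 0.38 times its total resistance times its total capacitance. The function name and the per-micrometre values below are illustrative assumptions, not figures from the presentation.

```python
def wire_delay(r_per_um, c_per_um, length_um):
    """Approximate 50% delay of a distributed RC wire (0.38 * R_total * C_total)."""
    r_total = r_per_um * length_um   # total resistance grows with length
    c_total = c_per_um * length_um   # total capacitance grows with length
    return 0.38 * r_total * c_total


# Made-up unit values: both R and C scale with length,
# so the delay scales with the square of the length.
long_wire = wire_delay(1.0, 1.0, 10.0)
short_wire = wire_delay(1.0, 1.0, 5.0)
print(long_wire / short_wire)
```

Under this model, halving the length of a connection cuts its delay to roughly one quarter, which is why short d2d vias are attractive for die-to-die signals.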

SLIDE 10

Planar SRAM Components

Caches

Basic design parameters

Cache size.
Block size.
Associativity.

Features

Large capacity.
1. Caches are organized as banks to increase bandwidth and decrease power consumption.
2. Caches are subbanked to save power by sharing sense amplifier circuitry among subbanks.

Require both tag and data arrays.
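As a small illustration of how the three basic design parameters determine the tag and data array organization, the sketch below derives the number of sets and the address field widths. It assumes power-of-two sizes and a 32-bit address; the function name and the example cache are hypothetical.

```python
def cache_geometry(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (num_sets, offset_bits, index_bits, tag_bits) for power-of-two sizes."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return num_sets, offset_bits, index_bits, tag_bits


# Hypothetical 32 KiB, 4-way cache with 64-byte blocks.
print(cache_geometry(32 * 1024, 64, 4))   # (128, 6, 7, 19)
```

The tag width is what the tag array stores per block; the index selects the set in both the tag and data arrays.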

SLIDE 11

Register files

Features

Lower capacity.
Do not have a tag array.
Consist of a regular array of 6T SRAM cells.
Typically multiported, with multiple read ports and multiple write ports to satisfy the bandwidth required for data processing.

SLIDE 12

Planar SRAM array-based component features

Consist of a regular array of memory cells.
Easy to partition across multiple die.
An SRAM array can be viewed as a set of wordlines (horizontal) and a set of bitlines (vertical).
The row decoder drives the wordlines, which control the access transistors of the data storage cells.
The bitlines are read by sense amplifiers at the bottom of the array.

SLIDE 13

Planar SRAM design Techniques

These techniques are used to increase performance and reduce power consumption in SRAM arrays:

Memory Banking Technique
Memory Subbanking Technique
Hierarchical Wordline Technique

SLIDE 14

Memory Banking Technique

Power Saving

The memory array is divided into multiple modules (banks).
Only the bank that contains the required data is accessed.

Bandwidth Enhancement

If the requested data values are located in different banks, values can be read out of multiple banks simultaneously.
This mimics the effect of a multiported memory array.

But!

If multiple addresses target the same bank, a bank conflict occurs.
A buffering mechanism is then needed to store and reissue the requests, so that the target bank provides the requested data values in later clock cycles.
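The buffer-and-reissue behaviour can be sketched as a small scheduler. This is only a minimal model under stated assumptions: each bank serves one request per cycle, the bank is chosen as the address modulo the number of banks, and the function name is hypothetical.

```python
from collections import deque


def schedule(addresses, num_banks):
    """Return a list of cycles; each cycle lists the addresses served in it."""
    pending = deque(addresses)
    cycles = []
    while pending:
        busy = set()          # banks already used in this cycle
        served = []
        deferred = deque()    # buffer for conflicting requests
        while pending:
            addr = pending.popleft()
            bank = addr % num_banks      # assumed address-to-bank mapping
            if bank in busy:
                deferred.append(addr)    # bank conflict: reissue in a later cycle
            else:
                busy.add(bank)
                served.append(addr)
        cycles.append(served)
        pending = deferred
    return cycles


print(schedule([0, 1, 2, 3], 4))   # [[0, 1, 2, 3]]
print(schedule([0, 4, 1], 4))      # [[0, 1], [4]]
```

In the second call, addresses 0 and 4 map to the same bank, so address 4 is held in the buffer and served one cycle later.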

SLIDE 15

Example

Higher order interleaving technique
Divides the memory array into banks based on the higher order address bits.
If the array contains 2^N locations and is split into two banks,
one bank contains addresses from 0 to 2^(N-1) - 1,
and the other bank contains addresses from 2^(N-1) to 2^N - 1.

Lower order interleaving technique
Uses the lower order address bits to identify the banks (for two banks, odd and even addresses).
If the requested data is located in only one bank, there is no need to access the other banks, so they consume no dynamic power.
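The two interleaving schemes differ only in which address bits select the bank. A minimal sketch, with hypothetical function names and a two-bank default:

```python
def high_order_bank(addr, addr_bits, bank_bits=1):
    """Bank index taken from the top bank_bits of an addr_bits-wide address."""
    return addr >> (addr_bits - bank_bits)


def low_order_bank(addr, bank_bits=1):
    """Bank index taken from the bottom bank_bits of the address."""
    return addr & ((1 << bank_bits) - 1)


# N = 4, so the array holds 2^4 = 16 locations split into two banks.
for addr in (7, 8):
    print(addr, high_order_bank(addr, 4), low_order_bank(addr))
```

With N = 4, address 7 falls in bank 0 under higher order interleaving (addresses 0 to 7) but in the odd bank under lower order interleaving, while address 8 falls in bank 1 and the even bank respectively.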

SLIDE 16

Memory Subbanking Technique

Features

A cache block is divided into a number of subbanks.
The required word is chosen using the offset bits in the address.
The subbank selector selects between the two subbanks and feeds the data from only one subbank into the sense amplifier circuitry.
So, a common set of sense amplifiers can be shared across the subbanks.
Data are read out from only one subbank at a time, which cuts down on cache power.
Bitline precharge power is also saved, because only the selected subbank needs to be precharged.
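A minimal sketch of the subbank selection, assuming a 64-byte block split evenly into two subbanks so that the upper part of the block offset picks the subbank. The function name and the sizes are illustrative assumptions.

```python
def select_subbank(addr, block_bytes=64, num_subbanks=2):
    """Return (subbank index, byte position inside that subbank)."""
    offset = addr % block_bytes                  # offset bits of the address
    subbank_bytes = block_bytes // num_subbanks
    return offset // subbank_bytes, offset % subbank_bytes


print(select_subbank(0x47))   # (0, 7)
print(select_subbank(100))    # (1, 4)
```

Only the subbank named by the first element of the result would be precharged and connected to the shared sense amplifiers.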

SLIDE 17

Hierarchical Wordline Technique (HWL)

Problems

Wordlines are heavily loaded by the access transistors (two per SRAM cell) across the whole row of SRAM cells.
Wordlines contribute to the overall delay of an SRAM access.

HWL structure (Solution)

Uses global wordlines (GWL) to drive multiple shorter subwordlines.
The decoder output is used as the global wordline.
So, the wordline loading and the latency of driving the wordlines are reduced.

Disadvantage

It worsens the wire complexity of the wordlines: the wiring requirement of the wordlines is doubled.
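The loading argument can be made concrete by counting gate loads. This is a rough sketch under assumptions that are not in the slides: the row is split into equal segments, and each subwordline has exactly one local driver hanging off the global wordline.

```python
def wordline_loads(columns, segments):
    """Return (flat wordline load, subwordline load, global wordline load)."""
    flat = 2 * columns                   # two access transistors per cell in the row
    sub = 2 * (columns // segments)      # each subwordline sees only its own segment
    global_wl = segments                 # assumed: one local driver per subwordline
    return flat, sub, global_wl


# Hypothetical 128-column row split into 4 subwordlines.
print(wordline_loads(128, 4))   # (256, 64, 4)
```

In this example the decoder output drives 4 local drivers instead of 256 access transistors, at the cost of routing both a global wordline and the subwordlines.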

SLIDE 18

3D Implementations of Banked SRAM Arrays

One option for 3D-integrated SRAM array design is to stack banks on top of each other.
Another option is to split the arrays across multiple layers.
Long metal wires are used to route global signals in banked SRAM arrays.

3D Bank Stacking
3D Array Splitting

SLIDE 19

3D Bank Stacking

There are two possible orientations for bank stacking:

Left-to-Right Stacking
Top-to-Down Stacking

Notes

X is the bank width and Y is the bank height, assuming X = Y.
Bank stacking gives a 67% reduction in the horizontal component of the wiring to and from the banks.
The vertical component of the bank wiring is unaffected.
So, the reduction in wire length translates into a reduction in power and delay.

SLIDE 20

3D Array Splitting

Features

Individual rows and columns of the SRAM arrays within a bank are partitioned and stacked upon themselves.
This can reduce the length of either the wordlines or the bitlines, depending on the orientation of the split.

SLIDE 21

The First Array-split Configuration

Stacks columns on columns.
A single long wordline is replaced by a pair of parallel wordlines.
The decoder must drive the wordlines on both die, so it requires one d2d via per wordline.
At the bottom of the array, the column select multiplexors are split across the two die, so additional d2d vias are required.
Latency and power are reduced because the wordlines are shorter.

SLIDE 22

The Second Array-split Configuration

Stacks rows on rows.
The row decoder must be partitioned across the two die.
The 1-to-n decoder is decomposed into a 1-to-2 decoder and two 1-to-n/2 decoders.
The two 1-to-n/2 decoders are stacked on top of each other.
Only one of the stacked decoders is active at a time, as selected by the 1-to-2 decoder, which avoids stacking thermally active components.
So, the length of the bitlines is reduced by half.
Latency and power are reduced due to wire length reduction at both the array and bank levels.
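The decoder decomposition can be sketched behaviourally with one-hot lists. The function names are hypothetical, and the sketch assumes that the upper part of the row index selects the die and the remainder selects the wordline on that die.

```python
def decode(index, n):
    """One-hot output of a 1-to-n decoder, as a list of 0/1 values."""
    return [1 if i == index else 0 for i in range(n)]


def split_decode(row, n):
    """Wordline outputs per die for a 1-to-2 plus two 1-to-n/2 decomposition."""
    half = n // 2
    die_select = decode(row // half, 2)    # 1-to-2 decoder picks the die
    local_rows = decode(row % half, half)  # 1-to-n/2 decode within a die
    # A wordline is asserted only on the selected die.
    return [[bit & sel for bit in local_rows] for sel in die_select]


print(split_decode(5, 8))   # [[0, 0, 0, 0], [0, 1, 0, 0]]
```

Concatenating the two per-die lists gives the same one-hot pattern as a single 1-to-8 decoder, which is the property the decomposition has to preserve.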

SLIDE 23

3D Implementation of Multiported SRAM Arrays

There are several possible designs for multiported SRAM arrays in 3D integration technology:

Register Partitioning (RP)
Bit Partitioning (BP)
Port Splitting (PS)

SLIDE 24

Register Partitioning (RP)

For a 2-die 3D RP register file implementation (32 entries):
The 2-die 3D RP SRAM array splits off half of the entries and places them on the vertically stacked die.
As shown in the figure, the bottom die contains registers R0-R15 and the top die contains registers R16-R31.
The vertical distance (along the bitlines) is halved, which reduces the latency and power associated with toggling the bitlines.
The row decoder height is halved, which reduces the length of the critical path associated with accessing the farthest entry in the register file.
Also, the overall footprint of the register file is halved.

SLIDE 25

For a 4-die 3D RP register file implementation:
The register entries are partitioned such that one quarter of the entries resides on each die.
The row decoder is similar to that of the 2-die stack.
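The RP mapping from a register number to its die and row can be sketched as follows, assuming contiguous blocks of registers per die as in the R0-R15 / R16-R31 example. The function name is hypothetical.

```python
def rp_location(reg, num_regs=32, num_dies=2):
    """Return (die, row within die) for register partitioning."""
    per_die = num_regs // num_dies   # entries placed on each die
    return reg // per_die, reg % per_die


print(rp_location(15))          # (0, 15): bottom die in the 2-die stack
print(rp_location(16))          # (1, 0): top die in the 2-die stack
print(rp_location(31, 32, 4))   # (3, 7): 4-die stack, 8 entries per die
```

The bitline on each die only spans the rows of that die, which is where the halved (or quartered) vertical distance comes from.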

SLIDE 26

Bit Partitioning (BP)

Stacks the higher order and lower order bits of the same register across different die.
BP can be viewed as folding the register file in the horizontal direction, whereas RP folds it in the vertical direction.
As shown in the figure, the bottom die stores the LSBs and the top die stores the MSBs.
Alternatively, one could store the bits in odd positions on one die and the bits in even positions on the other die.
The BP register file reduces the wire length and gate loading of the wordlines.
So, it provides both latency and energy benefits.
The BP 3D register file requires the row decoder output to be fanned out to the different die.

SLIDE 27

Example

3D integer arithmetic unit partitioned by significance
One die computes X[0:31] + Y[0:31].
The other die computes X[32:63] + Y[32:63].
So, the BP register file should also be arranged by significance, to avoid unnecessary d2d routing between the outputs of the register file and the inputs of the arithmetic unit.
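A behavioural sketch of a 64-bit add partitioned by significance. The names are hypothetical, and the carry handling is an assumption made for illustration: each half of the operands stays on its own die, and only the carry out of the low half is modelled as crossing between die.

```python
MASK32 = (1 << 32) - 1


def split(value):
    """Split a 64-bit value into (low 32 bits, high 32 bits)."""
    return value & MASK32, (value >> 32) & MASK32


def bp_add(x, y):
    """64-bit add computed as two 32-bit halves, one per die."""
    x_lo, x_hi = split(x)
    y_lo, y_hi = split(y)
    lo_sum = x_lo + y_lo                      # low-order die: bits 0..31
    carry = lo_sum >> 32                      # assumed to be the only d2d signal
    hi_sum = (x_hi + y_hi + carry) & MASK32   # high-order die: bits 32..63
    return (hi_sum << 32) | (lo_sum & MASK32)


print(hex(bp_add(0xFFFFFFFF, 1)))   # 0x100000000
```

If the register file were split by odd and even bit positions instead, every operand bit pair would need its own alignment with the arithmetic unit, which is the extra d2d routing the slide warns about.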

SLIDE 28

Port Splitting (PS)

Individual SRAM cells in caches are designed to be very small to increase the capacity of the cache.
The area per bit of a register file is dominated by the wordlines and bitlines needed to implement multiple read and write ports.
The register file SRAM cells have a larger footprint (due to the high port count).
So, there is an opportunity to allocate one or two d2d vias per cell.

SLIDE 29

Feature

Stacking the wordlines on top of each other halves the height, and stacking the bitlines halves the width.
A 50% reduction in both dimensions leads to an overall footprint reduction of ~75% for the SRAM arrays.
This translates into latency and energy savings.
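The footprint figure follows from multiplying the two scale factors. A one-line check, with a hypothetical function name:

```python
def footprint_reduction(height_scale, width_scale):
    """Fractional area reduction when height and width are scaled independently."""
    return 1.0 - height_scale * width_scale


print(footprint_reduction(0.5, 0.5))   # 0.75
```

Half the height times half the width leaves one quarter of the original area, hence the ~75% reduction.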

SLIDE 30

Single d2d via per cell configuration

Only a single d2d via is used to route the data bit b to the second die.
On the upper die, an extra inverter recomputes the complement bit b¯.

What is the limitation of the single via configuration?

Ports on the top die may not be able to optimally support write operations, because there is no path to access the true b¯ storage node.

SLIDE 31

Two d2d via per cell configuration

Splits the back-to-back inverters across the two die.
Places all of the b bitlines on the bottom die.
Places all of the b¯ bitlines on the top die.

Disadvantages
1. Wordlines must be replicated across both die, which eliminates the wire length reduction in one dimension.
2. Splitting the differential bitlines across more than one die may require designing sense amplifiers that are themselves partitioned across more than one die.

SLIDE 32

Bank and Array-Stacked 3D SRAM benefits

Table 1 reports:

The size and latency of the baseline planar arrays.
The latency of the 3D array-split circuits.
The relative latency reduction.
The latency in terms of 3 GHz clock cycles.
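Converting an access latency to 3 GHz clock cycles is a matter of multiplying by the clock frequency and rounding. The sketch below assumes that partial cycles are rounded up; the function name and the sample latencies are hypothetical and are not values from Table 1.

```python
import math


def latency_cycles(latency_ns, clock_ghz=3.0):
    """Whole clock cycles needed to cover latency_ns at the given clock."""
    return math.ceil(latency_ns * clock_ghz)   # ns * GHz = cycles


# Hypothetical latencies, for illustration only.
print(latency_cycles(1.0))   # 3
print(latency_cycles(0.5))   # 2
```

Because of the rounding, a latency reduction only shows up in the cycle count when it crosses a cycle boundary.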

SLIDE 33


SLIDE 34

Table 2 reports:

The energy consumed per read access for both the planar and 3D configurations.
The overall savings range from 2% to 18%, varying due to differences in the optimal banking configurations, transistor sizings, and SRAM aspect ratios.

SLIDE 35


SLIDE 36


SLIDE 37


SLIDE 38

Multiported 3D SRAM Benefits

Effect of Increasing the Number of Entries
Effect of Increasing the Data Widths
Effect of Increasing Port Requirements
Energy Benefits of Multiported 3D SRAMs

SLIDE 39

Effect of Increasing the Number of Entries

Table 3 reports the planar latencies of each of the register files and the percent latency reduction for each of the 3D configurations considered.

SLIDE 40

Effect of Increasing the Data Widths


SLIDE 41

Effect of Increasing Port requirements


SLIDE 42

Energy Benefits of Multiported 3D SRAMs


SLIDE 43

Conclusion

3D technology can provide significant benefits in terms of both performance and energy consumption.
3D technology can be applied to the design of SRAM components to reduce the lengths of critical wires.
The reduction in wire length leads to substantial improvements in both the latency and energy characteristics of the 3D SRAM components.

SLIDE 44

References

Puttaswamy, K., "3D-Integrated SRAM Components for High-Performance Microprocessors," IEEE Transactions on Computers, Oct. 2009.
Pfitzner, A.; Kasprowicz, D.; Emling, R.; Fischer, T.; Henzler, S.; Maly, W.; Schmitt-Landsiedel, D., "Stacked 3-dimensional 6T SRAM cell with independent double gate transistors," IEEE International Conference on IC Design and Technology, May 2009.