
SLIDE 1

TRIPS: A Distributed Explicit Data Graph Execution (EDGE) Microprocessor

Madhu Saravana Sibi Govindan, Doug Burger, Steve Keckler, and the TRIPS Team

Hot Chips 19, August 2007
Computer Architecture and Technology Laboratory
Department of Computer Sciences
The University of Texas at Austin
www.cs.utexas.edu/users/trips


Recent Trends

Scaling challenges for conventional superscalar processors
Power and pipeline limits impede clock rate growth
Wire delays cause overheads for concurrency
Complexity of large monolithic architectures

Industry shift to multicore architectures
Work well for certain types of workloads
Big challenges: how to program/Amdahl’s law
Single-thread performance is still important

SLIDE 2


TRIPS – A Technology Scalable Architecture

Goals
High single-thread performance through ILP
Exploit concurrency at multiple granularities
Scalable with technology trends
Key technologies
Explicit data graph execution (EDGE) ISA
Distributed processor microarchitecture
Distributed non-uniform (NUCA) L2 cache
Tiled and networked design
Hot Chips 2005 - Chip in RTL design phase
Hot Chips 2007 - Chip/system complete
Manufacturing/bring-up complete
Performance tuning in progress
This talk focuses on chip and system implementation


Outline

Explicit Data Graph Execution (EDGE) ISAs
ISA support for distributed execution

TRIPS networked microarchitecture
Processor microarchitecture
Non-uniform cache memory system

TRIPS chip and system implementation
Custom ASIC chip and system boards
Preliminary performance results

SLIDE 3


Explicit Data Graph Execution (EDGE)

Two key features

Program broken into sequence of instruction blocks
Execution model: fetch, execute, and commit blocks atomically
Amortize overheads over many instructions

Within a block, instructions explicitly encode producer-consumer communication
Producer instructions explicitly target consumers and send results directly to them (without using a register file)
Instructions “fire” when all operands arrive
Any instruction may be predicated (conditionally executed)

TRIPS blocks - up to 128 instructions

Compile-time techniques to create large blocks
Average block sizes typically greater than 45 instructions
Long-term goal: hide hard-to-predict branches inside blocks


TRIPS Execution Model

[Figure: a program control flow graph (CFG) partitioned into TRIPS blocks, with a loop example built from basic blocks and an if-then-else example.]

if-then-else example
Source: if (p==0) z = a * 2 + 3; else z = b * 3 + 4;
TRIPS block for if-then-else: inputs a, b, p; instructions teqz, muli, addi, muli, addi, st(1)
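
The firing rule for the if-then-else block can be sketched in Python. This is a minimal illustration, not the TRIPS instruction encoding: the Inst class, run_block, the instruction names (muli_t, muli_f, and so on), and the choice to predicate the two muli instructions on the teqz result are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Inst:
    name: str
    fn: Callable                                            # computes the result from the data operands
    targets: List[str] = field(default_factory=list)        # consumers that receive the result as data
    pred_targets: List[str] = field(default_factory=list)   # consumers that receive it as a predicate
    predicate: Optional[bool] = None                        # required predicate value, if predicated
    n_operands: int = 1                                     # data operands needed before firing


def run_block(insts, inputs):
    """Deliver operands to their targets and fire each instruction once it is ready."""
    by_name = {i.name: i for i in insts}
    operands = {i.name: [] for i in insts}
    preds = {}
    results = {}
    pending = [(name, value, False) for name, value in inputs]
    while pending:
        name, value, is_pred = pending.pop()
        inst = by_name[name]
        if is_pred:
            preds[name] = value
        else:
            operands[name].append(value)
        if len(operands[name]) < inst.n_operands:
            continue  # still waiting for data operands
        if inst.predicate is not None:
            if name not in preds or preds[name] != inst.predicate:
                continue  # predicate missing or mismatched: do not fire
        results[name] = inst.fn(*operands[name])
        pending += [(t, results[name], False) for t in inst.targets]
        pending += [(t, results[name], True) for t in inst.pred_targets]
    return results


def if_then_else(a, b, p):
    block = [
        Inst("teqz", lambda x: x == 0, pred_targets=["muli_t", "muli_f"]),
        Inst("muli_t", lambda x: x * 2, targets=["addi_t"], predicate=True),
        Inst("addi_t", lambda x: x + 3, targets=["st"]),
        Inst("muli_f", lambda x: x * 3, targets=["addi_f"], predicate=False),
        Inst("addi_f", lambda x: x + 4, targets=["st"]),
        Inst("st", lambda z: z),
    ]
    # Block inputs go straight to the instructions that consume them.
    return run_block(block, [("teqz", p), ("muli_t", a), ("muli_f", b)])


print(if_then_else(5, 7, 0)["st"])  # then-path: 5 * 2 + 3
print(if_then_else(5, 7, 1)["st"])  # else-path: 7 * 3 + 4
```

Only one arm of the branch ever fires, so the store receives exactly one value and no register file is involved in passing intermediate results.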

SLIDE 4


TRIPS Execution Model

[Figure: timeline in cycles (axis marks at 5, 10, 30, 40) showing the FETCH, EXECUTE (variable execution time), and COMMIT phases of Block 0 through Block 8 (Bi through Bi+8) overlapped in time. Annotations: "In-flight blocks/instructions" and "Structural dependence".]

Fetch/Execute/Commit overlapped across multiple blocks
Can execute up to 8 blocks at a time via speculation
Exposes concurrency with a very large instruction window
Single-threaded mode with up to 1024 instructions
Simultaneous multi-threaded (SMT) mode with up to 4 threads and 256 instruction window per thread
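
The window sizes quoted above follow from the block size and the number of blocks in flight. The short Python check below makes the arithmetic explicit; the even split of block slots across SMT threads is an assumption inferred from the 256-instruction figure, not something the slides state.

```python
# Figures taken from the slides.
MAX_BLOCK_INSTRUCTIONS = 128   # "TRIPS blocks - up to 128 instructions"
MAX_BLOCKS_IN_FLIGHT = 8       # "up to 8 blocks at a time via speculation"
SMT_THREADS = 4


def window_instructions(blocks_in_flight: int) -> int:
    """Instruction window size for a given number of in-flight blocks."""
    return blocks_in_flight * MAX_BLOCK_INSTRUCTIONS


single_thread_window = window_instructions(MAX_BLOCKS_IN_FLIGHT)

# Assumption: SMT mode divides the in-flight block slots evenly among threads.
blocks_per_thread = MAX_BLOCKS_IN_FLIGHT // SMT_THREADS
smt_window_per_thread = window_instructions(blocks_per_thread)

print(single_thread_window, blocks_per_thread, smt_window_per_thread)
```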

TRIPS Prototype Chip

2 TRIPS Processors
NUCA L2 Cache
1 MB, 16 banks
On-Chip Network (OCN)
2D mesh network
Replaces on-chip bus
Controllers
2 DDR SDRAM controllers
2 DMA controllers
External Bus Controller (EBC)
Interfaces with PowerPC 440GP (control processor)

Chip-to-Chip (C2C) network controller
Clocking
2 PLLs
4 Clock domains
1x and 2x SDRAM
Main and C2C
Clock tree
Main domain has 4 quadrants to limit local skew

[Figure: chip floorplan with PROC 0, PROC 1, the NUCA L2 cache, and the OCN, plus the EBC, SDC, DMA, C2C, TEST, and PLLS blocks. Interface labels: 108 DDR SDRAM (two interfaces), 8x39 C2C Links, 44 EBI (External Bus Interface), IRQ (Interrupt Request), GPIO 16, CLK, JTAG.]

SLIDE 5


TRIPS Microarchitecture Principles

Distributed and tiled architecture

Small and simple tiles (register file, data cache bank, etc.)
Short local wires

Tiles are small: 2-5 mm² per tile is typical

No centralized resources

Networks connect the tiles

Networks implement distributed protocols (I-fetch, bypass, etc.)

Includes well-defined control and data networks

Networks connect only nearest neighbors
No global wires

Design modularity and scalability

Design productivity by replicating tiles (design reuse)
Networks extensible, even late in design cycle


TRIPS Tile-level Microarchitecture

TRIPS Tiles

G: Processor control - TLB w/ variable size pages, dispatch, next block predict, commit
R: Register file - 32 registers x 4 threads, register forwarding
I: Instruction cache - 16KB storage per tile
D: Data cache - 8KB per tile, 256-entry load/store queue, TLB
E: Execution unit - Int/FP ALUs, 64 reservation stations
M: Memory - 64KB, configurable as L2 cache or scratchpad
N: OCN network interface - router, translation tables
DMA: Direct memory access controller
SDC: DDR SDRAM controller
EBC: External bus controller - interface to external PowerPC
C2C: Chip-to-chip network controller - 4 links to XY neighbors

SLIDE 6


TRIPS Processor

EDGE ISA blocks mapped to array of tiles

Compile-time scheduler decides where (not when) instructions execute

TRIPS: aggressive processor capabilities

Up to 16 instructions per cycle
Up to 4 loads/stores per cycle
Up to 64 outstanding L1 data cache misses
Up to 1024 dynamically scheduled instructions
Up to 4 simultaneous multithreading (SMT) threads

Memory system

4 simultaneous L1 cache fills per processor
Up to 16 simultaneous L2 cache accesses
Up to 16 outstanding L2 cache misses
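
The per-tile figures and the processor-level capabilities can be cross-checked with a little arithmetic. The Python sketch below derives the implied number of execution tiles from the numbers on the slides; the variable names are illustrative, and the reading that each E tile issues one instruction per cycle is an assumption.

```python
# Figures taken from the slides.
WINDOW_INSTRUCTIONS = 1024            # dynamically scheduled instructions
RESERVATION_STATIONS_PER_E_TILE = 64
PEAK_INSTRUCTIONS_PER_CYCLE = 16
BLOCKS_IN_FLIGHT = 8
REGISTERS_PER_THREAD_PER_R_TILE = 32
SMT_THREADS = 4
L2_BANKS = 16
L2_BANK_KB = 64

# Implied number of E tiles if the reservation stations hold the whole window.
e_tiles = WINDOW_INSTRUCTIONS // RESERVATION_STATIONS_PER_E_TILE

# Assumption: one instruction issued per E tile per cycle gives the peak issue rate.
issue_matches = e_tiles == PEAK_INSTRUCTIONS_PER_CYCLE

# Reservation stations available to each in-flight block on one E tile.
stations_per_block_per_tile = RESERVATION_STATIONS_PER_E_TILE // BLOCKS_IN_FLIGHT

registers_per_r_tile = REGISTERS_PER_THREAD_PER_R_TILE * SMT_THREADS
l2_total_kb = L2_BANKS * L2_BANK_KB

print(e_tiles, issue_matches, stations_per_block_per_tile,
      registers_per_r_tile, l2_total_kb)
```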


Non-Uniform (NUCA) L2 Cache

[Figure: PROC 0 and PROC 1 connected to sixteen L2 cache banks, with request and reply paths marked.]

1MB L2 cache
Sixteen tiled 64KB banks
On-chip network
4x10 2D mesh topology
128-bit links, 366MHz (4.7GB/sec)
4 virtual channels prevent deadlocks
Requests and replies are wormhole-routed across the network
Up to 10 memory requests per cycle
Up to 128 bytes per cycle returned to the processors
Individual banks reconfigurable as scratchpad
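
Access latency in a NUCA cache is non-uniform because a request travels a different number of mesh hops depending on which bank it targets. The Python sketch below counts hops in a 4x10 mesh. It assumes minimal routing (for example dimension-order routing) and a 10-row by 4-column orientation; the slides state neither, and the node coordinates are purely illustrative.

```python
from itertools import product

# Assumed orientation of the 4x10 mesh: 10 rows by 4 columns.
MESH_ROWS, MESH_COLS = 10, 4


def hop_count(src, dst):
    """Hops between two mesh nodes under minimal routing (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


nodes = list(product(range(MESH_ROWS), range(MESH_COLS)))

# Distance from one corner node to every other node in the mesh.
corner = (0, 0)
hops_from_corner = {node: hop_count(corner, node) for node in nodes}

# Worst-case distance between any two nodes.
max_hops = max(hop_count(a, b) for a in nodes for b in nodes)

print(len(nodes), hops_from_corner[(0, 1)], max_hops)
```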

SLIDE 7


TRIPS Project Timeline

[Figure: timeline spanning 1999 to 2007 with the following phases.]
Technology trend analysis: “What are the problems to solve?”
New architecture concepts, invention, publishing: “What are the solutions?”
Prototype design: “Solve detailed challenges of architecture.”
Prototype implementation: “Is it feasible?”
Prototype testing: “Does it work; how well?”
Unveiling

UT-Austin team
12 graduate students + 1 engineer

RTL, verification, timing
IBM ASIC team
Physical design


TRIPS Chip Implementation

Process Technology: 130nm ASIC with 7 metal layers
Die Size: 18.3mm x 18.37mm (336 mm²)
Package: 47mm x 47mm BGA
Pin Count: 626 signals, 352 Vdd, 348 GND
# of placed cells: 6.1 million
# of routed nets: 6.5 million
Clock period: 2.7ns (actual), 4.5ns (worst case sim)
Total wire length: 1.06 km
Power (measured): 36W at 366MHz (chip has no power mgt.)
Transistor count (est.): 170 million
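
A few of the implementation figures can be cross-checked against each other. The Python sketch below only restates numbers from the slide; the derived quantities (frequency from clock period, total pin count, energy per cycle) are simple arithmetic added for illustration.

```python
# Figures taken from the slide.
CLOCK_PERIOD_NS = 2.7          # actual clock period
DIE_WIDTH_MM, DIE_HEIGHT_MM = 18.3, 18.37
SIGNAL_PINS, VDD_PINS, GND_PINS = 626, 352, 348
POWER_W = 36.0                 # measured at 366 MHz
CLOCK_MHZ = 366.0

# A 2.7 ns period corresponds to roughly 370 MHz, close to the 366 MHz operating point.
max_freq_mhz = 1_000.0 / CLOCK_PERIOD_NS

die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM
total_pins = SIGNAL_PINS + VDD_PINS + GND_PINS

# Average energy per clock cycle, in nanojoules (watts / MHz = microjoules per cycle).
energy_per_cycle_nj = POWER_W / CLOCK_MHZ * 1_000.0

print(round(max_freq_mhz, 1), round(die_area_mm2, 1), total_pins,
      round(energy_per_cycle_nj, 1))
```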

SLIDE 8


Die Photos

[Photos: die with C4 array; die without C4 array]


Benefits of Tiled Design

Design modularity
11 different tiles, instantiated a total of 106 times
Clean interfaces at tile boundaries

Verification - no hardware bugs
Tiles verified extensively before stitching together

Place and Route - hierarchical in nature
Wiring only between nearest neighbors
But - each physical instance was a little different

Timing - trivial at top level, communication planned
No global wires or timing paths

SLIDE 9


TRIPS Daughtercard

12V to daughtercard, stepped down to 1.5V, 2.5V, and 3.3V
12 GFlops peak
45 W total (worst case)

[Photo labels: Voltage Regulator; Heatsink/Fan (rated to 90 W); 2 x 1GB DIMMs]


TRIPS Motherboard

1 motherboard includes:
4 daughter-boards
4 TRIPS chips
8 GBytes DRAM
PowerPC 440GP control processor
I/O: Ethernet, serial, C2C links
FPGA I/O interface

Peak performance
48 GFlops at 366 MHz
180 Watts

SLIDE 10


TRIPS Multi-board System

Extensible to 8 boards (64 processors)
Micro-coax cables extend C2C network across boards
Peak performance
380 GFlops, 1.4 kW
Parallel message-passing software
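
The peak figures for the daughtercard, motherboard, and multi-board system scale linearly with the chip count. The Python sketch below reproduces that scaling from the per-chip numbers. Treating each of the 16 issue slots as one floating-point operation per cycle is an assumption used only to show where a figure near 12 GFlops could come from.

```python
# Figures taken from the slides.
CHIP_PEAK_GFLOPS = 12
CHIP_WORST_CASE_WATTS = 45
PROCESSORS_PER_CHIP = 2
CHIPS_PER_BOARD = 4
MAX_BOARDS = 8
ISSUE_WIDTH = 16       # instructions per cycle per processor
CLOCK_GHZ = 0.366

board_gflops = CHIP_PEAK_GFLOPS * CHIPS_PER_BOARD
board_watts = CHIP_WORST_CASE_WATTS * CHIPS_PER_BOARD

system_chips = CHIPS_PER_BOARD * MAX_BOARDS
system_processors = system_chips * PROCESSORS_PER_CHIP
system_gflops = board_gflops * MAX_BOARDS   # 384, quoted as 380 on the slide
system_watts = board_watts * MAX_BOARDS     # 1440 W, quoted as 1.4 kW

# Assumption: one floating-point operation per issue slot per cycle.
derived_chip_peak_gflops = PROCESSORS_PER_CHIP * ISSUE_WIDTH * CLOCK_GHZ

print(board_gflops, board_watts, system_processors, system_gflops,
      system_watts, round(derived_chip_peak_gflops, 2))
```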


TRIPS Prototype System Layers

[Figure: three system layers, with a host PC connected through an Ethernet switch to Board 0, Board 1, and Board 2.]

Host PC (x86 Linux)
TRIPS Resource Manager (TRM)
File system
Runtime services
Login/debug/etc.

PowerPC 440GP
Runs embedded Linux
Local Resource Manager (LRM) listens to Host PC
PowerPC EBI device driver to control TRIPS chips

TRIPS chips (EBC)
Run TRIPS apps
Interrupt PowerPC if necessary (system calls, exceptions)

PowerPC EBI connects to TRIPS EBC; IRQ lines run from the TRIPS chips to the PowerPC

SLIDE 11


Preliminary Performance (HW)

Processor    Clock Speed               Memory Speed     Process Technology
Pentium 3    450 MHz                   100 MHz SDRAM    250 nm
Pentium 4    3.6 GHz                   533 MHz DDR2     90 nm
Core 2       1.6 GHz (underclocked)    800 MHz DDR2     65 nm
TRIPS        366 MHz                   200 MHz DDR      130 nm

Challenges
Different technology and ISAs
Different processor-to-memory clock ratio
TRIPS compiler fine-tuning in progress

Cycle-to-cycle comparison on multiple HW platforms
TRIPS Performance counters
PAPI - Performance API on Linux systems for others

Applications
Compiled + hand-optimized
Mix of kernels and full algorithms
Compiled only
The Embedded Microprocessor Benchmark Consortium (EEMBC)
Versabench (MIT)
SPEC benchmarks in progress
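
The "different processor-to-memory clock ratio" challenge can be made concrete by dividing each platform's listed clock speed by its listed memory speed. The Python sketch below uses only the figures from the platform table and takes the memory speeds at face value; whether a DDR figure denotes a clock or a data rate is not stated on the slide.

```python
# (processor clock in MHz, memory speed in MHz), as listed in the platform table.
PLATFORMS = {
    "Pentium 3": (450, 100),
    "Pentium 4": (3600, 533),
    "Core 2": (1600, 800),
    "TRIPS": (366, 200),
}

# Processor cycles per memory cycle for each platform.
ratios = {name: cpu_mhz / mem_mhz for name, (cpu_mhz, mem_mhz) in PLATFORMS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A higher ratio means a memory access costs more processor cycles, which is one reason a cycle-for-cycle comparison across these platforms needs care.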


TRIPS vs. Conventional Processors: Kernels

[Figure: bar chart of speedup relative to Core 2 (cycles), axis from 0.00 to 8.00. Kernels, grouped as BIT, INT, FLOAT, and STREAM: 802.11a, 8b10b, a2time, rspeed, vadd, conv, matrix, autocor, ct, fmradio, plus the geometric mean. Series: TRIPS-Hand Optimized, TRIPS-Compiled, Pentium 3, Pentium 4, Core 2.]
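
The metric in these charts, speedup relative to Core 2 measured in cycles and summarized by a geometric mean, can be computed as sketched below in Python. The cycle counts and kernel names are hypothetical placeholders chosen to make the arithmetic easy to follow; they are not measurements from the talk.

```python
import math


def speedup_in_cycles(baseline_cycles: float, cycles: float) -> float:
    """Speedup over the baseline when both runs are measured in clock cycles."""
    return baseline_cycles / cycles


def geometric_mean(values) -> float:
    """Geometric mean, the usual way to average ratios such as speedups."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical cycle counts, for illustration only.
core2_cycles = {"kernel_a": 1000, "kernel_b": 4000}
trips_cycles = {"kernel_a": 500, "kernel_b": 500}

speedups = {
    name: speedup_in_cycles(core2_cycles[name], trips_cycles[name])
    for name in core2_cycles
}
mean_speedup = geometric_mean(speedups.values())

print(speedups, round(mean_speedup, 2))
```

Comparing cycle counts rather than wall-clock time factors out the clock-rate differences between the platforms, which were built in different process technologies.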

SLIDE 12


TRIPS vs. Conventional Processors: EEMBC and signal processing (compiled)

[Figure: bar chart of speedup relative to Core 2 (cycles), axis from 0.5 to 2.5. Benchmarks: aifftr, aiifft, basefp, cacheb, iirflt, matrix, pntrch, puwmod, rspeed, spf, bezier, text, autocor, fft, conv, ct, plus the geometric mean. Series: TRIPS-Compiled, Pentium 3, Pentium 4, Core 2.]

Graph shows representative subset of 33 benchmarks and geometric mean for all 33 benchmarks

Summary

TRIPS prototype demonstrates feasibility of:

Explicit Data Graph Execution (EDGE) ISAs
Distributed processor and memory microarchitectures
Scaled and tiled uniprocessors
Non-uniform cache architectures (NUCA)
Recompilation required, but no change to source code

Performance is promising

First generation TRIPS is 3.1x faster for hand-optimized code
Compiled-code reasonable and improving

Identified several opportunities for improvements

Enhanced operand networks
Dynamic aggregation of processor cores

TRIPS-like architectures are a potential “alternative” to multicore

SLIDE 13


Acknowledgements

TRIPS Hardware Team

Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu, Robert McDonald, Ramdas Nagarajan, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, Premkishore Shivakumar

TRIPS Software Team

Kathryn McKinley, Jim Burrill, Xia Chen, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Suriya Narayanan, Sadia Sharif, Aaron Smith, Bill Yoder

IBM Microelectronics Austin ASIC Group

TRIPS Sponsors

DARPA Polymorphous Computing Architectures
Air Force Research Laboratories
National Science Foundation
IBM, Intel, Sun Microsystems