
TRIPS: A Distributed Explicit Data Graph Execution (EDGE) Microprocessor

Madhu Saravana Sibi Govindan, Doug Burger, Steve Keckler, and the TRIPS Team

Hot Chips 19, August 2007
Computer Architecture and Technology Laboratory
Department of Computer Sciences
The University of Texas at Austin
www.cs.utexas.edu/users/trips


Recent Trends

Scaling challenges for conventional superscalar processors

  • Power and pipeline limits impede clock rate growth
  • Wire delays cause overheads for concurrency
  • Complexity of large monolithic architectures

Industry shift to multicore architectures

  • Work well for certain types of workloads
  • Big challenges: how to program/Amdahl’s law
  • Single-thread performance is still important


TRIPS – A Technology Scalable Architecture

  • Goals
      • High single-thread performance through ILP
      • Exploit concurrency at multiple granularities
      • Scalable with technology trends
  • Key technologies
      • Explicit data graph execution (EDGE) ISA
      • Distributed processor microarchitecture
      • Distributed non-uniform (NUCA) L2 cache
      • Tiled and networked design
  • Hot Chips 2005 - Chip in RTL design phase
  • Hot Chips 2007 - Chip/system complete
      • Manufacturing/bring-up complete
      • Performance tuning in progress
  • This talk focuses on chip and system implementation


Outline

Explicit Data Graph Execution (EDGE) ISAs

  • ISA support for distributed execution

TRIPS networked microarchitecture

  • Processor microarchitecture
  • Non-uniform cache memory system

TRIPS chip and system implementation

  • Custom ASIC chip and system boards
  • Preliminary performance results


Explicit Data Graph Execution (EDGE)

Two key features

  • Program broken into sequence of instruction blocks
      • Execution model: fetch, execute, and commit blocks atomically
      • Amortize overheads over many instructions
  • Within a block, instructions explicitly encode producer-consumer communication
      • Producer instructions explicitly target consumers and send results directly to them (without using a register file)
      • Instructions “fire” when all operands arrive
      • Any instruction may be predicated (conditionally executed)

TRIPS blocks - up to 128 instructions

  • Compile-time techniques to create large blocks
  • Average block sizes typically greater than 45 instructions
  • Long-term goal: hide hard-to-predict branches inside blocks


TRIPS Execution Model

[Figure: program control flow graph (CFG) of basic blocks, with groups of basic blocks mapped to TRIPS blocks; CFG loop example.]

if-then-else example:

if (p==0) z = a * 2 + 3; else z = b * 3 + 4;

[Figure: TRIPS block for the if-then-else. Inputs a, b, and p feed a dataflow graph of the instructions teqz, muli, addi, st(1), muli, addi.]
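The if-then-else block can be mimicked with a small dataflow interpreter. This is only a sketch of the firing rule described earlier (an instruction fires once all its operands, and its predicate if it has one, have arrived), not the TRIPS ISA encoding. The instruction names follow the slide, but the exact wiring (teqz sending its predicate to both muli instructions), the `run_block` function, and the token format are assumptions made for illustration. Block commit and output counting are not modeled.

```python
from collections import deque


def run_block(a, b, p):
    """Execute the if-then-else block as a dataflow graph; return (z, fired)."""
    # name: (function, data operand count, required predicate or None, targets)
    # The wiring below is an assumed reading of the slide's figure.
    insts = {
        "teqz":   (lambda x: x == 0, 1, None,  [("muli_a", "pred"), ("muli_b", "pred")]),
        "muli_a": (lambda x: x * 2,  1, True,  [("addi_a", 0)]),
        "addi_a": (lambda x: x + 3,  1, None,  [("st", 0)]),
        "muli_b": (lambda x: x * 3,  1, False, [("addi_b", 0)]),
        "addi_b": (lambda x: x + 4,  1, None,  [("st", 0)]),
        "st":     (lambda x: x,      1, None,  [("OUT", 0)]),
    }
    arrived = {name: {} for name in insts}  # operand slots filled so far
    fired = []                              # instructions that executed, in order
    z = None
    # Block inputs are delivered as initial operand tokens (dest, slot, value).
    tokens = deque([("teqz", 0, p), ("muli_a", 0, a), ("muli_b", 0, b)])
    while tokens:
        dest, slot, value = tokens.popleft()
        if dest == "OUT":                   # the store's value leaves the block
            z = value
            continue
        fn, n_data, need_pred, targets = insts[dest]
        slots = arrived[dest]
        slots[slot] = value
        if dest in fired or any(i not in slots for i in range(n_data)):
            continue                        # already fired, or data still missing
        if need_pred is not None:
            if "pred" not in slots:
                continue                    # still waiting for the predicate
            if slots["pred"] != need_pred:
                continue                    # predicate mismatch: never fires
        fired.append(dest)
        result = fn(*(slots[i] for i in range(n_data)))
        # The producer sends its result directly to its consumers.
        for target, target_slot in targets:
            tokens.append((target, target_slot, result))
    return z, fired


if __name__ == "__main__":
    print(run_block(5, 7, 0))
    print(run_block(5, 7, 1))
```

Only one arm of the if-then-else ever fires, so the store receives exactly one value, and no branch is needed inside the block.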


TRIPS Execution Model

[Figure: timeline in cycles (ticks at 5, 10, 30, 40) showing the FETCH, EXECUTE, and COMMIT phases of in-flight blocks Bi through Bi+8 (Block 0 through Block 8) overlapping. Execution time varies per block. The legend marks in-flight blocks/instructions and a structural dependence.]

  • Fetch/Execute/Commit overlapped across multiple blocks
  • Can execute up to 8 blocks at a time via speculation
  • Exposes concurrency with a very large instruction window
      • Single-threaded mode with up to 1024 instructions
      • Simultaneous multi-threaded (SMT) mode with up to 4 threads and a 256-instruction window per thread
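The window sizes quoted for this execution model follow from the block limits: 8 in-flight blocks of up to 128 instructions each. A minimal sketch of that arithmetic follows. It assumes the in-flight blocks are divided evenly among SMT threads, which the slides imply but do not state. The `window_size` helper is hypothetical.

```python
MAX_BLOCK_INSTS = 128      # from the slides: a TRIPS block holds up to 128 instructions
MAX_INFLIGHT_BLOCKS = 8    # from the slides: up to 8 blocks execute at a time


def window_size(threads=1):
    """Instruction window per thread, assuming blocks are split evenly (assumption)."""
    if threads < 1 or MAX_INFLIGHT_BLOCKS % threads != 0:
        raise ValueError("thread count must evenly divide the in-flight blocks")
    blocks_per_thread = MAX_INFLIGHT_BLOCKS // threads
    return blocks_per_thread * MAX_BLOCK_INSTS


if __name__ == "__main__":
    print(window_size(1))  # single-threaded mode
    print(window_size(4))  # SMT mode with 4 threads
```

Under this assumption, each thread in 4-way SMT mode owns two in-flight blocks.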

TRIPS Prototype Chip

  • 2 TRIPS Processors
  • NUCA L2 Cache
      • 1 MB, 16 banks
  • On-Chip Network (OCN)
      • 2D mesh network
      • Replaces on-chip bus
  • Controllers
      • 2 DDR SDRAM controllers
      • 2 DMA controllers
      • External Bus Controller (EBC)
          • Interfaces with PowerPC 440GP (control processor)
      • Chip-to-Chip (C2C) network controller
  • Clocking
      • 2 PLLs
      • 4 Clock domains
          • 1x and 2x SDRAM
          • Main and C2C
      • Clock tree
          • Main domain has 4 quadrants to limit local skew

[Figure: chip block diagram showing PROC 0, PROC 1, the OCN with the NUCA L2 cache, and the SDC, DMA, EBC, and C2C controllers. External interface labels: DDR SDRAM (108, twice), C2C Links (8x39), EBI (External Bus Interface, 44), IRQ (Interrupt Request), GPIO, CLK, JTAG, TEST, and PLLS. A width label of 16 appears between GPIO and CLK.]

TRIPS Microarchitecture Principles

Distributed and tiled architecture

  • Small and simple tiles (register file, data cache bank, etc.)
  • Short local wires
      • Tiles are small: 2-5 mm² per tile is typical
  • No centralized resources

Networks connect the tiles

  • Networks implement distributed protocols (I-fetch, bypass, etc.)
      • Includes well-defined control and data networks
  • Networks connect only nearest neighbors
  • No global wires

Design modularity and scalability

  • Design productivity by replicating tiles (design reuse)
  • Networks extensible, even late in design cycle


TRIPS Tile-level Microarchitecture

TRIPS Tiles

  • G: Processor control - TLB w/ variable size pages, dispatch, next block predict, commit
  • R: Register file - 32 registers x 4 threads, register forwarding
  • I: Instruction cache - 16KB storage per tile
  • D: Data cache - 8KB per tile, 256-entry load/store queue, TLB
  • E: Execution unit - Int/FP ALUs, 64 reservation stations
  • M: Memory - 64KB, configurable as L2 cache or scratchpad
  • N: OCN network interface - router, translation tables
  • DMA: Direct memory access controller
  • SDC: DDR SDRAM controller
  • EBC: External bus controller - interface to external PowerPC
  • C2C: Chip-to-chip network controller - 4 links to XY neighbors


TRIPS Processor

EDGE ISA blocks mapped to array of tiles

  • Compile-time scheduler decides where (not when) instructions execute

TRIPS: aggressive processor capabilities

  • Up to 16 instructions per cycle
  • Up to 4 loads/stores per cycle
  • Up to 64 outstanding L1 data cache misses
  • Up to 1024 dynamically scheduled instructions
  • Up to 4 simultaneous multithreading (SMT) threads

Memory system

  • 4 simultaneous L1 cache fills per processor
  • Up to 16 simultaneous L2 cache accesses
  • Up to 16 outstanding L2 cache misses

Non-Uniform (NUCA) L2 Cache

  • 1MB L2 cache
      • Sixteen tiled 64KB banks
  • On-chip network
      • 4x10 2D mesh topology
      • 128-bit links, 366MHz (4.7GB/sec)
      • 4 virtual channels prevent deadlocks
      • Requests and replies are wormhole-routed across the network
      • Up to 10 memory requests per cycle
      • Up to 128 bytes per cycle returned to the processors
  • Individual banks reconfigurable as scratchpad

[Figure: PROC 0 and PROC 1 attached to sixteen L2 banks, with request and reply paths across the network.]


TRIPS Project Timeline

[Figure: project timeline spanning 1999 to 2007, with the following phases.]

  • Technology trend analysis: “What are the problems to solve?”
  • New architecture concepts, invention, publishing: “What are the solutions?”
  • Prototype design: “Solve detailed challenges of architecture.”
  • Prototype implementation: “Is it feasible?”
  • Prototype testing: “Does it work; how well?”
  • Unveiling

  • UT-Austin team
      • 12 graduate students + 1 engineer
      • RTL, verification, timing
  • IBM ASIC team
      • Physical design

TRIPS Chip Implementation

Process Technology        130nm ASIC with 7 metal layers
Die Size                  18.3mm x 18.37mm (336 mm²)
Package                   47mm x 47mm BGA
Pin Count                 626 signals, 352 Vdd, 348 GND
# of placed cells         6.1 million
# of routed nets          6.5 million
Total wire length         1.06 km
Clock period              2.7ns (actual), 4.5ns (worst case sim)
Power (measured)          36W at 366MHz (chip has no power mgt.)
Transistor count (est.)   170 million

Die Photos

[Figure: two die photos, one with the C4 array and one without the C4 array.]

Benefits of Tiled Design

Design modularity

  • 11 different tiles, instantiated a total of 106 times
  • Clean interfaces at tile boundaries

Verification - no hardware bugs

  • Tiles verified extensively before stitching together

Place and Route - hierarchical in nature

  • Wiring only between nearest neighbors
  • But - each physical instance was a little different

Timing - trivial at top level, communication planned

  • No global wires or timing paths
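The figure of 11 tile types instantiated 106 times can be cross-checked with a quick tally. The shared-tile counts below come from earlier slides (16 L2 banks, 2 SDC, 2 DMA, 1 EBC, 1 C2C), and the N count is derived by assuming the 4x10 OCN mesh consists only of M and N tiles. The per-processor counts are not given in these slides. They are recalled from published descriptions of the prototype and should be read as an assumption.

```python
# Per-processor tile counts: NOT stated in the slides (assumption).
PER_PROCESSOR = {"G": 1, "R": 4, "I": 5, "D": 4, "E": 16}

OCN_NODES = 4 * 10          # from the slides: 4x10 2D mesh
M_TILES = 16                # from the slides: sixteen 64KB L2 banks

# Chip-level tiles; N assumes every non-M mesh node is a network interface tile.
SHARED = {"M": M_TILES, "N": OCN_NODES - M_TILES,
          "DMA": 2, "SDC": 2, "EBC": 1, "C2C": 1}


def tile_census(processors=2):
    """Return (number of distinct tile types, total tile instances)."""
    counts = {name: n * processors for name, n in PER_PROCESSOR.items()}
    counts.update(SHARED)
    return len(counts), sum(counts.values())


if __name__ == "__main__":
    print(tile_census())
```

With two processors, this tally reproduces the 11 types and 106 instances quoted on the slide.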


TRIPS Daughtercard

  • 12V to daughtercard, stepped down to 1.5V, 2.5V, and 3.3V
  • 12 GFlops peak
  • 45 W total (worst case)

[Figure: daughtercard photo labeling the voltage regulator, the heatsink/fan (rated to 90 W), and 2 x 1GB DIMMs.]


TRIPS Motherboard

1 motherboard includes:

  • 4 daughter-boards
  • 4 TRIPS chips
  • 8 GBytes DRAM
  • PowerPC 440GP control processor
  • I/O: ethernet, serial, C2C links
  • FPGA I/O interface

Peak performance

  • 48 GFlops at 366 MHz
  • 180 Watts

TRIPS Multi-board System

  • Extensible to 8 boards (64 processors)
  • Micro-coax cables extend C2C network across boards
  • Peak performance
      • 380 GFlops, 1.4KW
  • Parallel message-passing software
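The peak numbers on the daughtercard, motherboard, and multi-board slides are roughly consistent with one another, as a rough calculation shows. The slides do not say how peak GFlops is derived. The sketch below assumes every one of the 16 issue slots per processor completes one floating-point operation per cycle at 366 MHz. The helper names are hypothetical.

```python
CLOCK_HZ = 366e6          # from the slides
ISSUE_WIDTH = 16          # from the slides: up to 16 instructions per cycle
PROCS_PER_CHIP = 2        # from the slides
CHIPS_PER_BOARD = 4       # from the slides
MAX_BOARDS = 8            # from the slides
BOARD_WATTS = 180         # from the slides


def peak_gflops(chips):
    """Peak GFlops, ASSUMING one floating-point op per issue slot per cycle."""
    return chips * PROCS_PER_CHIP * ISSUE_WIDTH * CLOCK_HZ / 1e9


def system_totals(boards=MAX_BOARDS):
    """Return (processors, peak GFlops, watts) for a multi-board system."""
    chips = boards * CHIPS_PER_BOARD
    return chips * PROCS_PER_CHIP, peak_gflops(chips), boards * BOARD_WATTS


if __name__ == "__main__":
    print(peak_gflops(1))                # one chip; the slides quote 12 GFlops
    print(peak_gflops(CHIPS_PER_BOARD))  # one board; the slides quote 48 GFlops
    print(system_totals())               # the slides quote 64 processors, 380 GFlops, 1.4KW
```

Under this assumption the results come out slightly below the slide figures for GFlops, which suggests the slides round the per-chip value up to 12 before scaling.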


TRIPS Prototype System Layers

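[Figure: three-layer prototype system. A host PC connects through an Ethernet switch to the boards (Board 0, Board 1, Board 2 shown). On each board the PowerPC EBI connects to the EBC of each TRIPS chip, with IRQ lines back to the PowerPC.]

  • Host PC (x86 Linux)
      • TRIPS Resource Manager (TRM)
      • File system
      • Runtime services
      • Login/debug/etc.
  • PowerPC 440GP
      • Runs embedded Linux
      • Local Resource Manager (LRM) listens to HostPC
      • PowerPC EBI device driver to control TRIPS chips
  • TRIPS chips (PowerPC EBI to TRIPS EBC)
      • Runs TRIPS apps
      • Interrupts PowerPC if necessary (system calls, exceptions)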


Preliminary Performance (HW)

Processor    Clock Speed              Memory Speed     Process Technology
TRIPS        366 MHz                  200 MHz DDR      130 nm
Core 2       1.6 GHz (underclocked)   800 MHz DDR2     65 nm
Pentium 4    3.6 GHz                  533 MHz DDR2     90 nm
Pentium 3    450 MHz                  100 MHz SDRAM    250 nm

  • Challenges
      • Different technology and ISAs
      • Different processor-to-memory clock ratio
      • TRIPS compiler fine-tuning in progress
  • Cycle-to-cycle comparison on multiple HW platforms
      • TRIPS Performance counters
      • PAPI - Performance API on Linux systems for others
  • Applications
      • Compiled + hand-optimized
          • Mix of kernels and full algorithms
      • Compiled only
          • The Embedded Microprocessor Benchmark Consortium (EEMBC)
          • Versabench (MIT)
      • SPEC benchmarks in progress


TRIPS vs. Conventional Processors: Kernels

[Figure: bar chart of speedup relative to Core 2 (cycles), y-axis from 0.00 to 8.00. Benchmark categories: BIT, INT, FLOAT, STREAM. Benchmarks: 802.11a, 8b10b, a2time, rspeed, vadd, conv, matrix, autocor, ct, fmradio, plus the geometric mean. Series: TRIPS-Hand Optimized, TRIPS-Compiled, Pentium 3, Pentium 4, Core 2.]
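The charts report speedup relative to Core 2 measured in cycles, summarized by a geometric mean. A minimal sketch of that metric follows. The function names are hypothetical and the cycle counts in the usage example are made-up placeholders, not measurements from the talk.

```python
import math


def speedup(baseline_cycles, cycles):
    """Cycle-count speedup over the baseline (here Core 2); above 1.0 means fewer cycles."""
    return baseline_cycles / cycles


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios such as speedups."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Placeholder cycle counts (baseline, candidate); NOT data from the slides.
    runs = {"kernel_a": (4000, 1000), "kernel_b": (1000, 1000)}
    ratios = [speedup(base, cand) for base, cand in runs.values()]
    print(geometric_mean(ratios))
```

Comparing cycle counts rather than wall-clock time factors out the large clock-rate differences between the platforms. It does not remove the differences in processor-to-memory clock ratio listed among the challenges.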


TRIPS vs. Conventional Processors: EEMBC and signal processing (compiled)

[Figure: bar chart of speedup relative to Core 2 (cycles), y-axis from 0.5 to 2.5. Benchmarks: aifftr, aiifft, basefp, cacheb, iirflt, matrix, pntrch, puwmod, rspeed, ospf, bezier, text, autocor, fft, conv, ct, plus the geometric mean. Series: TRIPS-Compiled, Pentium 3, Pentium 4, Core 2.]

Graph shows representative subset of 33 benchmarks and geometric mean for all 33 benchmarks.

Summary

TRIPS prototype demonstrates feasibility of:

  • Explicit Data Graph Execution (EDGE) ISAs
  • Distributed processor and memory microarchitectures
  • Scaled and tiled uniprocessors
  • Non-uniform cache architectures (NUCA)
  • Recompilation required, but no change to source code

Performance is promising

  • First generation TRIPS is 3.1x faster for hand-optimized code
  • Compiled-code reasonable and improving

Identified several opportunities for improvements

  • Enhanced operand networks
  • Dynamic aggregation of processor cores

TRIPS-like architectures are a potential “alternative” to multicore


Acknowledgements

TRIPS Hardware Team

  • Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu, Robert McDonald, Ramdas Nagarajan, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, Premkishore Shivakumar

TRIPS Software Team

  • Kathryn McKinley, Jim Burrill, Xia Chen, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Suriya Narayanan, Sadia Sharif, Aaron Smith, Bill Yoder

IBM Microelectronics Austin ASIC Group

TRIPS Sponsors

  • DARPA Polymorphous Computing Architectures
  • Air Force Research Laboratories
  • National Science Foundation
  • IBM, Intel, Sun Microsystems