SLIDE 1

Litmus Testing at Rack Scale

SLIDE 2

David Cock | 19. September 2016

We're Going to Build a Large Program Collider

Collide instructions at 0.99c and observe the decay products.

Images: CERN; Chaix & Morel et associés

SLIDE 3

Programmers Once (Thought They) Understood Computer Architecture

Image: Computer Systems, A Programmer's Perspective, Bryant & O'Hallaron, 2011

SLIDE 4

Symmetric Multiprocessors Were Fairly Simple

[Diagram: two write buffers (WB) and two caches in front of a shared RAM.]

SLIDE 5

Concurrent Code Makes Architecture Visible

Consider message passing.

Pretty much the simplest thing you can do with shared memory.

Systems like Barrelfish rely on it.

When are barriers required?

You can't write good code without understanding the hardware well enough.

We're combining components in new ways.

SLIDE 6

Message Passing with Shared Memory

[Diagram: CPU, RAM, CPU]
Writer CPU: Write: *x = 42, Write: *y = 1
RAM: *x = 0, then *x = 42; *y = 0, then *y = 1
Reader CPU: Read: *x = 42, Read: *y = 1

SLIDE 7

Message Passing with a Write Buffer

[Diagram: CPU with write buffer (WB), RAM, CPU]
Writer CPU: Write: *x = 42, Write: *y = 1
WB: *y = 1
RAM: *x = 0, then *x = 42; *y = 0, then *y = 1
Reader CPU: Read: *x = 0, Read: *y = 1

SLIDE 8

Message Passing with a Barrier

[Diagram: CPU with write buffer (WB), RAM, CPU]
Writer CPU: Write: *x = 42, Write: *y = 1
WB: *y = 1, *x = 42
RAM: *x = 0, then *x = 42; *y = 0, then *y = 1
Reader CPU: Read: *x = 42, Read: *y = 1
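
The message-passing pattern from the last three slides can be written down in a few lines. This is a minimal C11 sketch, not code from the talk: `mp_send` and `mp_try_recv` are hypothetical names, and the release/acquire pair stands in for the barrier shown on the slide.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared locations from the slides: x is the payload, y is the flag. */
static int x = 0;
static atomic_int y = 0;

/* Writer: publish the payload, then raise the flag.
 * The release store keeps the write to x ordered before the write to y,
 * which is the job the barrier does on the slide. */
void mp_send(int value)
{
    x = value;
    atomic_store_explicit(&y, 1, memory_order_release);
}

/* Reader: returns true and fills *out only once the flag is visible.
 * The acquire load keeps the read of x ordered after the read of y. */
bool mp_try_recv(int *out)
{
    if (atomic_load_explicit(&y, memory_order_acquire) == 0) {
        return false;
    }
    *out = x;
    return true;
}
```

With both orderings relaxed, the stale read from the write-buffer slide (flag set, payload still 0) would be permitted by the language and by weakly ordered hardware.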

SLIDE 9

Of Course, CPUs Aren't That Simple

[Diagram: four CPUs, each with a write buffer (WB) and an L1 cache; two L2 caches; an L3 cache; RAM; a coherent interconnect; PCI. Annotation: 9 hops.]

SLIDE 10

You Can't Trust the Hardware

seL4 was verified modulo a hardware model.

The Cortex-A8 has bugs:

Cache flushes don't work.

As of today, these “errata” are still not public.

We rediscovered these by accident.

Non-coherent memory is coming.

Source: Chip Errata for the i.MX51, Freescale Semiconductor

SLIDE 11

And Then There's Rack Scale...

[Diagram: eight nodes, each with four CPUs (each with WB and L1), two L2 caches, two L3 caches, RAM, a coherent interconnect, PCI and a NIC. The NICs connect through two TOR switches to the backhaul.]

SLIDE 12

There's a Lot of Data Available

[Diagram: the same eight-node rack (CPUs, caches, RAM, coherent interconnect, PCI, NICs, two TOR switches, backhaul), annotated with the available data sources:]

  • Program trace
  • Cache dumps
  • Port mirroring
  • OpenFlow
  • Event triggers

SLIDE 13

ARM High-Speed Serial Trace Port

Streams from the Embedded Trace Macrocell.

Cycle-accurate control flow + events @ 6GiB/s+

Compatible with FPGA PHYs.

Well-documented protocol.

Available on ARMv8

Image: Teledyne Lecroy

SLIDE 14

The HSSTP Hardware

The official tool is CHF10,000 per core.

The maximum cable run is 15 cm.

It's PHY-compatible with common FPGAs.

A CHF6k FPGA could easily handle 10 cores, about 15x cheaper!

We're working with the D-ITET DZ on an interface board.

If you like soldering, let us know!

SLIDE 15

Fancy Triggering and Filtering

The ETM has sophisticated filtering, e.g. the Sequencer.

Bn and Fn can be just about any events on the SoC.

States can enable/disable trace, or log events.

A powerful facility for pre-filtering.

[Diagram: a four-state sequencer, State 0 to State 3, with forward transitions F0, F1, F2 and backward transitions B0, B1, B2.]
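
To make the sequencer diagram concrete, here is a small software model of it. This is a sketch under assumptions, not the ETM register interface: the type `sequencer` and the functions `seq_step` and `seq_trace_enabled` are hypothetical, Fn is assumed to step from state n to n+1, Bn from n+1 back to n, and a forward event is assumed to win if both fire at once.

```c
#include <stdint.h>

/* Hypothetical software model of the four-state sequencer on the slide.
 * Assumption: Fn steps forward from state n to n+1,
 * Bn steps back from state n+1 to n. */
typedef struct {
    unsigned state;   /* current state, 0..3 */
} sequencer;

/* Events arrive as bitmasks: bit n set means Fn (or Bn) fired.
 * Returns the state after the step. */
unsigned seq_step(sequencer *s, uint8_t fwd, uint8_t back)
{
    if (s->state < 3 && (fwd & (1u << s->state))) {
        s->state++;   /* F<state> fired: advance */
    } else if (s->state > 0 && (back & (1u << (s->state - 1)))) {
        s->state--;   /* B<state-1> fired: retreat */
    }
    return s->state;
}

/* Example policy: only emit trace while the sequencer sits in State 3. */
int seq_trace_enabled(const sequencer *s)
{
    return s->state == 3;
}
```

Tying trace enable to the final state means trace is emitted only after a chosen chain of three events, which is the kind of pre-filtering the slide describes.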

SLIDE 16

Filtering and Offload in an FPGA

We'll need to intelligently filter high-rate data.

We're using an FPGA for the physical interface already.

How much processing could we do?

The group has expertise in FPGA query offloading.

Zsolt and I are writing a joint Master's project proposal on this.

SLIDE 17

Hardware Tracing for Correctness

unmap(pa); cleanDCache(); flushTLB();

Are HW operations right?

5Gb/s → Filter at line rate → Check temporal assertions → Log & process offline

  • Real time pipeline trace on ARM.
  • Can halt and inspect caches.
  • HW has “errata” (bugs).
  • Check that it actually works!
  • Catch transient and race bugs.
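
One temporal assertion for the `unmap(pa); cleanDCache(); flushTLB();` sequence could be checked as below. This is a sketch under assumptions: the event types and `check_unmap` are hypothetical, real trace packets carry different information, and the property chosen here (no TLB hit on `pa` once all three operations have been seen) is only one plausible reading of "are HW operations right?".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-decoded trace events. */
typedef enum { EV_UNMAP, EV_CLEAN_DCACHE, EV_FLUSH_TLB, EV_TLB_HIT } ev_kind;

typedef struct {
    ev_kind   kind;
    uintptr_t pa;   /* only meaningful for EV_UNMAP and EV_TLB_HIT */
} trace_event;

/* Returns the index of the first violating event, or -1 if the trace is clean.
 * Property: once unmap(pa), cleanDCache() and flushTLB() have all been seen,
 * no later TLB hit on pa may occur. Tracks one unmapped address at a time. */
long check_unmap(const trace_event *ev, size_t n)
{
    bool unmapped = false, cleaned = false, flushed = false;
    uintptr_t pa = 0;

    for (size_t i = 0; i < n; i++) {
        switch (ev[i].kind) {
        case EV_UNMAP:
            unmapped = true;
            cleaned = false;
            flushed = false;
            pa = ev[i].pa;
            break;
        case EV_CLEAN_DCACHE:
            if (unmapped) cleaned = true;
            break;
        case EV_FLUSH_TLB:
            if (unmapped) flushed = true;
            break;
        case EV_TLB_HIT:
            if (unmapped && cleaned && flushed && ev[i].pa == pa)
                return (long)i;   /* stale translation survived the flush */
            break;
        }
    }
    return -1;
}
```
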

SLIDE 18

Hardware Tracing for Performance

5Gb/s → Filter at line rate → Log & process offline

Sender: URPC[0] = x; URPC[1] = 1;
Receiver: while (!URPC[1]); x = URPC[0];

[Diagram: Core 0 with Cache 0 and Core 1 with Cache 1, exchanging coherency messages: INVAL(0), READ(1), …]

Is URPC optimal?

  • Should see N coherency messages.
  • Do we?

‐ The HW knows!
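
The "should see N, do we?" question amounts to counting messages in a trace. The sketch below assumes a hypothetical decoded message record (`coh_msg`) and leaves the expected count as a parameter, since the slide does not give a value for N.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded coherency message. */
typedef enum { MSG_INVAL, MSG_READ } coh_kind;

typedef struct {
    coh_kind  kind;
    unsigned  core;   /* core that issued the message */
    uintptr_t line;   /* cache-line address */
} coh_msg;

/* Count the messages in the trace that touch the given cache line. */
size_t count_line_msgs(const coh_msg *m, size_t n, uintptr_t line)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (m[i].line == line) count++;
    }
    return count;
}

/* Messages observed beyond the expected number;
 * 0 means the transfer cost no more than predicted. */
size_t excess_msgs(const coh_msg *m, size_t n, uintptr_t line, size_t expected)
{
    size_t seen = count_line_msgs(m, n, line);
    return seen > expected ? seen - expected : 0;
}
```
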

SLIDE 19

Properties to Check: Security

Runtime verification is an established field.

Lots of existing work to build on.

What properties could we check efficiently?

How could we map them to the filtering pipeline?

/* A very simple TESLA assertion. */
TESLA_WITHIN(example_syscall,
    previously(security_check(ANY(ptr), o, op) == 0));

http://www.cl.cam.ac.uk/research/security/ctsrd/tesla/
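
To suggest how a "previously, within this call" property might map onto a trace filter, here is a hand-written checker in plain C. It is not TESLA and does not use its API: the event names and `check_previously` are hypothetical, and the trace is assumed to be decoded into events already.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical decoded events for one thread. */
typedef enum {
    EV_SYSCALL_ENTER, EV_SECURITY_CHECK, EV_ASSERT_SITE, EV_SYSCALL_EXIT
} sec_kind;

typedef struct {
    sec_kind kind;
    int      ret;   /* return value; only used for EV_SECURITY_CHECK */
} sec_event;

/* True if every assertion site inside a syscall was preceded, within that
 * same syscall, by a security check that returned 0. */
bool check_previously(const sec_event *ev, size_t n)
{
    bool in_syscall = false, checked = false;

    for (size_t i = 0; i < n; i++) {
        switch (ev[i].kind) {
        case EV_SYSCALL_ENTER:
            in_syscall = true;
            checked = false;   /* earlier checks do not carry over */
            break;
        case EV_SECURITY_CHECK:
            if (in_syscall && ev[i].ret == 0) checked = true;
            break;
        case EV_ASSERT_SITE:
            if (in_syscall && !checked) return false;
            break;
        case EV_SYSCALL_EXIT:
            in_syscall = false;
            break;
        }
    }
    return true;
}
```

The checker needs only two bits of state per thread, which is the kind of property that might fit a line-rate filter.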

SLIDE 20

Processing Engine

That's a lot of data. How can we process it?

This is what rack-scale systems are for!

Andrei is starting on this as his Master's project.

SLIDE 21

Properties to Check: Memory Management

Could we check this?

We don't have data values (a & b).

We can play clever tricks with the hardware!

Shows what we could do with data trace.

void *a = malloc(); ... free(b); {a = b}

PROCID = b[15:0];   →  CID: b[15:0] ++ ASID
PROCID = b[31:16];  →  CID: b[31:16] ++ ASID
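
The trick on this slide, smuggling a pointer value out through the context ID, can be sketched as a pair of pack/unpack helpers. The layout is an assumption (a 16-bit pointer slice placed directly above an 8-bit ASID), the function names are hypothetical, and the actual register write is omitted.

```c
#include <stdint.h>

#define ASID_BITS 8u   /* assumption: 8-bit ASID in the low bits of the context ID */

/* Build one context-ID value: a 16-bit slice of the pointer, then the ASID. */
uint32_t cid_pack(uint16_t half, uint8_t asid)
{
    return ((uint32_t)half << ASID_BITS) | asid;
}

/* Recover the 16-bit pointer slice from a traced context-ID value. */
uint16_t cid_half(uint32_t cid)
{
    return (uint16_t)(cid >> ASID_BITS);
}

/* Reassemble b[31:0] from the two traced context IDs (low half first). */
uint32_t cid_join(uint32_t cid_lo, uint32_t cid_hi)
{
    return ((uint32_t)cid_half(cid_hi) << 16) | cid_half(cid_lo);
}
```

An offline checker could then pair each value reassembled at `free` with the value recorded at the matching `malloc`.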

SLIDE 22

A Streaming Verification Engine

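[Diagram: a four-stage pipeline with the column headings Sources, Capture, Processing, Properties. Labels in the diagram: ETM Sequencer, HSSTP, FPGA Capture, Packet Capture, Dataflow Engine, FPGA Offload, TESLA, malloc() pairing, Coherence correctness. Arrows are labelled Constraints and Requirements.]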

SLIDE 23

Offloading Example: LTL to Büchi

  • LTL(-ish) formula: A store on core 1 is eventually visible on core 2.
  • Think regular expressions for infinite streams.
  • As for REs, we compile a checking automaton.
  • Run the automaton in real time and look for violations.
  • FPGAs are good at state machines.
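
A checking automaton for the store-visibility formula might look like the sketch below. Two things are assumptions rather than content from the talk: the automaton is written by hand instead of being compiled from LTL, and "eventually" is replaced by a finite bound (counted in events), because an unbounded "eventually" can never be refuted on a finite trace prefix. All names are hypothetical, and only one outstanding store is tracked.

```c
#include <stddef.h>

/* Hypothetical observations extracted from the trace. */
typedef enum { OBS_STORE_CORE1, OBS_VISIBLE_CORE2, OBS_OTHER } obs;

typedef enum { MON_IDLE, MON_PENDING, MON_VIOLATION } mon_state;

typedef struct {
    mon_state state;
    size_t    waited;   /* events seen since the outstanding store */
    size_t    bound;    /* assumption: finite deadline, in events */
} monitor;

/* Advance the automaton by one observation and return the new state. */
mon_state mon_step(monitor *m, obs o)
{
    switch (m->state) {
    case MON_IDLE:
        if (o == OBS_STORE_CORE1) {
            m->state = MON_PENDING;
            m->waited = 0;
        }
        break;
    case MON_PENDING:
        if (o == OBS_VISIBLE_CORE2) {
            m->state = MON_IDLE;        /* obligation discharged */
        } else if (++m->waited > m->bound) {
            m->state = MON_VIOLATION;   /* deadline missed */
        }
        break;
    case MON_VIOLATION:
        break;                          /* violations are sticky */
    }
    return m->state;
}
```

The whole monitor is three states and one counter, small enough to suggest why a state machine of this kind suits an FPGA.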

SLIDE 24

An Instrumented Rack-Scale System

  • 64 SoCs x 5Gb/s = 320Gb/s trace output.
  • Online checkers (e.g. automata) will be essential at this scale.
  • We're going to build this:

– A rack of ARMv8 cores & FPGAs.

  • We're starting a fortnightly reading group to get up to speed on the Runtime Monitoring literature; feel free to join.

https://code.systems.ethz.ch/project/view/55/
rack-tracing@lists.inf.ethz.ch