SLIDE 1

The CBM First-level Event Selector, Timeslice Building and Availability Studies

Helvi Hartmann hhartmann@fias.uni-frankfurt.de

DPG Frühjahrstagung (DPG Spring Meeting) - Darmstadt 17.03.2016 HK 54

Prof. Dr. Volker Lindenstruth

FIAS Frankfurt Institute for Advanced Studies, Goethe-Universität Frankfurt am Main, Germany http://compeng.uni-frankfurt.de

SLIDE 2

Introduction

  • CBM detectors are untriggered
  • free-streaming data, expected data rate of ~1 TB/s
  • online event reconstruction using timeslices

Challenge

Figure: a timeslice composed of timeslice components; each component is a run of microslices followed by an overlap microslice (MS). Labels: 0/0 ... 100/0, 0/1 ..., 0/1000 ... 100/1000.

Figure: FLES with compute nodes and input + compute nodes connected by InfiniBand.

SLIDE 3

Figure: FLES data flow. Compute nodes and input + compute nodes are connected by InfiniBand; the stages shown are input interface, timeslice building, and reconstruction & analysis. Microslices are combined into timeslices.

Figure labels: timeslice components with microslice 0/0 ... overlap MS 100/0, up to microslice 0/1000 ... overlap MS 100/1000; 1MB.
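The timeslice layout in the figures can be sketched in a few lines of Python. This is only an illustration, not the FLES implementation. It assumes the labels read as microslice index / input index, that a component has a core of 100 microslices plus one overlap microslice (as the labels 0 ... 100 suggest), and the name `build_timeslice` is hypothetical.

```python
from typing import List, Tuple

# (microslice index, input index): hypothetical stand-in for real microslice data
Microslice = Tuple[int, int]


def build_timeslice(
    inputs: List[List[Microslice]],
    ts_index: int,
    core_size: int = 100,  # assumed from the figure labels 0 ... 100
    overlap: int = 1,      # assumed: one overlap microslice per component
) -> List[List[Microslice]]:
    """Return one timeslice: a list of timeslice components, one per input."""
    start = ts_index * core_size
    stop = start + core_size + overlap
    components = []
    for input_index, stream in enumerate(inputs):
        if stop > len(stream):
            raise ValueError(f"input {input_index} has no microslice {stop - 1} yet")
        # The same time interval is cut out of every input stream.
        components.append(stream[start:stop])
    return components


# Three inputs, each delivering microslices 0 ... 200 in time order.
inputs = [[(ms, inp) for ms in range(201)] for inp in range(3)]
ts0 = build_timeslice(inputs, 0)
ts1 = build_timeslice(inputs, 1)
print(ts0[0][0], ts0[0][-1])  # first and overlap microslice of component 0
print(ts1[0][0])              # timeslice 1 starts at the overlap microslice of timeslice 0
```

Because of the overlap, the last microslice of one timeslice is also the first microslice of the next, so each timeslice can be processed on its own compute node.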

SLIDE 4

Availability

  • MTBF - mean time between failures
  • MTTR - mean time to repair
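
For a system that alternates between running and being repaired, steady-state availability is commonly computed as MTBF / (MTBF + MTTR). The sketch below assumes that this standard definition is the one behind the availability plots; the function names are illustrative.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; mtbf and mttr must use the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


def required_mtbf(mttr: float, target: float) -> float:
    """MTBF needed to reach the target availability for a given MTTR."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    # Solve target = mtbf / (mtbf + mttr) for mtbf.
    return target * mttr / (1.0 - target)


print(availability(10.0, 10.0))   # equal MTBF and MTTR give 0.5
print(required_mtbf(1.0, 0.999))  # roughly 999 times the MTTR
```

In other words, 50% availability corresponds to MTBF = MTTR, and 99.9% requires the MTBF to be about 999 times the MTTR.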
SLIDE 5

Availability

Figure: availability as a function of MTBF and MTTR. Axis ticks span 10s to 4m and 1s to 3h; the availability scale runs from 0.9 to 1. The 50% and 99.9% levels are marked.

SLIDE 6

Availability

Figure: availability as a function of MTBF and MTTR. Axis ticks span 1d to 4m and 2min to 3h; the availability scale runs from 0.9 to 1. Marked on the plot: node failure, Restart Framework; 30min, 5min.

The estimated MTBF for a crucial node failure is two weeks, extrapolated from real-world data of the ALICE HLT.
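
Plugging the numbers from this slide into the usual relation availability = MTBF / (MTBF + MTTR) gives a feel for the two recovery times. This sketch assumes that 30min and 5min are the recovery times for a node failure and for a framework restart respectively; that pairing is read from the plot labels and is not stated explicitly.

```python
from datetime import timedelta


def availability(mtbf: timedelta, mttr: timedelta) -> float:
    """Steady-state availability from MTBF and MTTR."""
    return mtbf / (mtbf + mttr)  # timedelta / timedelta yields a float


mtbf = timedelta(weeks=2)  # estimated MTBF for a crucial node failure (from the slide)

# Assumed reading of the plot labels: one recovery time per scenario.
scenarios = {
    "node failure": timedelta(minutes=30),
    "restart framework": timedelta(minutes=5),
}

results = {name: availability(mtbf, mttr) for name, mttr in scenarios.items()}
for name, value in results.items():
    print(f"{name}: {value:.4%}")
```

Under these assumptions only the 5-minute recovery stays above 99.9% availability; the 30-minute recovery falls slightly below it.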

SLIDE 7

Availability

Figure: availability as a function of MTBF and MTTR. Axis ticks span 16min to 1w and 1s to 16min; the availability scale runs from 0.9 to 1. Marked on the plot: Restart Framework, 2d, 5min.

SLIDE 8

Case: corrupted input data

  • report error
  • use native InfiniBand Verbs implementation

Figure: input + compute node and compute nodes connected by InfiniBand, with the stages input interface, timeslice building and reconstruction; a corrupted timeslice reaches a compute node.

SLIDE 9

Case: process failure

  • report error
  • use native InfiniBand Verbs implementation

Figure: input + compute node and compute nodes connected by InfiniBand, with the stages input interface, timeslice building and reconstruction; microslice and timeslice are shown on the data path.

SLIDE 10

Can we use MPI as a high-level API instead of the low-level native InfiniBand Verbs implementation?

Figure: input + compute node and compute node connected by InfiniBand, with the stages input interface, timeslice building and reconstruction; microslice and timeslice are shown on the data path. Label: 1MB.

SLIDE 11

MPI Fault Tolerance

Figure: processes labelled i_k (process with rank i of generation k): 0_1 1_1 2_1, 0_2 1_2 2_2, 0_3 1_3 2_3. Labels: MPI_COMM_WORLD, child processes intracommunicator, parent-to-child intercommunicator.

In MPI: when one process crashes, all other processes within the same communicator crash!

-> not possible to create independent communicators

SLIDE 12

Control System

  • start/stop process on each node
  • detect errors

Figure: input + compute node and compute node connected by InfiniBand, with the stages input interface, timeslice building and reconstruction, plus the control system. Label: 1MB.
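
The two control-system tasks on this slide, starting and stopping a process on each node and detecting errors, can be illustrated with a small per-node supervisor. This is a minimal sketch, not the actual FLES control software: the function name `supervise`, the restart limit, and the use of the exit code as the error signal are all assumptions made for illustration.

```python
import subprocess
import sys
from typing import List


def supervise(cmd: List[str], max_restarts: int = 3) -> int:
    """Run cmd, restart it after a non-zero exit, return the number of restarts.

    Hypothetical helper: a non-zero exit code is treated as the detected error.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode  # start the process and wait for it
        if exit_code == 0:
            return restarts  # clean stop, nothing left to recover
        if restarts == max_restarts:
            raise RuntimeError(f"{cmd[0]} failed {restarts + 1} times, giving up")
        restarts += 1  # error detected: start the process again


if __name__ == "__main__":
    # A process that exits cleanly needs no restart.
    print(supervise([sys.executable, "-c", "pass"]))
```

A real control system would additionally need to watch processes on remote nodes and handle hangs, which an exit code alone does not reveal.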

SLIDE 13

Conclusion and Outlook

Availability

  • desired availability of 99.9%
  • higher failure rates during commissioning
  • failures no more often than every 2 days

  • MPI is not fault tolerant
  • use native InfiniBand Verbs implementation for timeslice building
  • add control software to orchestrate processes and allow recovery from errors