Technology, Innovation and Education for Industry: The SENAI CIMATEC Campus



SLIDE 1

Technology, Innovation and Education for Industry

UNIT

SLIDE 2

The SENAI CIMATEC Campus

Highlights

  • 4 buildings
  • More than 35,000 m²
  • Over US$200 million of investment
  • 42 competence areas
  • More than 800 employees

SLIDE 3

SENAI CIMATEC Supercomputing Center

SLIDE 4

Supercomputing Center Timeline

[Figure: timeline with marked years 2012, 2015, 2016, 2018, 2019, 2020, 2021]

  • 0. Cloud (Datacenter)
  • 1. Yemoja - Oil & Gas
  • 2. Fiocruz OMOLU
  • 3. HPC FINEP
  • 4. HPC Oil & Gas (CPU)
  • 5. HPC Oil & Gas (GPU)
  • 6. HPC Industrial CIMATEC Innovation
  • 7. Quantum

Capacity labels: 50 TFlops, 405 TFlops, 180 TFlops, 800 TFlops, 1.8 PFlops

Areas of Actuation: Services, Models, Projects, Research, Masters, Specialists, PhDs

SLIDE 5

HPC Ògún: Heterogeneous Computing

Intel Xeon

  • 59 Xeon 6148 nodes
  • Total: 127 TFlops

Xeon Phi

  • 4 nodes
  • 8 TFlops

NVIDIA GPU

  • 2 nodes x 2 P100 NVLink
  • Total: 13 TFlops

FPGA

  • 2 x Arria 10
  • Total: 2 TFlops

+ARM

+GPU

CS2I SINAPAD

SLIDE 6

Summary

  • RTM Brief Review
  • Main Computational Challenges
  • Reducing Memory Requirements
  • Hardware-based Acceleration
  • RTMCore's Architecture
  • Performance Tests
  • Conclusions

Performance and Energy Efficiency Analysis of a Reverse Time Migration Design on FPGA

João Carlos Bittencourt, Joaquim Oliveira, Anderson Nascimento, Rodrigo Tutu, Lauê Jesus, Georgina Rojas, Deusdete Matos, Leonardo Fialho, André Lima, Erick Nascimento, João Marcelo Souza, Adhvan Furtado, and Wagner Oliveira

SLIDE 7
Overview of Reverse Time Migration (RTM)

  • RTM is a seismic migration technique for accurate imaging of subsurfaces with great structural and velocity complexity
  • Largely used in the seismic imaging flow for refining boundaries in velocity model building processes (FWI, PSO, tomography, etc.)

[Figure: RTM block with the labels Seismogram Data, Impulse Source, Input Velocity Model, and Enhanced Subsurface Image]

SLIDE 8
Overview of Reverse Time Migration (RTM)

  • Project specifics:
      • 2D RTM
      • Point source and receiver
      • Second-order acoustic wave
      • Finite-difference based solution
      • P-waves only

[Figures: Wave Propagation; Geometric Layout; Imaging Condition]
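
The ingredients named on this slide (2D, second-order acoustic wave equation, finite differences, P-waves only) can be illustrated with a minimal NumPy sketch of one wave-propagation time step. The grid size, constant velocity, time step, 5-point spatial stencil, and zero boundaries are all assumptions made for illustration; the slides do not state the stencil order or any of these values, and this is not the authors' implementation.

```python
import numpy as np

def acoustic_step(p_prev, p_curr, vel, dt, dx):
    """Advance the 2D acoustic wave equation by one time step.

    Second order in time; the 5-point (second-order) Laplacian is an
    assumed stencil. Boundary points are held at zero.
    """
    lap = np.zeros_like(p_curr)
    # 5-point Laplacian evaluated on interior points only
    lap[1:-1, 1:-1] = (
        p_curr[2:, 1:-1] + p_curr[:-2, 1:-1]
        + p_curr[1:-1, 2:] + p_curr[1:-1, :-2]
        - 4.0 * p_curr[1:-1, 1:-1]
    ) / dx**2
    p_next = 2.0 * p_curr - p_prev + (vel * dt) ** 2 * lap
    # Simple zero-pressure boundary (illustrative, not an absorbing boundary)
    p_next[0, :] = p_next[-1, :] = 0.0
    p_next[:, 0] = p_next[:, -1] = 0.0
    return p_next

def propagate(nz=64, nx=64, nt=20, vel_ms=2000.0, dx=10.0):
    """Propagate an impulse from the grid centre for nt steps (hypothetical driver)."""
    vel = np.full((nz, nx), vel_ms)
    # Time step chosen well inside the 2D stability limit v*dt/dx <= 1/sqrt(2)
    dt = 0.4 * dx / (vel_ms * np.sqrt(2.0))
    p_prev = np.zeros((nz, nx))
    p_curr = np.zeros((nz, nx))
    for it in range(nt):
        p_next = acoustic_step(p_prev, p_curr, vel, dt, dx)
        if it == 0:
            p_next[nz // 2, nx // 2] += 1.0  # impulse point source
        p_prev, p_curr = p_curr, p_next
    return p_curr

field = propagate()
print(field.shape, float(np.abs(field).max()))
```

Each interior update reads five neighbouring values of the current field plus one value of the previous field, which is the kind of memory traffic a stencil operator generates.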

SLIDE 9
Main Computational Challenges

  • RTM requires massive computation power, memory and storage to migrate even small fields
  • Finite-difference (stencil) operators require several memory accesses
  • Migration time and associated energy costs may be prohibitive at production scale

SLIDE 10
Main Computational Challenges

  • Optimization goals:
      • Reducing memory requirements
      • Reducing migration time and energy consumption
  • Design strategy:
      • Choosing memory-efficient algorithms
      • Optimizing memory access
      • Efficient design of heterogeneous computing accelerators on FPGA and GPU

SLIDE 11
Reducing Memory Requirements

  • Focus on boundary treatment strategies:
      • Traditional checkpoint strategy [1]
      • Random Boundary Condition (RBC) [2]
      • Hybrid Boundary Condition (HBC) [3]
          • During forward propagation, a slice of the pressure field's upper border is saved for each time step
          • During backward propagation, the border slices are used for source wave reconstruction
  • Test specification:
      • Pluto 2D model (6,960 x 1,201)
      • Number of shots: 1
      • Time steps: 12,860

[Figure: 2D Pluto Velocity Model, 6,960 indexes]

Boundary Condition Memory Requirements

Strategy      Required Memory (GB)   Image Quality
Checkpoint    311.4                  High
RBC           0.25                   Low
HBC           1.04                   High
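
To see why saving only border slices shrinks the footprint so sharply, the following back-of-the-envelope sketch compares the raw size of storing the whole pressure field at every time step with storing a single grid row per time step. The bytes-per-sample value and the one-row border are assumptions; the slides do not describe the actual storage layout, so these figures are not expected to reproduce the reported memory requirements.

```python
GIB = 1024 ** 3

def full_field_bytes(nx, nz, nt, bytes_per_sample):
    """Raw bytes needed to keep the whole nx-by-nz field for all nt steps."""
    return nx * nz * nt * bytes_per_sample

def border_slice_bytes(nx, nt, bytes_per_sample, rows=1):
    """Raw bytes needed to keep `rows` border rows of width nx for all nt steps."""
    return nx * rows * nt * bytes_per_sample

# Test specification from the slide; 4 bytes per sample is an assumption.
nx, nz, nt = 6960, 1201, 12860
bps = 4

full = full_field_bytes(nx, nz, nt, bps)
border = border_slice_bytes(nx, nt, bps)
print(f"full field  : {full / GIB:.1f} GiB")
print(f"border slice: {border / GIB:.2f} GiB")
print(f"reduction   : {full // border}x")
```

Under these assumptions the reduction factor equals the grid depth, since one row is stored in place of every row.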

SLIDE 12
Reducing Memory Requirements

  • Fixed-point representation
      • Fixed-point operations generally require fewer clock cycles
      • Word length fixed at 24 bits
      • Memory efficiency is increased
  • HW/SW validation
      • A fixed-point reference software model was developed and its outputs were verified

24-bit Fixed-point Numeric Representation*

[Figure: 1 sign bit followed by 23 fraction bits]

  • Bit 0 – Sign bit
  • Bits 1-23 – Fraction part

*No integer part, all values between -1 and 1
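
A 24-bit format with one sign bit and 23 fraction bits can be modelled in a few lines, which is roughly what a software reference model needs to do. The sketch below assumes a two's-complement encoding with saturation and floor rounding on multiplication; the slide only states the bit layout, so those choices and all function names are hypothetical.

```python
FRAC_BITS = 23
SCALE = 1 << FRAC_BITS   # 2**23 codes per unit
Q_MIN = -SCALE           # most negative 24-bit two's-complement code (-1.0)
Q_MAX = SCALE - 1        # most positive code (just below 1.0)

def saturate(q: int) -> int:
    """Clamp an integer code to the 24-bit signed range."""
    return max(Q_MIN, min(Q_MAX, q))

def to_fixed(x: float) -> int:
    """Quantize a real value to the nearest 24-bit code, saturating at the ends."""
    return saturate(round(x * SCALE))

def from_fixed(q: int) -> float:
    """Convert a 24-bit code back to a real value."""
    return q / SCALE

def fixed_mul(a: int, b: int) -> int:
    """Multiply two codes; the shift drops the extra 23 fraction bits (floor)."""
    return saturate((a * b) >> FRAC_BITS)

half = to_fixed(0.5)
print(from_fixed(fixed_mul(half, half)))
print(abs(from_fixed(to_fixed(0.1)) - 0.1))
```

With no integer bits, every intermediate value has to stay inside the representable range, which is why saturation is included in this sketch.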

SLIDE 13
Hardware-based Acceleration

  • The complete solution is a HW/SW co-design:
      • RTM CPU-based host application
      • RTM FPGA-based acceleration kernel
  • The host application is responsible for:
      • Configuring kernel parameters
      • Processing input and output data
      • Distributing shots among multiple FPGAs
      • Stacking output images
  • Each kernel performs a full image migration
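
The host-side responsibilities listed here (distributing shots and stacking output images) can be sketched as follows. The round-robin policy, the function names, and the placeholder `migrate_shot` that stands in for the FPGA kernel are all assumptions for illustration; the slides do not describe the host application's actual interface.

```python
import numpy as np

def distribute_shots(shot_ids, n_devices):
    """Assign shots to devices round-robin (an assumed policy)."""
    return [shot_ids[d::n_devices] for d in range(n_devices)]

def migrate_shot(shot_id, shape):
    """Placeholder for the accelerator kernel: returns a dummy per-shot image."""
    return np.full(shape, float(shot_id))

def stack_images(images):
    """Stack per-shot images by summing them sample by sample."""
    return np.sum(images, axis=0)

def run_host(shot_ids, n_devices, shape=(4, 4)):
    """Migrate every shot and stack the results into one image."""
    partial = []
    for device_shots in distribute_shots(shot_ids, n_devices):
        # On real hardware each device would work through its list concurrently.
        for shot_id in device_shots:
            partial.append(migrate_shot(shot_id, shape))
    return stack_images(partial)

print(distribute_shots(list(range(5)), 2))
print(run_host(list(range(5)), 2)[0, 0])
```

Because each kernel produces a full image for its shot, the shots are independent and the only cross-device step is the final stacking.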

SLIDE 14

Co-design Architecture

SLIDE 15
RTMCore's Architecture

  • Space parallelism:
      • All pressure fields of the same time step can be updated simultaneously
      • Multiple Processing Elements update up to 21 pressure points per iteration
  • Time parallelism:
      • Consecutive time steps can be computed in a pipeline
      • A total of 24 cascading Pipelined Staged Modules (PSM) stream time iterations

[Figures: Space Parallelism; Time Parallelism]
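
A rough way to read the two parallelism figures is to count idealized work units. The sketch below assumes that each of the 24 cascaded modules advances the field by one time step, so one pass over the grid advances 24 steps, and that 21 points are updated per iteration. It ignores pipeline fill, memory stalls, and boundary handling, so it is a counting model under stated assumptions and not a performance prediction.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b

def passes_needed(time_steps: int, pipeline_stages: int) -> int:
    """Grid passes needed if each pass advances `pipeline_stages` time steps."""
    return ceil_div(time_steps, pipeline_stages)

def iterations_per_pass(grid_points: int, points_per_iteration: int) -> int:
    """Iterations needed to stream every grid point through once."""
    return ceil_div(grid_points, points_per_iteration)

# Grid and time-step counts from the test specification in the deck.
grid_points = 6960 * 1201
time_steps = 12860

n_passes = passes_needed(time_steps, 24)
n_iters = iterations_per_pass(grid_points, 21)
print(n_passes, n_iters, n_passes * n_iters)
```

In this model, adding pipeline stages reduces the number of passes, while adding processing elements reduces the iterations per pass.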

SLIDE 16
RTMCore's Architecture

  • The design model is based on research presented in [4]

[Figure: Proposed Kernel Architecture]

SLIDE 17
Performance Evaluation

  • Evaluation of the FPGA performance against traditional acceleration alternatives, such as GPU and multithreading
  • Two aspects were considered:
      • Migration time: how fast is a seismic shot migrated?
      • Energy efficiency: which accelerator delivers more performance while requiring less energy?

Migration Time

$T_{mig} = T_{cpu} + T_{write} + T_{read} + T_{kernel}$

Consumed Energy

  • $T = T_{mig}$ (hours)
  • $N$ = number of power samples
  • $P(i)$ = instantaneous power (W)
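
The migration-time sum and the energy symbols on this slide can be tied together in a short sketch. The energy expression used here, mean sampled power multiplied by the migration time in hours, is an assumption that is merely consistent with the listed symbols T, N, and P(i); the slide's own energy equation is not available in the text.

```python
def migration_time(t_cpu, t_write, t_read, t_kernel):
    """Total migration time as the sum of its four components (same unit in, same unit out)."""
    return t_cpu + t_write + t_read + t_kernel

def consumed_energy_wh(power_samples_w, t_mig_hours):
    """Energy in Wh as (T / N) * sum of P(i): an assumed form, see the note above."""
    n = len(power_samples_w)
    if n == 0:
        raise ValueError("at least one power sample is required")
    return t_mig_hours * sum(power_samples_w) / n

# Hypothetical example: one minute of samples at a constant 100 W.
samples = [100.0] * 600
print(consumed_energy_wh(samples, 1.0 / 60.0))
```

Averaging the samples before multiplying by the duration keeps the result independent of how many samples the meter happened to record.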

SLIDE 18
Performance Evaluation

  • RTM implementations for performance comparison:
      • A. Serial CPU: used as the reference for speedup analysis
      • B. Multithread CPU: 40 CPUs computing pressure fields in parallel for each time step (space parallelism)
      • C. GPU CUDA: NVIDIA's Titan X (11 TFLOPs) exploiting massive space parallelism
      • D. FPGA: RTM kernel exploiting both space and time parallelism

[Figures: Multithread CPU; NVIDIA's Titan X; Intel's Arria 10 Dev. Kit]

SLIDE 19

Performance Evaluation

  • Power measuring methodology
      • A power meter device was placed between the power supply and the host
      • Both host and device power were measured during RTM executions
      • The power meter device was configured to collect samples at 10 Hz
      • Only GPU and FPGA power were measured

[Figures: Energy Measuring Setup; Example: 1 Min. Migration Samples]

SLIDE 20
Performance Results

  • Input parameters
      • Pluto 2D (6,960 x 1,201)
      • 12,860 time steps
      • Shot position: 3,480 x 0
      • Number of shots: 1
      • Overall workload: 1.4 GB
  • Efficiency measured in Speedup/Wh

[Figure label: 6,960 indexes]

Implementation    Runtime (s)   Speedup   Energy (Wh)   Efficiency
Serial CPU        21,873.85     1
Multithread       2,429.5       9
GPU Titan X       182.7         124       36            3.44
FPGA Arria 10     194           112       20            5.60
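
The efficiency metric, speedup per watt-hour, can be recomputed directly from the reported speedup and energy values for the two measured accelerators. The sketch below uses only those reported numbers; the function and variable names are illustrative.

```python
def efficiency(speedup, energy_wh):
    """Efficiency as defined in the deck: speedup divided by consumed energy (Wh)."""
    return speedup / energy_wh

# (speedup, energy in Wh) as reported for the two accelerators with power measurements
reported = {
    "GPU Titan X": (124, 36),
    "FPGA Arria 10": (112, 20),
}

for name, (speedup, energy_wh) in reported.items():
    print(f"{name}: {efficiency(speedup, energy_wh):.2f} speedup/Wh")

ratio = efficiency(112, 20) / efficiency(124, 36)
print(f"FPGA / GPU efficiency ratio: {ratio:.2f}")
```

The metric rewards an accelerator for being slightly slower if it uses proportionally less energy, which is the trade-off the results highlight.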

SLIDE 21
Concluding Remarks

  • Scalability of the solution lies in the parallelization of shots
      • Multiple FPGA boards in one or more compute nodes
  • Higher scalability can be achieved by exploring temporal parallelism
      • Increasing the number of Pipeline Stage Modules
      • More iterations could be computed in parallel
  • Exploration of fixed-point computation
      • Possibility to explore such a method in 3D stencil operators

SLIDE 22
Concluding Remarks

  • Speedups of 112x can be achieved when compared to a sequential CPU implementation
      • GPU is only 9% faster
      • Consideration: FPGA achieved such performance with 8 times lower frequency
  • Although the design presents a lower speedup compared to GPU, our FPGA accelerator achieved better energy efficiency
      • The power consumption when compared to a GPU has been reduced up to 55% with an efficiency 60% greater

SLIDE 23

Acknowledgments


SLIDE 24

References

[1] Symes, William W. "Reverse time migration with optimal checkpointing." Geophysics 72.5 (2007): SM213-SM221.

[2] Clapp, Robert G. "Reverse time migration with random boundaries." SEG Technical Program Expanded Abstracts 2009. Society of Exploration Geophysicists, 2009. 2809-2813.

[3] Liu, Hongwei, et al. "Wavefield reconstruction methods for reverse time migration." Journal of Geophysics and Engineering 10.1 (2012): 015004.

[4] Sano, Kentaro, Yoshiaki Hatsuda, and Satoru Yamamoto. "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth." IEEE Transactions on Parallel and Distributed Systems 25.3 (2013): 695-705.

SLIDE 25

Thank You

João Marcelo Silva Souza
joao.marcelo@fieb.org.br
SENAI CIMATEC
FIEB | www.fieb.org.br
+55 (071) 3462-8449