[PPT] Technology, Innovation and Education for Industry - The SENAI CIMATEC Campus PowerPoint Presentation

SLIDE 1

Technology, Innovation and Education for Industry

UNIT

SLIDE 2

Highlights

4 buildings
More than 35,000 m²
Over US$200 million of investment
42 competence areas
More than 800 employees

The SENAI CIMATEC Campus

SLIDE 3

SENAI CIMATEC Supercomputing Center

SLIDE 4

Supercomputing Center Timeline

[Timeline figure. Years shown: 2012, 2015, 2016, 2018, 2019, 2020, 2021]

Milestones:
0. Cloud (Datacenter)
1. Yemoja - Oil & Gas
2. Fiocruz - OMOLU
3. HPC FINEP
4. HPC Oil & Gas (CPU)
5. HPC Oil & Gas (GPU)
6. HPC Industrial CIMATEC Innovation
7. Quantum

Performance figures shown: 50 TFlops, 180 TFlops, 405 TFlops, 800 TFlops, 1.8 PFlops

Areas of Actuation: Services, Models, Projects, Research, Masters, Specialists, PhDs

SLIDE 5

+ARM

Intel Xeon

59 nodes, Xeon 6148
Total: 127 TFlops

Xeon Phi

4 nodes
8 TFlops

NVidia GPU

2 nodes x 2 P100 NVLink
Total: 13 TFlops

FPGA

2 x Arria 10
Total: 2 TFlops

+GPU

CS2I SINAPAD

HPC Ògún: Heterogeneous Computing

SLIDE 6

Summary

RTM Brief Review
Main Computational Challenges
Reducing Memory Requirements
Hardware-based Acceleration
RTMCore's Architecture
Performance Tests
Conclusions

Performance and Energy Efficiency Analysis of a Reverse Time Migration Design on FPGA

João Carlos Bittencourt, Joaquim Oliveira, Anderson Nascimento, Rodrigo Tutu, Lauê Jesus, Georgina Rojas, Deusdete Matos, Leonardo Fialho, André Lima, Erick Nascimento, João Marcelo Souza, Adhvan Furtado, and Wagner Oliveira

SLIDE 7

RTM is a Seismic Migration technique for accurate imaging of subsurfaces with great structural and velocity complexities

Largely used in Seismic Imaging Flow for refining boundaries in velocity model building processes (FWI, PSO, Tomography, etc.)

[Figure labels: Seismogram Data, Impulse Source, Input Velocity Model, RTM, Enhanced Subsurface Image]

Overview on Reverse Time Migration (RTM)

SLIDE 8

Project specifics:
2D RTM
Point Source and Receiver
Second-order acoustic wave
Finite-difference based solution
P-waves only

[Figure panels: Wave Propagation, Geometric Layout, Imaging Condition]

Overview on Reverse Time Migration (RTM)
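
To make the "second-order acoustic wave, finite-difference based solution" item concrete, the following is a minimal NumPy sketch of one explicit time step on a 2D grid. It is not the project's kernel: the 5-point Laplacian, the zero-valued edges, and the names `step_acoustic_2d` and `propagate` are assumptions chosen for illustration only.

```python
import numpy as np


def step_acoustic_2d(p_prev, p_curr, vel, dt, dx):
    """Advance the pressure field one time step (2nd order in time and space)."""
    lap = np.zeros_like(p_curr)
    # 5-point Laplacian on interior points; edge points keep a zero Laplacian,
    # so no absorbing boundary is modelled in this sketch.
    lap[1:-1, 1:-1] = (
        p_curr[2:, 1:-1] + p_curr[:-2, 1:-1]
        + p_curr[1:-1, 2:] + p_curr[1:-1, :-2]
        - 4.0 * p_curr[1:-1, 1:-1]
    ) / dx**2
    # Leapfrog update of the acoustic wave equation.
    return 2.0 * p_curr - p_prev + (vel * dt) ** 2 * lap


def propagate(vel, src_pos, wavelet, dt, dx):
    """Run one forward propagation with a point source at grid index src_pos."""
    p_prev = np.zeros_like(vel)
    p_curr = np.zeros_like(vel)
    for amp in wavelet:
        p_next = step_acoustic_2d(p_prev, p_curr, vel, dt, dx)
        p_next[src_pos] += amp  # inject the source sample for this time step
        p_prev, p_curr = p_curr, p_next
    return p_curr
```

Each output point reads five neighbouring values of the current field plus one value of the previous field, so the update is dominated by memory traffic rather than arithmetic.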

SLIDE 9

RTM requires massive computing power, memory and storage to migrate even small fields

Finite-difference (Stencil) operators require several memory accesses

Migration time and associated energy costs may be prohibitive on production scale

Main Computational Challenges

SLIDE 10

Optimization Goals:
Reducing memory requirements
Reducing migration time and energy consumption

Design Strategy:
Choosing memory efficient algorithms
Optimizing memory access
Efficient design of heterogeneous computing accelerators on FPGA and GPU

Main Computational Challenges

SLIDE 11

Focus on boundary treatment strategies:
Traditional Checkpoint strategy [1]
Random Boundary Condition (RBC) [2]
Hybrid Boundary Condition (HBC) [3]

During forward propagation, a slice of the pressure field upper border is saved for each time step
On backward propagation, the border slices are used for source wave reconstruction

Test specification:
Pluto 2D model (6,960 x 1,201)
Number of Shots: 1
Time Steps: 12,860

[Figure: 2D Pluto Velocity Model; axis label: 6,960 indexes]

Boundary Condition Mem. Requirements

Strategy      Required Memory (GB)   Image Quality
Checkpoint    311.4                  High
RBC           0.25                   Low
HBC           1.04                   High

Reducing Memory Requirements
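
The gap between the strategies in the table comes from what is kept per time step: a whole pressure field versus a thin border slice. The sketch below estimates both footprints for the Pluto test grid. The slide does not state the sample width, the number of border rows saved, or the checkpointing schedule, so `BYTES_PER_SAMPLE = 4` and `rows=1` are assumptions and the results illustrate the scaling only. They are not expected to reproduce the table's values.

```python
def wavefield_bytes(nx, nz, nt, bytes_per_sample):
    """Bytes needed to keep the full pressure field for every time step."""
    return nx * nz * nt * bytes_per_sample


def border_bytes(nx, nt, bytes_per_sample, rows=1):
    """Bytes needed to keep `rows` upper-border rows for every time step."""
    return nx * rows * nt * bytes_per_sample


NX, NZ, NT = 6960, 1201, 12860  # grid and time steps from the test specification
BYTES_PER_SAMPLE = 4            # assumption: 32-bit samples

full = wavefield_bytes(NX, NZ, NT, BYTES_PER_SAMPLE)
border = border_bytes(NX, NT, BYTES_PER_SAMPLE)
print(f"full wavefield per shot: {full / 1e9:.1f} GB")
print(f"one border row per step: {border / 1e9:.2f} GB")
print(f"ratio: {full // border}x")
```

With one saved row, the ratio equals the grid depth, which is why border-based reconstruction cuts storage by about three orders of magnitude on this model.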

SLIDE 12

Fixed-point representation
Fixed-point operations generally require fewer clock cycles
Word length fixed at 24 bits
Memory efficiency is increased

HW/SW Validation
A fixed-point reference software model was developed and its outputs were verified

24-bit Fixed-point Numeric Representation* (1 sign bit, 23 fraction bits)

Bit 0 – Sign bit
Bits 1-23 – Fraction part

*No Integer part, all values between –1 and 1

Reducing Memory Requirements
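
The 24-bit format above (one sign bit, 23 fraction bits, no integer part) can be sketched in software as follows. This is an illustrative model, not the project's reference model: two's-complement encoding, saturation on overflow, and the helper names `to_fixed`, `from_fixed` and `fixed_mul` are all assumptions.

```python
FRACTION_BITS = 23
SCALE = 1 << FRACTION_BITS        # 2**23 quantization steps per unit
Q_MIN, Q_MAX = -SCALE, SCALE - 1  # signed 24-bit integer range (assumed two's complement)


def saturate(q):
    """Clamp an integer to the signed 24-bit range."""
    return max(Q_MIN, min(Q_MAX, q))


def to_fixed(x):
    """Quantize a float in [-1, 1) to a 24-bit fixed-point integer."""
    return saturate(int(round(x * SCALE)))


def from_fixed(q):
    """Convert a 24-bit fixed-point integer back to a float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two fixed-point values; the 46-bit fraction product
    is shifted back to 23 fraction bits (rounding toward minus infinity)."""
    return saturate((a * b) >> FRACTION_BITS)


half = to_fixed(0.5)
print(from_fixed(fixed_mul(half, half)))
```

Because the format has no integer part, inputs must be scaled into the open interval between -1 and 1 before quantization, and the quantization error is at most half of one step (2**-24) for in-range values.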

SLIDE 13

Complete solution is a hw/sw co-design:
RTM CPU-based host application
RTM FPGA-based acceleration kernel

The Host application is responsible for:
Configuring kernel parameters
Processing input and output data
Distributing shots among multiple FPGA
Stacking output images

Each kernel performs a full image migration

Hardware-based Acceleration
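
The host-side responsibilities of distributing shots and stacking output images can be sketched as below. Everything here is hypothetical: `migrate_shot` is a stand-in that returns a constant image instead of launching an FPGA kernel, round-robin is just one possible distribution policy, and stacking is assumed to be a sample-by-sample sum.

```python
import numpy as np


def distribute_shots(shot_ids, n_devices):
    """Round-robin assignment of shot ids to devices (assumed policy)."""
    return [shot_ids[d::n_devices] for d in range(n_devices)]


def migrate_shot(shot_id, shape):
    """Hypothetical stand-in for one kernel run: returns a fake partial image."""
    return np.full(shape, float(shot_id))


def stack_images(images):
    """Stack partial images by summing them sample by sample."""
    return np.sum(images, axis=0)


def run_host(shot_ids, n_devices, shape):
    """Distribute shots, migrate each one, and stack the partial images."""
    assignments = distribute_shots(shot_ids, n_devices)
    partial = [
        migrate_shot(shot, shape)
        for device_shots in assignments
        for shot in device_shots
    ]
    return stack_images(partial)


print(distribute_shots([0, 1, 2, 3, 4], 2))
```

Since each kernel performs a full image migration for one shot, shots are independent units of work and the only cross-device step is the final stacking.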

SLIDE 14

Co-design Architecture

SLIDE 15

Space Parallelism:
All pressure fields of the same time step can be updated simultaneously
Multiple Processing Elements update up to 21 pressure points per iteration

Time Parallelism:
Consecutive time steps can be computed in pipeline
A total of 24 cascading Pipelined Staged Modules (PSM) stream time iterations

[Figure panels: Space Parallelism, Time Parallelism]

RTMCore's Architecture
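
One consequence of time parallelism is fewer streaming passes over the pressure field. The arithmetic below assumes that each pass through the cascade advances the field by one time step per PSM. The slides do not state this explicitly, so treat `passes_needed` as a back-of-the-envelope model, not a description of the kernel.

```python
import math


def passes_needed(time_steps, pipeline_stages):
    """Streaming passes required if every pass advances `pipeline_stages` time steps."""
    return math.ceil(time_steps / pipeline_stages)


TIME_STEPS = 12860  # from the Pluto test specification
PSM_COUNT = 24      # cascaded Pipelined Staged Modules

print(passes_needed(TIME_STEPS, 1))
print(passes_needed(TIME_STEPS, PSM_COUNT))
```

Under this assumption, doubling the number of cascaded modules would roughly halve the number of passes, provided the extra modules fit on the device.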

SLIDE 16

The design model is based on research presented in [4]

[Figure: Proposed Kernel Architecture]

RTMCore's Architecture

SLIDE 17

Evaluation of the FPGA performance against traditional acceleration alternatives, such as GPU and Multithreading

Two aspects were considered
Migration Time: how fast is a seismic shot migrated?
Energy efficiency: which accelerator delivers more performance, while requiring less energy?

Migration Time

$T_{mig} = T_{cpu} + T_{write} + T_{read} + T_{kernel}$

Consumed Energy

$T = T_{mig}$ (hours)
$N$ = Number of Power Samples
$P(i)$ = Instantaneous Power (W)

Performance Evaluation
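
The two metrics can be computed as in the sketch below. The migration-time sum follows the slide directly. The energy formula itself did not survive in the slide text, so the code assumes the common estimate E = (T / N) * sum of P(i), that is, average sampled power multiplied by migration time in hours. The function names are illustrative.

```python
def migration_time(t_cpu, t_write, t_read, t_kernel):
    """Total migration time as the sum of host, transfer and kernel times."""
    return t_cpu + t_write + t_read + t_kernel


def consumed_energy_wh(power_samples_w, t_mig_hours):
    """Assumed estimate: mean of the sampled power (W) times migration time (h)."""
    n = len(power_samples_w)
    return t_mig_hours * sum(power_samples_w) / n


# Illustrative numbers only: ten samples of 100 W over half an hour.
print(consumed_energy_wh([100.0] * 10, 0.5))
```

With evenly spaced samples, averaging the power and multiplying by the elapsed time is equivalent to integrating power over the run.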

SLIDE 18

RTM implementations for performance comparison:
A. Serial CPU: used as target reference for speedup analysis
B. Multithread CPU: 40 CPUs computing pressure fields in parallel for each time step (space parallelism)
C. GPU CUDA: NVidia's Titan X (11 TFLOPs) exploiting massive space parallelism
D. FPGA: RTM kernel exploiting both space and time parallelism

[Figure labels: Multithread CPU, NVidia's Titan X, Intel's Arria 10 Dev. Kit]

Performance Evaluation

SLIDE 19

Example: 1 Min. Migration Samples

Power Measuring Methodology
A power meter device was placed between power supply and host
Both host and device power were measured during RTM executions
Power meter device was configured to collect samples at 10 Hz
Only GPU and FPGA power were measured

[Figure: Energy Measuring Setup]

Performance Evaluation

SLIDE 20

Input Parameters
Pluto 2D (6,960 x 1,201)
12,860 Time steps
Shot Position: 3,480 x 0
Number of Shots: 1
Overall Workload: 1.4 GB

Efficiency measured in Speedup/Wh

Performance Results

[Figure axis label: 6,960 indexes]

Implementation    Runtime (s)   Speed up   Energy (Wh)   Efficiency
Serial CPU        21,873.85     1          -             -
Multithread       2,429.5       9          -             -
GPU Titan X       182.7         124        36            3.44
FPGA Arria 10     194           112        20            5.60

Performance Results
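
The efficiency column is the reported speedup divided by the reported energy in Wh. The short sketch below recomputes it from the table's values. The dictionary layout and the function name `efficiency` are illustrative.

```python
def efficiency(speedup, energy_wh):
    """Efficiency as defined on the slide: speedup per watt-hour."""
    return speedup / energy_wh


# (speedup, energy in Wh) as reported in the results table
results = {
    "GPU Titan X": (124, 36),
    "FPGA Arria 10": (112, 20),
}

for name, (speedup, energy_wh) in results.items():
    print(f"{name}: {efficiency(speedup, energy_wh):.2f} speedup/Wh")
```

Only the GPU and FPGA rows can be evaluated this way, since power was not measured for the CPU runs.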

SLIDE 21

Scalability of the solution lies in the parallelization of shots
Multiple FPGA boards in one or more compute nodes

Higher scalability can be achieved by exploring temporal parallelism
Increasing the number of Pipeline Stage Modules
More iterations could be computed in parallel

Exploration of fixed-point computation
Possibility to explore such method in 3D stencil operators


Concluding Remarks

SLIDE 22

Speedups of 112x can be achieved when compared to a sequential CPU implementation
GPU is only 9% faster
Consideration: FPGA achieved such performance with 8 times lower frequency

Although the design presents a lower speedup compared to GPU, our FPGA accelerator achieved better energy efficiency
The power consumption when compared to a GPU has been reduced up to 55% with an efficiency 60% greater

Concluding Remarks

SLIDE 23

Acknowledgments


SLIDE 24

[1] Symes, William W. "Reverse time migration with optimal checkpointing." Geophysics 72.5 (2007): SM213-SM221.
[2] Clapp, Robert G. "Reverse time migration with random boundaries." SEG Technical Program Expanded Abstracts 2009. Society of Exploration Geophysicists, 2009. 2809-2813.
[3] Liu, Hongwei, et al. "Wavefield reconstruction methods for reverse time migration." Journal of Geophysics and Engineering 10.1 (2012): 015004.
[4] Sano, Kentaro, Yoshiaki Hatsuda, and Satoru Yamamoto. "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth." IEEE Transactions on Parallel and Distributed Systems 25.3 (2013): 695-705.

References

SLIDE 25