SLIDE 1
UNIT
Technology, Innovation and Education for Industry
The SENAI CIMATEC Campus
Highlights
- 4 buildings
- More than 35,000 m²
- Over US$200 million of investment
- 42 competence areas
- More than 800 employees
SENAI CIMATEC
SLIDE 2
SLIDE 3
SENAI CIMATEC Supercomputing Center
SLIDE 4
Supercomputing Center Timeline
[Timeline figure; the mapping between years, systems and capacities is not recoverable from the extracted text]
Years shown: 2012, 2015, 2016, 2018, 2019, 2020, 2021
Systems and projects:
- 0. Cloud (Datacenter)
- 1. Yemoja - Oil & Gas
- 2. Fiocruz OMOLU
- 3. HPC FINEP
- 4. HPC Oil & Gas (CPU)
- 5. HPC Oil & Gas (GPU)
- 6. HPC Industrial CIMATEC Innovation
- 7. Quantum
Capacities shown: 50 TFlops, 180 TFlops, 405 TFlops, 800 TFlops, 1.8 PFlops
Areas of activity: Services, Models, Projects, Research, Masters, Specialists, PhDs
SLIDE 5
+ARM
Intel Xeon
- 59 nodes, Xeon 6148
- Total: 127 TF
Xeon Phi
- 4 nodes
- 8 TFlops
NVIDIA GPU
- 2 nodes x 2 P100 NVLink
- Total: 13 TF
FPGA
- 2 x Arria10
- Total: 2 TFlops
+GPU
CS2I SINAPAD
HPC Ògún: Heterogeneous Computing
SLIDE 6
Summary
- RTM Brief Review
- Main Computational Challenges
- Reducing Memory Requirements
- Hardware-based Acceleration
- RTMCore's Architecture
- Performance Tests
- Conclusions
Performance and Energy Efficiency Analysis of a Reverse Time Migration Design on FPGA
João Carlos Bittencourt, Joaquim Oliveira, Anderson Nascimento, Rodrigo Tutu, Lauê Jesus, Georgina Rojas, Deusdete Matos, Leonardo Fialho, André Lima, Erick Nascimento, João Marcelo Souza, Adhvan Furtado, and Wagner Oliveira
SLIDE 7
- RTM is a seismic migration technique for accurate imaging of subsurfaces with great structural and velocity complexities
- Largely used in the seismic imaging flow for refining boundaries in velocity model building processes (FWI, PSO, tomography, etc.)
[Figure labels: Seismogram Data, Impulse Source, Input Velocity Model, RTM, Enhanced Subsurface Image]
Overview on Reverse Time Migration (RTM)
SLIDE 8
- Project specifics:
- 2D RTM
- Point source and receiver
- Second-order acoustic wave
- Finite-difference based solution
- P-waves only
[Figure panels: Wave Propagation, Geometric Layout, Imaging Condition]
Overview on Reverse Time Migration (RTM)
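The ingredients listed on this slide (2D grid, point source, second-order acoustic wave equation, finite differences) can be sketched in a few lines. This is a minimal illustration, not the project's kernel. It assumes a 5-point spatial stencil, zero-valued grid edges, and the common zero-lag cross-correlation imaging condition, since the slides do not state the stencil order, the boundary handling, or which imaging condition is used. All function names are hypothetical.

```python
def step(p_prev, p_curr, vel, dt, dx):
    """Advance the pressure field by one time step (edges held at zero)."""
    nz, nx = len(p_curr), len(p_curr[0])
    p_next = [[0.0] * nx for _ in range(nz)]
    for i in range(1, nz - 1):
        for j in range(1, nx - 1):
            # 5-point Laplacian stencil: four neighbours minus 4x the centre
            lap = (p_curr[i + 1][j] + p_curr[i - 1][j]
                   + p_curr[i][j + 1] + p_curr[i][j - 1]
                   - 4.0 * p_curr[i][j])
            c2 = (vel[i][j] * dt / dx) ** 2
            p_next[i][j] = 2.0 * p_curr[i][j] - p_prev[i][j] + c2 * lap
    return p_next


def propagate(vel, src, wavelet, dt, dx):
    """Inject a point source and return the field after len(wavelet) steps."""
    nz, nx = len(vel), len(vel[0])
    p_prev = [[0.0] * nx for _ in range(nz)]
    p_curr = [[0.0] * nx for _ in range(nz)]
    for amp in wavelet:
        p_next = step(p_prev, p_curr, vel, dt, dx)
        p_next[src[0]][src[1]] += amp  # point-source injection
        p_prev, p_curr = p_curr, p_next
    return p_curr


def accumulate_image(image, src_field, rcv_field):
    """Assumed imaging condition: add the product of both fields in place."""
    for i, row in enumerate(image):
        for j in range(len(row)):
            row[j] += src_field[i][j] * rcv_field[i][j]


# Toy setup: constant 1500 m/s model, 10 m grid, 1 ms step (v*dt/dx = 0.15)
vel = [[1500.0] * 21 for _ in range(21)]
field = propagate(vel, (10, 10), [1.0] + [0.0] * 9, dt=0.001, dx=10.0)
```

Each interior point needs five reads of the current field plus one read of the previous field per update, which is why stencil operators are memory-bound.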
SLIDE 9
- RTM requires massive computation power, memory and storage to migrate even small fields
- Finite-difference (stencil) operators require several memory accesses
- Migration time and associated energy costs may be prohibitive on production scale
Main Computational Challenges
SLIDE 10
- Optimization goals:
- Reducing memory requirements
- Reducing migration time and energy consumption
- Design strategy:
- Choosing memory-efficient algorithms
- Optimizing memory access
- Efficient design of heterogeneous computing accelerators on FPGA and GPU
Main Computational Challenges
SLIDE 11
- Focus on boundary treatment strategies:
- Traditional checkpoint strategy [1]
- Random Boundary Condition (RBC) [2]
- Hybrid Boundary Condition (HBC) [3]
- During forward propagation, a slice of the pressure field upper border is saved for each time step
- On backward propagation, the border slices are used for source wave reconstruction
- Test specification:
- Pluto 2D model (6,960 x 1,201)
- Number of shots: 1
- Time steps: 12,860
[Figure: 2D Pluto Velocity Model (axis label: 6,960 indexes)]
Boundary Condition Memory Requirements
Strategy      Required Memory (GB)   Image Quality
Checkpoint    311.4                  High
RBC           0.25                   Low
HBC           1.04                   High
Reducing Memory Requirements
SLIDE 12
- Fixed-point representation
- Fixed-point operations generally require fewer clock cycles
- Word length fixed at 24 bits
- Memory efficiency is increased
- HW/SW validation
- A fixed-point reference software model was developed and its outputs were verified
24-bit fixed-point numeric representation*: 1 sign bit, 23 fraction bits
- Bit 0 – Sign bit
- Bits 1-23 – Fraction part
*No Integer part, all values between –1 and 1
Reducing Memory Requirements
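The 24-bit format above (one sign bit, 23 fraction bits, no integer part) can be modelled in software much as a reference model would. The sketch below is an illustration under stated assumptions: it assumes two's-complement encoding, round-to-nearest on conversion, and saturation at the range limits. The slides specify none of these, and the function names are hypothetical.

```python
FRAC_BITS = 23
SCALE = 1 << FRAC_BITS             # 2**23 codes per unit
Q_MIN, Q_MAX = -SCALE, SCALE - 1   # 24-bit two's-complement code range (assumed)


def saturate(code):
    """Clamp an integer code to the representable 24-bit range."""
    return max(Q_MIN, min(Q_MAX, code))


def to_fixed(x):
    """Quantize a float in roughly [-1, 1) to a 24-bit integer code."""
    return saturate(round(x * SCALE))


def to_float(code):
    """Convert a 24-bit integer code back to a float."""
    return code / SCALE


def fixed_mul(a, b):
    """Multiply two codes; the 46-fraction-bit product is shifted back to 23."""
    return saturate((a * b) >> FRAC_BITS)


half = to_fixed(0.5)
quarter = to_float(fixed_mul(half, half))   # 0.25, exactly representable
```

Storing each sample in 3 bytes instead of a 4-byte single-precision float is where the memory saving comes from. The price is that every value must stay between -1 and 1, so inputs need to be scaled accordingly.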
SLIDE 13
- Complete solution is a HW/SW co-design:
- RTM CPU-based host application
- RTM FPGA-based acceleration kernel
- The host application is responsible for:
- Configuring kernel parameters
- Processing input and output data
- Distributing shots among multiple FPGAs
- Stacking output images
- Each kernel performs a full image migration
Hardware-based Acceleration
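The host-side responsibilities of distributing shots and stacking images can be sketched as follows. This is a hypothetical outline only. It assumes round-robin scheduling, which the slides do not specify, and it uses a stub in place of the FPGA kernel call. It runs the devices one after another, whereas a real host would drive them concurrently.

```python
def distribute_shots(shot_ids, n_devices):
    """Assign shot ids to devices in round-robin order (assumed policy)."""
    queues = [[] for _ in range(n_devices)]
    for k, shot in enumerate(shot_ids):
        queues[k % n_devices].append(shot)
    return queues


def stack_images(images):
    """Sum per-shot images point by point into one stacked image."""
    nz, nx = len(images[0]), len(images[0][0])
    stacked = [[0.0] * nx for _ in range(nz)]
    for img in images:
        for i in range(nz):
            for j in range(nx):
                stacked[i][j] += img[i][j]
    return stacked


def migrate_shot_stub(shot_id, device):
    """Stand-in for the FPGA kernel: returns a tiny constant 2x3 image."""
    return [[float(shot_id)] * 3 for _ in range(2)]


def migrate_all(shot_ids, n_devices, migrate_shot):
    """Migrate every shot on its assigned device, then stack the results."""
    images = []
    for device, queue in enumerate(distribute_shots(shot_ids, n_devices)):
        for shot in queue:
            images.append(migrate_shot(shot, device))
    return stack_images(images)


final_image = migrate_all(list(range(5)), 2, migrate_shot_stub)
```

Because each kernel performs a full migration of one shot, shots are independent units of work and the only cross-device step is the final summation.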
SLIDE 14
Co-design Architecture
SLIDE 15
- Space parallelism:
- All pressure fields of the same time step can be updated simultaneously
- Multiple Processing Elements update up to 21 pressure points per iteration
- Time parallelism:
- Consecutive time steps can be computed in pipeline
- A total of 24 cascading Pipelined Staged Modules (PSM) stream time iterations
[Figure panels: Space Parallelism, Time Parallelism]
RTMCore's Architecture
SLIDE 16
- The design model is based on research presented in [4]
[Figure: Proposed Kernel Architecture]
RTMCore's Architecture
SLIDE 17
- Evaluation of the FPGA performance against traditional acceleration alternatives, such as GPU and multithreading
- Two aspects were considered:
- Migration time: how fast is a seismic shot migrated?
- Energy efficiency: which accelerator delivers more performance while requiring less energy?
Migration Time
$T_{mig} = T_{cpu} + T_{write} + T_{read} + T_{kernel}$
Consumed Energy
$T = T_{mig}$ (hours)
$N$ = number of power samples
$P(i)$ = instantaneous power (W)
Performance Evaluation
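The two metrics can be written out as short functions. The migration-time sum follows the slide directly. The consumed-energy formula itself did not survive extraction, so the version below is an assumed reconstruction from the listed symbols: the mean of the N power samples multiplied by the migration time in hours, giving Wh. Function names are hypothetical.

```python
def migration_time(t_cpu, t_write, t_read, t_kernel):
    """Total migration time: host work + device writes + reads + kernel."""
    return t_cpu + t_write + t_read + t_kernel


def consumed_energy_wh(power_samples_w, t_mig_hours):
    """Assumed form E = (T / N) * sum(P(i)): mean power (W) times hours."""
    n = len(power_samples_w)
    return t_mig_hours * sum(power_samples_w) / n


# Illustrative numbers only: one minute sampled at 10 Hz is 600 samples;
# a constant 120 W draw over 1/60 h corresponds to about 2 Wh.
example_wh = consumed_energy_wh([120.0] * 600, 1.0 / 60.0)
```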
SLIDE 18
- RTM implementations for performance comparison:
A. Serial CPU: used as target reference for speedup analysis
B. Multithread CPU: 40 CPUs computing pressure fields in parallel for each time step (space parallelism)
C. GPU CUDA: NVIDIA's Titan X (11 TFLOPs) exploiting massive space parallelism
D. FPGA: RTM kernel exploiting both space and time parallelism
[Figure labels: Multithread CPU, NVIDIA's Titan X, Intel's Arria 10 Dev. Kit]
Performance Evaluation
SLIDE 19
[Figure: Example: 1 Min. Migration Samples]
- Power measuring methodology
- A power meter device was placed between power supply and host
- Both host and device power were measured during RTM executions
- Power meter device was configured to collect samples at 10 Hz
- Only GPU and FPGA power were measured
[Figure: Energy Measuring Setup]
Performance Evaluation
SLIDE 20
- Input parameters
- Pluto 2D (6,960 x 1,201)
- 12,860 time steps
- Shot position: 3,480 x 0
- Number of shots: 1
- Overall workload: 1.4 GB
- Efficiency measured in Speedup/Wh
[Figure: velocity model (axis label: 6,960 indexes)]
Implementation   Runtime (s)   Speedup   Energy (Wh)   Efficiency
Serial CPU       21,873.85     1         -             -
Multithread      2,429.5       9         -             -
GPU Titan X      182.7         124       36            3.44
FPGA Arria 10    194           112       20            5.60
Performance Results
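The efficiency column follows from the stated definition (Speedup/Wh) applied to the reported speedup and energy values. A quick check, using only the figures from the table (the dictionary layout and names are illustrative):

```python
# Reported values for the two accelerators whose power was measured
results = {
    "GPU Titan X": {"speedup": 124, "energy_wh": 36},
    "FPGA Arria 10": {"speedup": 112, "energy_wh": 20},
}


def efficiency(speedup, energy_wh):
    """Efficiency as defined on the slide: speedup per watt-hour."""
    return speedup / energy_wh


eff = {name: round(efficiency(r["speedup"], r["energy_wh"]), 2)
       for name, r in results.items()}
# eff == {"GPU Titan X": 3.44, "FPGA Arria 10": 5.6}

# Relative efficiency of the FPGA over the GPU (about 1.63)
ratio = (efficiency(112, 20)) / (efficiency(124, 36))
```

The ratio of roughly 1.6 matches the "60% greater" efficiency quoted in the concluding remarks.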
SLIDE 21
- Scalability of the solution lies in the parallelization of shots
- Multiple FPGA boards in one or more compute nodes
- Higher scalability can be achieved by exploring temporal parallelism
- Increasing the number of Pipeline Stage Modules
- More iterations could be computed in parallel
- Exploration of fixed-point computation
- Possibility to explore such method in 3D stencil operators
Concluding Remarks
SLIDE 22
- Speedups of 112x can be achieved when compared to a sequential CPU implementation
- GPU is only 9% faster
- Consideration: FPGA achieved such a performance with 8 times lower frequency
- Although the design presents lower speedup compared to GPU, our FPGA accelerator achieved better energy efficiency
- The power consumption when compared to a GPU has been reduced up to 55% with an efficiency 60% greater
Concluding Remarks
SLIDE 23
Acknowledgments
SLIDE 24
[1] Symes, William W. "Reverse time migration with optimal checkpointing." Geophysics 72.5 (2007): SM213-SM221.
[2] Clapp, Robert G. "Reverse time migration with random boundaries." SEG Technical Program Expanded Abstracts 2009. Society of Exploration Geophysicists, 2009. 2809-2813.
[3] Liu, Hongwei, et al. "Wavefield reconstruction methods for reverse time migration." Journal of Geophysics and Engineering 10.1 (2012): 015004.
[4] Sano, Kentaro, Yoshiaki Hatsuda, and Satoru Yamamoto. "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth." IEEE Transactions on Parallel and Distributed Systems 25.3 (2013): 695-705.
References
SLIDE 25