Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport



Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport GTC 2018 Jeremy Sweezy Scientist Monte Carlo Methods, Codes and Applications Group 3/28/2018 Operated by Los Alamos National Security, LLC for the U.S.


SLIDE 1

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

Jeremy Sweezy

Scientist, Monte Carlo Methods, Codes and Applications Group

3/28/2018

LA-UR-18-XXXX

Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport

GTC 2018

SLIDE 2

What is Monte Carlo Particle Transport?

3/23/18 | 2 Los Alamos National Laboratory

– Follows the path of individual particles through a system
– Uses pseudo-random numbers to sample processes
– Randomly samples physical and non-physical processes
– Attributed to Stanislaw Ulam and Enrico Fermi
– Named because Ulam had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo”
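As a minimal sketch of what "sampling a process with pseudo-random numbers" means, the example below samples free-flight distances from the standard exponential law and estimates how many particles cross a purely absorbing slab without colliding. The function names, the slab setup, and the seed are illustrative assumptions, not taken from the talk.

```python
import math
import random

def sample_distance_to_collision(sigma_t, rng):
    """Sample a free-flight distance (cm) for a total cross-section sigma_t (1/cm)."""
    xi = rng.random()  # uniform in [0, 1)
    # 1 - xi lies in (0, 1], so the logarithm is always finite.
    return -math.log(1.0 - xi) / sigma_t

def transmitted_fraction(sigma_t, thickness, histories, seed=12345):
    """Estimate the fraction of particles crossing a purely absorbing slab uncollided."""
    rng = random.Random(seed)
    crossed = sum(
        1
        for _ in range(histories)
        if sample_distance_to_collision(sigma_t, rng) > thickness
    )
    return crossed / histories
```

For a slab of thickness 2 cm with `sigma_t = 0.5`, `transmitted_fraction(0.5, 2.0, 200_000)` should approach the analytic value `math.exp(-1.0)` as the number of histories grows.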

FERMIAC

SLIDE 3

Porting to Specialized Hardware is Prohibitively Expensive

–The world’s production Monte Carlo codes have decades of development
–LANL’s MCNP code has been in development since 1977
–Equally extensive amount of V&V effort
–Codes have to run on desktop machines and supercomputers
–DOE HPC platforms have been in a state of flux for the last 10 years

  • Cell Broadband Engine
  • Intel Xeon Phi (MIC)
  • GPUs
  • ARM???

Barrier #1: Limited Resources (Money, People, Time)

SLIDE 4

Monte Carlo Random Walk on GPU Hardware has reached a Performance Wall

  • At least 6 different research groups have ported the Monte Carlo random walk to GPU hardware for neutron transport
  • All report results against different numbers of CPUs
  • All get the same results!
  • Almost all are extremely simplified
  • Production codes will likely have worse performance
  • What are the limitations?

– Conditional branching
– Random data access
– No small computationally intensive kernel to accelerate

Barrier #2: Performance of random walk on GPUs

[Figure labels: 4.5x, 3.0x]

SLIDE 5

How do You Define Performance?

  • A computer scientist might measure performance as an increase in speed:

$$P = \frac{T_{CPU}}{T_{GPU}}$$

  • A Monte Carlo specialist would measure performance as a balance between speed and statistical variance, using a Figure-of-Merit:

$$\mathrm{FOM} = \frac{\sigma_{CPU}^{2}\, T_{CPU}}{\sigma_{GPU}^{2}\, T_{GPU}}$$

Example:

$$\mathrm{FOM} = \frac{0.1^{2} \times 1\ \text{min}}{0.05^{2} \times 2\ \text{min}} = 2$$

To date, almost all GPU implementations of Monte Carlo particle transport have focused on increasing speed.
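The difference between the two measures can be checked numerically. The sketch below assumes the figure of merit on this slide is the ratio of variance times run time on the CPU to the same product on the GPU; the function and parameter names are illustrative.

```python
def speedup(time_cpu, time_gpu):
    """Pure speed ratio: CPU run time divided by GPU run time."""
    return time_cpu / time_gpu

def relative_fom(sigma_cpu, time_cpu, sigma_gpu, time_gpu):
    """Relative figure of merit: (sigma_cpu^2 * T_cpu) / (sigma_gpu^2 * T_gpu).

    sigma_* are standard deviations of the tally; time_* are run times
    in the same unit (the unit cancels in the ratio).
    """
    return (sigma_cpu ** 2 * time_cpu) / (sigma_gpu ** 2 * time_gpu)
```

Reading the slide's example as a CPU run with σ = 0.1 after 1 min and a GPU run with σ = 0.05 after 2 min, `speedup(1.0, 2.0)` is 0.5 while `relative_fom(0.1, 1.0, 0.05, 2.0)` is about 2: the slower run still wins once variance is counted.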

SLIDE 6

Next Event Estimator

  • The next-event estimator calculates the probability that a particle from a source or collision event reaches a point without interaction
  • Typically used for image tallies

[Figure labels: A, Cell 1, Cell 2, μ, Image Plane, B]

$$S(R, E) = \frac{w}{2\pi R^{2}} \times \sum_{i=1}^{N} \frac{\sigma_i(R, E)}{\sigma_T}\, p_i(\mu, E \to E') \exp\left(-\int_{0}^{R} \Sigma_T(s, E')\, ds\right)$$

[Equation label: Ray-cast]

One to two orders of magnitude faster on GPU hardware
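A minimal sketch of a single next-event contribution, assuming the ray from the event to the detector point crosses a list of homogeneous segments so that the attenuation integral becomes a sum. The names are illustrative, and `p_mu` is a hypothetical lumped term standing in for the reaction-weighted angular probability in the slide's formula.

```python
import math

def next_event_contribution(weight, p_mu, distance, segments):
    """Score one next-event estimate at a detector point.

    weight   : particle weight w
    p_mu     : lumped angular probability toward the detector (assumption)
    distance : distance R from the event to the detector point (cm)
    segments : (sigma_t [1/cm], path_length [cm]) pairs crossed by the ray;
               the path lengths are assumed to sum to `distance`
    """
    # Piecewise-constant form of the attenuation integral along the ray.
    optical_depth = sum(sigma_t * length for sigma_t, length in segments)
    return weight * p_mu * math.exp(-optical_depth) / (2.0 * math.pi * distance ** 2)
```

Each image pixel needs one such ray per source or collision event, which is why the ray cast dominates the cost of an image tally.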

SLIDE 7

Traditional Track-Length Estimator

  • The standard Monte Carlo fluence estimator
  • Uses the sampled distance in each cell as the fluence estimator
  • Only contributes to cells through which the particle passes
  • Easy to compute
  • Nothing to accelerate on GPU
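For comparison with the ray-casting estimators, the track-length estimator can be sketched in a few lines, assuming the usual normalization by cell volume and number of histories. The data layout and names are illustrative assumptions.

```python
from collections import defaultdict

def track_length_fluence(tracks, volumes, histories):
    """Track-length fluence estimate per cell.

    tracks    : iterable of (cell_id, weight, track_length_cm) records,
                one per sampled flight segment inside a cell
    volumes   : dict of cell_id -> cell volume (cm^3)
    histories : number of source particles simulated
    """
    sums = defaultdict(float)
    for cell, weight, length in tracks:
        sums[cell] += weight * length
    # Cells that no particle crossed receive a zero estimate.
    return {cell: sums[cell] / (volumes[cell] * histories) for cell in volumes}
```

The loop is a handful of multiply-adds per flight segment, which illustrates why there is little here for a GPU to accelerate.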

[Figure labels: Cell 1, B, Cell 2, Cell 3]

Computing has changed; we need to change our algorithms too!

SLIDE 8

Volumetric-Ray-Casting Estimator

  • For use in place of the traditional track-length estimator on GPU
  • Multiple pseudo-rays are generated at each source and collision event
  • Computationally intensive estimator with lower variance

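[Figure labels: Cell 1, B, Cell 2, Cell 3]

$$\Phi(i, E') = \frac{w \left(1 - e^{-\Sigma_{T,i}(E')\, l_i}\right)}{N\, \Sigma_{T,i}(E')} \exp\left(-\int_{0}^{|r' - r|} \Sigma_T(r + \Omega' s', E')\, ds'\right)$$

[Equation label: Ray-cast]

A neutron dance for a neutron fan. – P.M. Dawn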
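The estimator on this slide can be sketched for a single pseudo-ray as follows, assuming each cell crossed by the ray is homogeneous so the attenuation integral reduces to a running product. Names and data layout are illustrative, and like the slide's formula the sketch does not divide by cell volume.

```python
import math

def expected_track_length(sigma_t, chord):
    """Expected uncollided track length along a chord of the given length."""
    if sigma_t == 0.0:
        return chord  # void: the full chord is travelled
    return (1.0 - math.exp(-sigma_t * chord)) / sigma_t

def score_pseudo_ray(weight, n_rays, segments, tally):
    """Score one pseudo-ray into every cell it crosses.

    weight   : particle weight w
    n_rays   : number of pseudo-rays N generated at this event
    segments : ordered (cell_id, sigma_t [1/cm], chord [cm]) tuples along the ray
    tally    : dict of cell_id -> accumulated score, updated in place
    """
    survival = 1.0  # probability of reaching the current cell uncollided
    for cell, sigma_t, chord in segments:
        score = weight / n_rays * survival * expected_track_length(sigma_t, chord)
        tally[cell] = tally.get(cell, 0.0) + score
        survival *= math.exp(-sigma_t * chord)
    return tally
```

Unlike the track-length estimator, every cell along the ray receives a score, including cells the sampled particle would never have reached, which is where the lower variance comes from.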

SLIDE 9

MonteRay - Accelerating Monte Carlo Transport with GPU Ray Tracing

  • MonteRay – A library for accelerating Monte Carlo tallies with GPU
  • Random walk is maintained on CPU
  • Ray-casting-based tallies are calculated on the GPU

–Next-Event estimator
–Volumetric-Ray-Casting estimator, a new estimator designed for GPUs
–Supports neutron and photon tallies

  • Can be incorporated into new and legacy Monte Carlo codes
  • Uses continuous energy cross-section data
  • Single precision ray casting
  • Single precision attenuation cross-sections
  • Double precision tallies

Reduces cost of accelerating an existing Monte Carlo code with GPUs
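The division of labor described on this slide (random walk on the CPU, ray-cast tallies handed off in bulk) might be sketched as a simple batch buffer. This is a hypothetical illustration only: the class, its methods, and the callback are invented for the example and are not MonteRay's actual API.

```python
class RayBuffer:
    """Hypothetical batch buffer; not MonteRay's actual interface."""

    def __init__(self, capacity, process_batch):
        self.capacity = capacity
        # process_batch stands in for launching a GPU tally kernel on a batch.
        self.process_batch = process_batch
        self.rays = []

    def add(self, ray):
        """Called by the CPU random walk at each source or collision event."""
        self.rays.append(ray)
        if len(self.rays) >= self.capacity:
            self.flush()

    def flush(self):
        """Hand any buffered rays to the tally routine and start a new batch."""
        if self.rays:
            self.process_batch(self.rays)
            self.rays = []
```

A host code would call `add` inside its existing transport loop and `flush` once at the end of a cycle, leaving the rest of the random walk untouched.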

SLIDE 10

MonteRay - Testing

  • Tests use:

–GeForce GTX TitanX GPU with NVIDIA Maxwell architecture
–2 CPUs (Intel Haswell E5-2660 v3 at 2.60 GHz), with 10 cores each

  • MonteRay linked with LANL’s C++ Monte Carlo code MCATK
  • MCATK uses MPI parallelism, building shared ray buffers using MPI-3 shared memory
  • 3-D Cartesian Structured Mesh Geometry
  • 2 tests measured performance of the Next-event estimator
  • 4 tests measured the performance of the Volumetric-ray-casting estimator
  • Volumetric-ray-casting estimator performance on GPU compared to the Track-length estimator performance on the CPU
  • Base performance measured as compared to 8 CPU cores
SLIDE 11

Testing the Next-Event Estimator on GPU Hardware: Two Radiography Tests

SLIDE 12

MonteRay – Medical X-Ray Imaging Simulation

  • 50-keV X-ray beam
  • 0.12 mm spot size
  • Radiograph used Next-Event Estimator
  • Simulation useful for designing a collimator to minimize the scattered contribution
SLIDE 13

MonteRay – Medical X-Ray Imaging Simulation

  • Source and Collided contributions calculated separately
  • Source contribution relatively easy to calculate
  • Collided contribution important for collimator design
  • Collided performance 15-18x

[Figure labels: 14.5x, 15.3x]

SLIDE 14

MonteRay – Industrial Radiography

  • Simulated a physical test object used at Los Alamos’ Dual Axis Radiographic Hydrodynamic Test Facility
  • Used a 4-MeV mono-energetic X-ray beam
  • 100 x 100 image grid (10,000 estimators) to simulate image detector
  • Calculation of scatter component needed to design collimators and experiment, but too computationally expensive

I'm a peeping-tom techie with x-ray eyes – Patrick Lee MacDonald

SLIDE 15

MonteRay – Industrial Radiography

[Figure: GPU Performance vs Number of CPU Cores. Axes: Relative Performance vs Number of CPU Cores / GPU. Series: Source, Collided. Labels: 28.5x, 24.2x]

Collided calculation performance 15-32x!

SLIDE 16

Volumetric-Ray-Casting Estimator on GPU Hardware vs Track-Length Estimator on CPU Hardware

SLIDE 17

Cancer Treatment Simulation

  • 2-MeV photon beam (peak of a 6-MV medical accelerator photon spectrum)
  • 1-cm beam radius

What is the dose to healthy tissue?

[Figure: GPU Performance vs 8 CPU Cores. Labels: Tumor, 2-MeV Photon Beam]

14x performance improvement in healthy tissue

SLIDE 18

Cancer Treatment Simulation

[Figure: GPU Performance vs Number of CPU Cores in Healthy Tissue. Labels: 14.3x, 10.2x]

Performance is 14x vs 8 CPU cores or 10x vs 12 CPU cores

SLIDE 19

Pressurized Water Reactor Assembly Simulation

  • 16x16 Fuel Assembly
  • Performance 7.5x in the control rods, 5x in the fuel, and 4.5x in the coolant

[Figure: GPU Performance vs 8 CPU Cores. Labels: Control Rod, Fuel Pin]

SLIDE 20

Pressurized Water Reactor Assembly Simulation

[Figure: GPU Performance vs Number of CPU Cores. Labels: 7.2x, 5.4x, 6.0x, 4.4x]

Compared to 8 CPU cores, performance is 7.2x in the control rod and 6.0x in the fuel

SLIDE 21

Criticality Accident Simulation

  • Critical uranium sphere in the corner of a concrete room
  • Concrete floor, walls, ceiling, and 4 concrete pillars

[Figure: GPU Performance vs 8 CPU Cores. Label: Uranium Sphere]

Performance increase of 14-16x in the center of the room

SLIDE 22

Criticality Accident Simulation – Smoother Fluence Estimate

[Figure panels: Track-Length Estimator; Volumetric-Ray-Casting Estimator]

SLIDE 23

Criticality Accident Simulation

[Figure: GPU Performance vs Number of CPU Cores. Labels: 15x, 10.5x]

Things are going great, and they’re only getting better – Patrick Lee MacDonald

SLIDE 24

Reflected Godiva Criticality Experiment Simulation

  • U-235 sphere reflected by water
  • Performance Improvement

–2.5x in the core
–1.0x in the water

[Figure: GPU Performance vs 8 CPU Cores]

SLIDE 25

Reflected Godiva Criticality Experiment Simulation

  • Variance of the Volumetric-Ray-Casting estimator approaches that of the Track-Length estimator in strongly scattering material.

[Figure: Variance Ratio vs Num. Collisions. Axes: Variance Ratio ($\sigma_{TL}^{2} / \sigma_{VRC}^{2}$) vs Number of Samples per Collision (N)]

[Figure: GPU Performance vs. Num. CPU Cores. Labels: 2.2x, 2.2x]

Performance is limited by the estimator variance, not the GPU speed

SLIDE 26

Conclusions

  • MonteRay provides a low-cost method of providing GPU-accelerated Monte Carlo particle transport

–Can be incorporated into legacy codes at low cost
–Works with standard variance reduction methods

  • Performance improvements of MonteRay are significant:

–Up to 32 times for the Next-event estimator as compared to 8 CPU cores
–Up to 14 times for the Volumetric-ray-casting estimator as compared to the Track-Length estimator on 8 CPU cores

MonteRay provides a method of breaking through the barriers of limited resources and limited performance

SLIDE 27

Questions?

Jeremy Sweezy jsweezy@lanl.gov

SLIDE 28

Extra

SLIDE 29

Uncertainty - Pressurized Water Reactor Assembly Simulation

Volumetric-Ray-Casting Estimator: 600 sec., 8 CPU Cores and 1 GPU; 93 cycles, 40000 Particles/Cycle; 8 rays/collision

Track-Length Estimator: 600 sec., 8 CPU Cores; 124 cycles, 40000 Particles/Cycle
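The run statistics on this slide allow a quick throughput comparison. The sketch below assumes the number of histories is simply cycles times particles per cycle, and that the run with the GPU and 8 rays per collision is the Volumetric-Ray-Casting one; the names are illustrative.

```python
def histories_run(cycles, particles_per_cycle):
    """Total particle histories completed in a run."""
    return cycles * particles_per_cycle

# Both runs lasted 600 seconds.
vrc_histories = histories_run(93, 40_000)   # 8 CPU cores + 1 GPU
tl_histories = histories_run(124, 40_000)   # 8 CPU cores only

# Fraction of the track-length run's histories that the VRC run completed.
history_ratio = vrc_histories / tl_histories
```

Under this reading, the GPU-assisted run completes three quarters as many histories in the same wall-clock time, so any gain in figure of merit has to come from lower variance per history rather than from raw history throughput.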