SLIDE 1

PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS

Fanny Nina-Paravecino, Leming Yu, Qianqian Fang*, David Kaeli

Department of Electrical and Computer Engineering
Department of Bioengineering*

SLIDE 2

SIMULATION OF PHOTON TRANSPORT INSIDE HUMAN BRAIN

Photon migration in 3D turbid media
Prediction of experimental outcomes
Simulation is a time-consuming task

2 GTC April 4-7, 2016 | Silicon Valley

SLIDE 3

MCX.SPACE

SLIDE 4

MCX AROUND THE WORLD

Over 30,000 unique visits from 148 countries
Over 12,000 cumulative downloads worldwide
Over 900 registered users from more than 350 institutions and companies around the world

SLIDE 5

MCX STATISTICS

SLIDE 6

OUTLINE

Portable Performance Monte Carlo Extreme (MCX)
  MCX in CUDA
  Persistent Threads in CUDA (MCX)
  Portable Performance MCX
  Other enhancements
  Results
MCX on multiple GPUs
  Performance Model
  Partitioning Schemes
  Performance Results

SLIDE 7

PORTABLE PERFORMANCE MCX

Photon initialization

3D voxelated media

SLIDE 8

MONTE CARLO EXTREME (MCX)

Estimates the 3D light (fluence) distribution by simulating a large number of independent photons
Most accurate algorithm for a wide range of optical properties, including low-scattering/voids, high absorption, and short source-detector separation
Computationally intensive, so a great target for GPU acceleration
Widely adopted for bio-optical imaging applications:
  Optical brain functional imaging
  Fluorescence imaging of small animals for drug development
  Gold standard for validating new optical imaging instrumentation designs and algorithms

SLIDE 9

MCX APPLICATIONS

Imaging of bone marrow in the tibia
Imaging of a complex mouse model using Monte Carlo simulations
Simulation of photons inside human brain

SLIDE 10

MCX IN CUDA [1]

[Flowchart: MCX simulation flow on CPU and GPU]

CPU: Start; seed GPU RNG with CPU RNG; retrieve solution; (optional) repetition complete? (loop of repetitions); normalize & save solution; end of simulation

GPU, per thread (Thread i, Thread i+1, ...):
  Launch a new photon
  Compute a new scattering length
  Propagate photon until it crosses a voxel boundary
  Compute attenuation based on absorption
  Accumulate photon energy loss to the volume (global memory)
  Compute a new scattering direction vector
  Decision points (y/n): End of scattering path? Exceeding time gate? Total photon # reached?
  Terminate thread

[1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics Express 17.22 (2009): 20178-20190.
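
The per-thread steps listed for this flowchart can be sketched on the CPU as a plain Python loop. This is a simplified illustration under stated assumptions, not the MCX kernel: the medium is a homogeneous cube of unit voxels, scattering is isotropic, the time gate is approximated by a maximum path length, and all names and parameter values (simulate, boundary_distance, mua, mus, max_path) are hypothetical.

```python
import math
import random

def boundary_distance(pos, direction):
    """Distance along direction to the nearest face of the unit voxel holding pos."""
    best = math.inf
    for p, d in zip(pos, direction):
        if d > 0:
            best = min(best, (math.floor(p) + 1 - p) / d)
        elif d < 0:
            best = min(best, (p - math.floor(p)) / -d)
    return best

def random_direction(rng):
    """Isotropic unit vector (assumption; tissue scattering is anisotropic)."""
    cos_t = 2.0 * rng.random() - 1.0
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    phi = 2.0 * math.pi * rng.random()
    return [sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t]

def simulate(n_photons, size=20, mua=0.05, mus=1.0, max_path=100.0, seed=1):
    rng = random.Random(seed)
    absorbed = {}      # voxel index -> accumulated energy loss
    escaped = 0.0      # weight left when photons terminate
    for _ in range(n_photons):                      # launch a new photon
        pos = [size / 2 + 0.5] * 3                  # start at a voxel center
        direction = [0.0, 0.0, 1.0]
        weight, path, alive = 1.0, 0.0, True
        while alive:
            # compute a new scattering length (exponential distribution)
            s_left = -math.log(1.0 - rng.random()) / mus
            while s_left > 0.0:
                voxel = tuple(math.floor(p) for p in pos)
                # terminate if outside the volume or past the (path-length) gate
                if min(voxel) < 0 or max(voxel) >= size or path > max_path:
                    alive = False
                    break
                # propagate until the voxel boundary; the small nudge steps across it
                step = min(s_left, boundary_distance(pos, direction) + 1e-9)
                # attenuation based on absorption, accumulated to the volume
                loss = weight * (1.0 - math.exp(-mua * step))
                absorbed[voxel] = absorbed.get(voxel, 0.0) + loss
                weight -= loss
                pos = [p + d * step for p, d in zip(pos, direction)]
                s_left -= step
                path += step
            if alive:                               # end of scattering path
                direction = random_direction(rng)   # new scattering direction
        escaped += weight
    return absorbed, escaped

absorbed, escaped = simulate(200)
print(sum(absorbed.values()) + escaped)  # total weight is conserved (close to 200)
```

Because every photon is independent, the outer loop is what a GPU implementation would spread across threads.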

SLIDE 11

PERSISTENT THREADS (PT) IN MCX

PT kernels alter the notion of a virtual thread lifetime, treating those threads as physical hardware threads
PT kernels provide a view that threads are active for the entire duration of the kernel
We schedule only as many threads as the GPU SMs can concurrently run
The threads remain active until the end of kernel execution

Worker thread:
  Thread initializes and enters the thread loop
  Thread loops continuously
  Thread exits the loop, cleans up, and shuts down
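
The worker-thread lifecycle can be illustrated with ordinary CPU threads. The sketch below is an analogy only, not CUDA and not the MCX scheduler: a fixed pool of workers is created once, each worker loops until the shared photon budget is exhausted, then exits. The function name run_persistent and the shared-counter scheme are assumptions made for the example.

```python
import threading

def run_persistent(total_photons, n_workers):
    """Run a fixed pool of long-lived workers that share one photon budget."""
    lock = threading.Lock()
    remaining = [total_photons]      # photons not yet claimed by any worker
    done = [0] * n_workers           # photons processed by each worker

    def worker(wid):
        # thread initializes and enters the thread loop
        while True:
            with lock:
                if remaining[0] == 0:
                    return           # exit loop, clean up, shut down
                remaining[0] -= 1
            done[wid] += 1           # stand-in for simulating one photon

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

counts = run_persistent(total_photons=10000, n_workers=8)
print(sum(counts))  # 10000
```

The point of the pattern is that the number of workers is set by what the hardware can run concurrently, not by the amount of work.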

SLIDE 12

PORTABLE PERFORMANCE MCX

autoBlock = MaxThreadsPerMP / MaxBlocksPerMP
autoThread = autoBlock * MaxBlocksPerMP * MP

Feature             Fermi  Kepler  Maxwell
MaxThreadBlocks/MP      8      16       32
MaxThreads/MP        1536    2048     2048
MP                     16      14       22
CUDA cores/MP          32     192      128
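
As a worked example, the two formulas can be evaluated for the three architectures in the table. The sketch assumes integer division and uses 2048 threads per MP for Maxwell, which is the CUDA limit for compute capability 5.x; the function name autopilot is hypothetical.

```python
def autopilot(max_threads_per_mp, max_blocks_per_mp, mp_count):
    """Return (autoBlock, autoThread) from the per-multiprocessor limits."""
    auto_block = max_threads_per_mp // max_blocks_per_mp
    auto_thread = auto_block * max_blocks_per_mp * mp_count
    return auto_block, auto_thread

# (MaxThreads/MP, MaxThreadBlocks/MP, MP) taken from the table
GPUS = {
    "Fermi": (1536, 8, 16),
    "Kepler": (2048, 16, 14),
    "Maxwell": (2048, 32, 22),
}

for name, spec in GPUS.items():
    print(name, autopilot(*spec))
# Fermi (192, 24576)
# Kepler (128, 28672)
# Maxwell (64, 45056)
```

In each case autoThread equals MaxThreads/MP times MP, so the launch fills every multiprocessor to its resident-thread limit.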

SLIDE 13

OTHER ENHANCEMENTS

Autopilot improvement
Developed customized operations such as:
  mcx_nextafter
Reduced the use of shared memory
  Enables more threads to be launched
Avoided branch divergence by using indexes

SLIDE 14

IMPROVEMENT PER ENHANCEMENT

[Stacked bar chart: share (0% to 100%) of the improvement contributed by each enhancement on GK110 and 980Ti]
Enhancements: Autopilot; Reducing Shared Memory; Increasing Local Memory / Hide Latency; Avoid branch divergence / Customized function
Overall performance labels: 2.4x, 1.4x

SLIDE 15

PERFORMANCE MCX - RESULTS

Arch     GPU      Photons/ms (Baseline)  Photons/ms  Speedup
Fermi    GTX 590                2044.99     2901.92     1.4x
Kepler   GT 730                  529.89     1263.74     2.4x
Kepler   GK110                  2383.22     5238.34     2.2x
Maxwell  980Ti                 12268.98    19157.09     1.4x

Baseline: MCX version Sep 12, 2015

[Bar chart: performance (photons/ms) for GTX 590, GT 730, GK110, and 980Ti]

SLIDE 16

MCX AS A BENCHMARK

[Bar chart: size (KB) of MCX_core.ptx and MCX_core.sass for Baseline, After Improvement, and After Improvement with Hack; CUDA 7.5, Maxwell compute capability 5.2 (980Ti)]
Data labels, in extraction order: 0.14, 0.14, 0.14, 1799.76, 1550.63, 1368.96
Annotations: +10x, 10x

Performance is changing dramatically
  Same input
  Same sequence of code

SLIDE 17

MCX ON MULTIPLE GPUS

SLIDE 18

MOTIVATION

Monte Carlo eXtreme (MCX) simulation in OpenCL
Distribute workloads among different devices
  NVIDIA GPUs / AMD GPUs / CPUs

[Diagram labels: MCXCL; thread, thread, thread; GPU1, GPU2, GPU3; Platform; Partitioning Scheme]

SLIDE 19

METHODOLOGY

Predict the kernel execution time
  Evaluate the kernel runtime
  Develop the performance model
Partitioning schemes
  Core-based: the number of parallel compute units
  Throughput: application throughput (photons/ms)
  Iterative: throughput-based iterative partitioning
  Fminimax: nonlinear programming solution for the minimax problem
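
The minimax idea behind the Fminimax scheme can be illustrated without a nonlinear programming solver. The sketch below is a substitute technique, not the authors' method: it assumes each device's run time is linear in its photon count (time = a * x + c with a > 0) and uses bisection on the finish time T to find the split that minimizes the slowest device's time. The function name minimax_partition and the coefficients are hypothetical.

```python
def minimax_partition(models, total, iters=200):
    """Split `total` photons so the largest per-device time a*x + c is minimal.

    models: list of (a, c) pairs, a > 0. A device whose overhead c exceeds the
    optimal finish time receives no photons and is treated as unused.
    """
    def loads(t):
        # photons each device can finish within time t
        return [max(0.0, (t - c) / a) for a, c in models]

    lo = min(c for _, c in models)                 # no device can do work by then
    hi = max(a * total + c for a, c in models)     # one device alone finishes by then
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(loads(mid)) >= total:
            hi = mid
        else:
            lo = mid
    return hi, loads(hi)

# two hypothetical devices: 1 ms and 2 ms per photon, no fixed overhead
finish, split = minimax_partition([(1.0, 0.0), (2.0, 0.0)], total=30)
print(round(finish, 6), [round(x, 6) for x in split])  # 20.0 [20.0, 10.0]
```

At the optimum every device that receives work finishes at the same time, which is why equalizing finish times solves the minimax problem for this model.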

SLIDE 20

PERFORMANCE MODEL

Measure the kernel execution time on various devices
Simulate 1M to 25M photon migrations

SLIDE 21

PERFORMANCE MODEL

Given n devices: D1, D2, ..., Dn
Given linear performance for each device
Given the performance for 1M and 2M photons for each device
We can obtain the linear equation for each device as follows:

Device 1: y_1 = a_1 x_1 + c_1
Device 2: y_2 = a_2 x_2 + c_2
...
Device n: y_n = a_n x_n + c_n
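
A line through two measurements fixes both coefficients of a device's model. The sketch below assumes x is the photon count and y the kernel execution time in ms; the timings are invented for illustration and the function names are hypothetical.

```python
def fit_line(x1, y1, x2, y2):
    """Return (a, c) of y = a*x + c through the points (x1, y1) and (x2, y2)."""
    a = (y2 - y1) / (x2 - x1)
    c = y1 - a * x1
    return a, c

def predict(a, c, x):
    """Predicted execution time for x photons."""
    return a * x + c

# hypothetical measurements: 520 ms for 1M photons, 1020 ms for 2M photons
a, c = fit_line(1e6, 520.0, 2e6, 1020.0)
print(a, c)                  # slope about 0.0005 ms/photon, intercept about 20 ms
print(predict(a, c, 25e6))   # about 12520 ms for 25M photons
```

The intercept c captures fixed overhead such as kernel launch and data transfer, which is why two points are needed rather than one.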

SLIDE 22

PARTITIONING SCHEME ELABORATION

Core-based initialization
Iteratively evaluate throughput-based partitioning
Stop when achieving the max throughput

Iterative approximation, share of device i:
  Core-based: ComputeUnits_i / Σ_j ComputeUnits_j
  Throughput-based: Throughput_i / Σ_j Throughput_j
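
The three steps can be sketched as a short loop: start from a core-based split, re-split in proportion to each device's measured throughput, and stop when the overall throughput no longer improves. This is one possible reading of the scheme, not the authors' code. Device timing is simulated with a hypothetical linear model (time = a * photons + c), and all function names are assumptions.

```python
def proportional_split(weights, total):
    """Split `total` photons in proportion to `weights` (integer counts)."""
    s = sum(weights)
    parts = [round(total * w / s) for w in weights]
    # give any rounding remainder to the device with the largest weight
    parts[weights.index(max(weights))] += total - sum(parts)
    return parts

def run_time(model, photons):
    a, c = model
    return a * photons + c if photons > 0 else 0.0

def overall_throughput(parts, models):
    # devices run concurrently, so the slowest one sets the total time
    return sum(parts) / max(run_time(m, p) for m, p in zip(models, parts))

def iterative_partition(compute_units, models, total, max_rounds=10):
    parts = proportional_split(compute_units, total)       # core-based start
    best, best_tp = parts, overall_throughput(parts, models)
    for _ in range(max_rounds):
        measured = [p / run_time(m, p) if p > 0 else 0.0
                    for m, p in zip(models, parts)]         # per-device throughput
        parts = proportional_split(measured, total)
        tp = overall_throughput(parts, models)
        if tp <= best_tp:                                   # no further gain
            break
        best, best_tp = parts, tp
    return best, best_tp

# two hypothetical devices with equal compute units but a 3x speed gap
parts, tp = iterative_partition([10, 10], [(1.0, 0.0), (3.0, 0.0)], total=1200)
print(parts)  # [900, 300]
```

With equal compute units the core-based start gives each device 600 photons, and one throughput-based round moves the split to the balanced 900 and 300.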

SLIDE 23

PERFORMANCE RESULTS

[Bar chart: throughput (photons/ms) at 10M and 100M photons for the Core-based, Throughput, Iterative, and Fminimax schemes]

GTX 980 Ti + GTX 590 + GT 730 (max throughput 30323 photons/ms)

Throughput utilization
Scheme      10M     100M
Core-based  35.01%  41.65%
Throughput  59.31%  93.42%
Iterative   68.85%  93.77%
Fminimax    68.85%  93.77%

K40c + K20c (max throughput 9688 photons/ms)

Throughput utilization
Scheme      10M     100M
Core-based  85.31%  97.56%
Throughput  80.39%  87.89%
Iterative   80.39%  87.89%
Fminimax    80.39%  87.89%

SLIDE 24

PERFORMANCE RESULTS

[Bar chart: throughput (photons/ms) at 10M and 100M photons for the Core-based, Throughput, Iterative, and Fminimax schemes]

AMD 7970M + Intel i7-3740QM (max throughput 4529 photons/ms)

Throughput utilization
Scheme      10M     100M
Core-based  19.32%  18.69%
Throughput  18.81%  27.14%
Iterative   18.78%  27.91%
Fminimax    18.78%  27.91%

AMD 7970 + Fiji + Intel i7-4770 (max throughput 19176 photons/ms)

Throughput utilization
Scheme      10M     100M
Core-based  15.10%  19.06%
Throughput  16.38%  21.10%
Iterative   16.38%  21.10%
Fminimax    16.38%  21.10%

SLIDE 25

SUMMARY

We have improved the performance of MCX across a range of NVIDIA GPU architectures
We have shown how to exploit persistent-thread kernels to automatically tune the MCX kernel
We developed an iterative scheme to search for the best partition to run MCX on multiple accelerators
We obtained a 24% and 44% throughput utilization improvement (Iterative vs. Core-based) for 10M and 100M photon simulations, respectively

SLIDE 26

FUTURE WORK

Instrumentation of MCX
  Leverage SASSI to instrument MCX and better characterize the behavior of a kernel to guide auto-tuning
MCX on multiple GPUs
  Evaluate our partitioning optimization for multiple devices

SLIDE 27

MCX CHALLENGE

Interested in improving the performance of MCX by over 40% compared to the current version?
A monetary reward will be announced soon.
Stay tuned to mcx.space

SLIDE 28

ACKNOWLEDGEMENT

This project is funded by the NIH/NIGMS under grant R01-GM114365
We would like to acknowledge NVIDIA for their support of this work through the NVIDIA Research Center program
