
slide-1
SLIDE 1

PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS

Fanny Nina-Paravecino, Leming Yu, Qianqian Fang*, David Kaeli

Department of Electrical and Computer Engineering
Department of Bioengineering*

slide-2
SLIDE 2

SIMULATION OF PHOTON TRANSPORT INSIDE HUMAN BRAIN

  • Photon migration in 3D turbid media
  • Prediction of experimental outcomes
  • Simulation is a time-consuming task

2 GTC April 4-7, 2016 | Silicon Valley

slide-3
SLIDE 3


MCX.SPACE

slide-4
SLIDE 4

MCX AROUND THE WORLD

  • Over 30,000 unique visits from 148 countries
  • Cumulative downloads exceed 12,000 worldwide
  • Over 900 registered users from more than 350 institutions and companies around the world


slide-5
SLIDE 5

MCX STATISTICS


slide-6
SLIDE 6

OUTLINE

  • Portable Performance Monte Carlo Extreme (MCX)
      - MCX in CUDA
      - Persistent Threads in CUDA (MCX)
      - Portable Performance MCX
      - Other enhancements
      - Results
  • MCX on multiple GPUs
      - Performance Model
      - Partitioning Schemes
      - Performance Results


slide-7
SLIDE 7

PORTABLE PERFORMANCE MCX

Photons initialization

3D voxelated media


slide-8
SLIDE 8

MONTE CARLO EXTREME (MCX)

  • Estimates the 3D light (fluence) distribution by simulating a large number of independent photons
  • Most accurate algorithm for a wide range of optical properties, including low-scattering/voids, high absorption, and short source-detector separation
  • Computationally intensive, so a great target for GPU acceleration
  • Widely adopted for bio-optical imaging applications:
      - Optical brain functional imaging
      - Fluorescence imaging of small animals for drug development
      - Gold standard for validating new optical imaging instrumentation designs and algorithms


slide-9
SLIDE 9

MCX APPLICATIONS

  • Imaging of bone marrow in the tibia
  • Imaging of a complex mouse model using Monte Carlo simulations
  • Simulation of photons inside human brain


slide-10
SLIDE 10

MCX IN CUDA [1]

[Flowchart: division of work between CPU and GPU]

CPU:
  • Start
  • Seed GPU RNG with CPU RNG
  • Retrieve solution
  • (optional) Repetition complete? (loop of repetitions)
  • Normalize & save solution
  • End of simulation

GPU (Thread i, Thread i+1, …; Global Memory):
  • Launch a new photon
  • Compute a new scattering length
  • Propagate photon until it crosses a voxel boundary
  • Compute attenuation based on absorption
  • Accumulate photon energy loss to the volume
  • Compute a new scattering direction vector
  • Terminate thread

Decision points (y/n): End of scattering path? Exceeding time gate? Total photon # reached?

[1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics Express 17.22 (2009): 20178-20190.
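The per-thread loop in the flowchart can be sketched on the CPU as follows. This is a deliberately simplified illustration and not MCX code: it assumes a homogeneous cubic medium with unit-sized voxels, isotropic scattering, no time gate, and no boundary reflection. The names `simulate`, `random_direction`, `mu_a`, `mu_s`, and `grid` are hypothetical.

```python
import math
import random

def random_direction(rng):
    """Isotropic unit vector (an assumed phase function for this sketch)."""
    cos_t = 2.0 * rng.random() - 1.0
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    phi = 2.0 * math.pi * rng.random()
    return [sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t]

def simulate(n_photons, grid=16, mu_a=0.05, mu_s=1.0, seed=1):
    """Toy voxel Monte Carlo; returns (absorbed energy per voxel, escaped energy)."""
    rng = random.Random(seed)  # stands in for the GPU RNG seeded from the CPU
    absorbed = [[[0.0] * grid for _ in range(grid)] for _ in range(grid)]
    escaped = 0.0
    for _ in range(n_photons):
        # Launch a new photon in the middle of the z = 0 face, travelling along +z.
        voxel = [grid // 2, grid // 2, 0]
        pos = [voxel[0] + 0.5, voxel[1] + 0.5, 0.0]
        direction = [0.0, 0.0, 1.0]
        weight = 1.0
        alive = True
        while alive:
            # Compute a new scattering length (1 voxel = 1 length unit).
            remaining = -math.log(1.0 - rng.random()) / mu_s
            while remaining > 0.0:
                # Propagate until the photon crosses a voxel face or the path ends.
                step, axis = remaining, -1
                for k in range(3):
                    d = direction[k]
                    if d != 0.0:
                        face = voxel[k] + 1 if d > 0.0 else voxel[k]
                        dist = max(0.0, (face - pos[k]) / d)
                        if dist < step:
                            step, axis = dist, k
                # Compute attenuation based on absorption and accumulate the loss.
                loss = weight * (1.0 - math.exp(-mu_a * step))
                absorbed[voxel[0]][voxel[1]][voxel[2]] += loss
                weight -= loss
                pos = [p + d * step for p, d in zip(pos, direction)]
                remaining -= step
                if axis >= 0:
                    # Boundary crossing: snap to the face, move to the neighbor voxel.
                    forward = direction[axis] > 0.0
                    voxel[axis] += 1 if forward else -1
                    pos[axis] = float(voxel[axis] if forward else voxel[axis] + 1)
                    if not 0 <= voxel[axis] < grid:
                        escaped += weight  # photon left the volume
                        alive = False
                        break
            else:
                # End of scattering path: compute a new scattering direction vector.
                direction = random_direction(rng)
    return absorbed, escaped
```

Because every photon is independent, the outer loop is the part that MCX distributes across GPU threads. In this toy version the absorbed energy plus the escaped energy adds up to the number of launched photons, which is a convenient sanity check.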

slide-11
SLIDE 11

PERSISTENT THREADS (PT) IN MCX

  • PT kernels alter the notion of a virtual thread lifetime, treating those threads as physical hardware threads
  • PT kernels provide a view that threads are active for the entire duration of the kernel
      - We schedule only as many threads as the GPU SMs can concurrently run
      - The threads remain active until the end of kernel execution

Worker thread:
  • Thread initializes and enters the thread loop
  • Thread loops continuously
  • Thread exits the loop, cleans up, and shuts down
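The worker-thread lifecycle can be mimicked on the CPU. The sketch below is an analogy in Python, not CUDA and not MCX code: a fixed pool of workers (standing in for the threads the SMs can run concurrently) keeps claiming photons from a shared counter until the budget is exhausted. The names `run_persistent`, `n_workers`, and `total_photons` are hypothetical.

```python
import threading

def run_persistent(total_photons, n_workers):
    """Launch a fixed number of workers that loop until all photons are claimed."""
    lock = threading.Lock()
    next_photon = 0            # shared counter of photons handed out so far
    done = [0] * n_workers     # photons completed by each worker

    def worker(wid):
        nonlocal next_photon
        # Thread initializes and enters the thread loop.
        while True:
            with lock:         # atomically claim the next photon, if any remain
                if next_photon >= total_photons:
                    break      # total photon count reached: leave the loop
                next_photon += 1
            done[wid] += 1     # placeholder for simulating one photon
        # Thread exits the loop, cleans up, and shuts down.

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

counts = run_persistent(total_photons=10_000, n_workers=8)
print(sum(counts))  # total photons simulated across all workers
```

The number of workers is fixed by the hardware rather than by the problem size, so the same launch configuration serves any photon count.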


slide-12
SLIDE 12

PORTABLE PERFORMANCE MCX

autoBlock = MaxThreadsPerMP / MaxBlocksPerMP
autoThread = autoBlock * MaxBlocksPerMP * MP

Feature              Fermi   Kepler   Maxwell
MaxThreadBlocks/MP   8       16       32
MaxThreads/MP        1536    2048     2048
MP                   16      14       22
CUDA cores/MP        32      192      128
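As a worked example, the two formulas can be evaluated for the three architectures in the table. This is a small sketch; the dictionary layout and the function name `auto_config` are hypothetical, and the Maxwell threads-per-MP limit is taken as 2048, the documented value for compute capability 5.x.

```python
# Per-architecture limits from the feature table (Maxwell threads/MP taken as 2048).
ARCH = {
    "Fermi":   {"max_blocks_per_mp": 8,  "max_threads_per_mp": 1536, "mp": 16},
    "Kepler":  {"max_blocks_per_mp": 16, "max_threads_per_mp": 2048, "mp": 14},
    "Maxwell": {"max_blocks_per_mp": 32, "max_threads_per_mp": 2048, "mp": 22},
}

def auto_config(max_threads_per_mp, max_blocks_per_mp, mp):
    """Return (autoBlock, autoThread) as defined on the slide."""
    auto_block = max_threads_per_mp // max_blocks_per_mp
    auto_thread = auto_block * max_blocks_per_mp * mp
    return auto_block, auto_thread

for name, limits in ARCH.items():
    block, threads = auto_config(**limits)
    print(f"{name}: autoBlock={block}, autoThread={threads}")
# Fermi: autoBlock=192, autoThread=24576
# Kepler: autoBlock=128, autoThread=28672
# Maxwell: autoBlock=64, autoThread=45056
```

With these values every multiprocessor is filled to its resident-thread limit, which matches the persistent-thread idea of launching only as many threads as the hardware can run at once.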


slide-13
SLIDE 13

OTHER ENHANCEMENTS

  • Autopilot improvement
  • Developed customized operations such as:
      - mcx_nextafter
  • Reduced the use of shared memory
      - Enables more threads to be launched
  • Avoided branch divergence by using indexes
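One common reading of "avoiding branch divergence by using indexes" is to replace an if/else chain with a computed index into a small lookup table, so that all threads in a warp follow the same instruction path. The sketch below illustrates that idea in Python for picking the nearest voxel face. It is not taken from MCX; `nearest_axis_branchy`, `nearest_axis_indexed`, and `AXIS_LUT` are hypothetical names.

```python
def nearest_axis_branchy(dist):
    """Pick the axis (0=x, 1=y, 2=z) with the smallest distance using branches."""
    if dist[0] <= dist[1] and dist[0] <= dist[2]:
        return 0
    elif dist[1] <= dist[2]:
        return 1
    else:
        return 2

# Lookup table indexed by two comparison bits: bit 0 = (y < x), bit 1 = (z < min(x, y)).
AXIS_LUT = (0, 1, 2, 2)

def nearest_axis_indexed(dist):
    """Same result, computed from comparison bits and a table lookup."""
    b0 = int(dist[1] < dist[0])    # 0 keeps x as the candidate, 1 switches to y
    b1 = int(dist[2] < dist[b0])   # 1 if z beats the current candidate
    return AXIS_LUT[b0 + 2 * b1]

print(nearest_axis_indexed((0.7, 0.2, 0.5)))  # 1
```

On a GPU the comparisons compile to predicate or select instructions, so threads that would have taken different branches execute the same code with different indexes.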


slide-14
SLIDE 14

IMPROVEMENT PER ENHANCEMENT


[Figure: share of the improvement (0% to 100%) contributed by each enhancement on GK110 and 980Ti. Series: Autopilot; Reducing Shared Memory; Increasing Local Memory / Hide Latency; Avoid branch divergence / Customized function]

Overall Performance: 2.4x, 1.4x

slide-15
SLIDE 15

PERFORMANCE MCX - RESULTS

Arch      GPU       Photons/ms (Baseline)   Photons/ms (Improved)   Speedup
Fermi     GTX 590   2044.99                 2901.92                 1.4x
Kepler    GT 730    529.89                  1263.74                 2.4x
Kepler    GK110     2383.22                 5238.34                 2.2x
Maxwell   980Ti     12268.98                19157.09                1.4x

[Figure: Performance (photons/ms) by GPU: GTX 590, GT 730, GK110, 980Ti]


  • Baseline: MCX version Sep 12, 2015

slide-16
SLIDE 16

MCX AS A BENCHMARK


[Figure: Size (KB), CUDA 7.5 - Maxwell Compute 5.2 (980Ti). Series: MCX_core.ptx, MCX_core.sass. Categories: Baseline, After Improvement, After Improvement with Hack. Data labels: 0.14, 0.14, 0.14, 1799.76, 1550.63, 1368.96. Annotations: +10x, 10x]

Performance is changing dramatically

  • Same input
  • Same code sequence
slide-17
SLIDE 17

MCX ON MULTIPLE GPUS

slide-18
SLIDE 18

MOTIVATION

  • Monte Carlo eXtreme (MCX) simulation in OpenCL
  • Distribute workloads among different devices
      - NVIDIA GPUs / AMD GPUs / CPUs

[Figure: MCXCL threads distributed across GPU1, GPU2, and GPU3. Labels: Platform, Partitioning Scheme]


slide-19
SLIDE 19

METHODOLOGY

  • Predict the kernel execution time
      - Evaluate the kernel runtime
      - Develop the performance model
  • Partitioning schemes
      - Core-based: the number of parallel compute units
      - Throughput: application throughput (photons/ms)
      - Iterative: throughput-based iterative partitioning
      - Fminimax: nonlinear programming solution for the minimax problem
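To make the minimax objective concrete: if the run time of device i is assumed to be linear in its photon share, a_i * x_i + c_i, then minimizing the slowest device's time subject to the shares adding up to N has a closed form in which all devices that receive work finish at the same moment. The sketch below uses that closed form as a stand-in. It is not the nonlinear solver the slide refers to, and the function name `minimax_partition` and the coefficients are made up for illustration.

```python
def minimax_partition(models, total):
    """Split `total` photons so the slowest device finishes as early as possible.

    models: list of (a, c); run time is a * x + c for x > 0 photons, 0 when idle.
    Assumes a > 0 for every device and total > 0.
    """
    active = list(range(len(models)))
    while True:
        # Common finish time T when every active device ends simultaneously:
        # sum((T - c_i) / a_i) = total
        inv = sum(1.0 / models[i][0] for i in active)
        off = sum(models[i][1] / models[i][0] for i in active)
        finish = (total + off) / inv
        keep = [i for i in active if models[i][1] < finish]
        if len(keep) == len(active):
            break
        active = keep  # drop devices whose fixed cost alone exceeds the finish time
    shares = [0.0] * len(models)
    for i in active:
        a, c = models[i]
        shares[i] = (finish - c) / a
    return shares, finish

# Hypothetical devices: (ms per photon, fixed overhead in ms).
models = [(5e-5, 20.0), (3.3e-4, 10.0), (8e-4, 15.0)]
shares, finish = minimax_partition(models, 10_000_000)
print([round(s) for s in shares], round(finish, 2))
```

A general-purpose minimax solver is still useful when the run-time model is not linear, since the equal-finish-time shortcut no longer applies.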


slide-20
SLIDE 20

PERFORMANCE MODEL

  • Measure the kernel execution time on various devices
  • Simulate 1M to 25M photon migrations


slide-21
SLIDE 21

PERFORMANCE MODEL

  • Given n devices: D1, D2, …, Dn
  • Given linear performance for each device
  • Given the performance for 1M and 2M for each device
  • We can obtain the linear equation for each device as follows:

Device 1: $y_1 = a_1 x_1 + c_1$
Device 2: $y_2 = a_2 x_2 + c_2$
…
Device n: $y_n = a_n x_n + c_n$
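Two measurements per device are enough to determine the slope and intercept of each line. The sketch below shows the arithmetic with invented timings (60 ms for 1M photons, 110 ms for 2M photons); the function names `fit_line` and `predict` are hypothetical, and x is expressed in millions of photons.

```python
def fit_line(x1, y1, x2, y2):
    """Return (a, c) of the line y = a * x + c through two measured points."""
    a = (y2 - y1) / (x2 - x1)
    c = y1 - a * x1
    return a, c

def predict(model, x):
    """Predicted kernel time for x million photons."""
    a, c = model
    return a * x + c

# Invented measurements for one device: 1M photons in 60 ms, 2M photons in 110 ms.
model = fit_line(1.0, 60.0, 2.0, 110.0)
print(model)                 # (50.0, 10.0)
print(predict(model, 25.0))  # 1260.0
```

Here the slope is the marginal cost per million photons and the intercept is the fixed launch overhead, so the model extrapolates to the 25M case without running it.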


slide-22
SLIDE 22

PARTITIONING SCHEME ELABORATION

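Iterative Approximation

  • Core-based initialization
  • Iteratively evaluate throughput-based partitioning
  • Stop when achieving the max throughput

[Equations not recoverable from the extraction: core-based weights expressed in terms of ComputeUnits_i, throughput-based weights in terms of Throughput_i]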
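The three steps can be sketched as a short loop. This is one possible reading under stated assumptions, not the authors' implementation: shares are taken to be proportional to compute units at initialization and to measured per-device throughput afterwards, and a hypothetical linear model (`run_time`) stands in for timing a real kernel. All names and numbers are illustrative.

```python
def run_time(model, photons):
    """Hypothetical stand-in for timing a kernel: linear model a * x + c (ms)."""
    a, c = model
    return a * photons + c

def overall_throughput(models, shares):
    """Photons per ms when all devices run concurrently (the slowest dominates)."""
    return sum(shares) / max(run_time(m, s) for m, s in zip(models, shares))

def iterative_partition(models, compute_units, total, max_iters=50):
    # Core-based initialization: shares proportional to compute units.
    shares = [total * u / sum(compute_units) for u in compute_units]
    best_shares, best = shares, overall_throughput(models, shares)
    for _ in range(max_iters):
        # Throughput-based repartitioning from the latest per-device throughput.
        rates = [s / run_time(m, s) for m, s in zip(models, shares)]
        shares = [total * r / sum(rates) for r in rates]
        current = overall_throughput(models, shares)
        if current <= best:
            break  # stop once the overall throughput no longer improves
        best_shares, best = shares, current
    return best_shares, best

# Hypothetical devices: (ms per photon, fixed overhead in ms) and compute units.
models = [(5e-5, 20.0), (3.3e-4, 10.0), (8e-4, 15.0)]
units = [22, 16, 2]
shares, throughput = iterative_partition(models, units, 10_000_000)
print([round(s) for s in shares], round(throughput))
```

In this made-up example the core-based split leaves the fast device waiting on the slow ones, and a few throughput-based rounds move most of the photons to the fast device.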


slide-23
SLIDE 23

PERFORMANCE RESULTS

[Figure: throughput (photons/ms) for 10M and 100M photons under the Core-based, Throughput, Iterative, and Fminimax schemes]

GTX 980 Ti + GTX 590 + GT 730 (max throughput 30323 photons/ms)

Throughput utilization   10M      100M
Core-based               35.01%   41.65%
Throughput               59.31%   93.42%
Iterative                68.85%   93.77%
Fminimax                 68.85%   93.77%

K40c + K20c (max throughput 9688 photons/ms)

Throughput utilization   10M      100M
Core-based               85.31%   97.56%
Throughput               80.39%   87.89%
Iterative                80.39%   87.89%
Fminimax                 80.39%   87.89%


slide-24
SLIDE 24

PERFORMANCE RESULTS

[Figure: throughput (photons/ms) for 10M and 100M photons under the Core-based, Throughput, Iterative, and Fminimax schemes]

AMD 7970M + Intel i7-3740QM (max throughput 4529 photons/ms)

Throughput utilization   10M      100M
Core-based               19.32%   18.69%
Throughput               18.81%   27.14%
Iterative                18.78%   27.91%
Fminimax                 18.78%   27.91%

AMD 7970 + Fiji + Intel i7-4770 (max throughput 19176 photons/ms)

Throughput utilization   10M      100M
Core-based               15.10%   19.06%
Throughput               16.38%   21.10%
Iterative                16.38%   21.10%
Fminimax                 16.38%   21.10%


slide-25
SLIDE 25

SUMMARY

  • We have improved the performance of MCX across a range of NVIDIA GPU architectures
  • We have shown how to exploit persistent-thread kernels to automatically tune the MCX kernel
  • We developed an iterative scheme to search for the best partition to run MCX on multiple accelerators
  • We obtained a 24% and 44% throughput utilization improvement (Iterative vs. Core-based) for 10M and 100M photon simulations, respectively


slide-26
SLIDE 26

FUTURE WORK

  • Instrumentation of MCX
      - Leverage SASSI to instrument MCX and better characterize the behavior of a kernel to guide auto-tuning
  • MCX on Multiple GPUs
      - Evaluate our partitioning optimization for multiple devices


slide-27
SLIDE 27

MCX CHALLENGE

  • Interested in improving the performance of MCX by over 40% compared to the current version?
      - Monetary reward will be announced soon. Stay tuned to mcx.space


slide-28
SLIDE 28

ACKNOWLEDGEMENT

  • This project is funded by the NIH/NIGMS under grant R01-GM114365
  • We would like to acknowledge NVIDIA for their support of this work through the NVIDIA Research Center program


slide-29
SLIDE 29

THANK YOU!

QUESTIONS? fninaparavecino@ece.neu.edu ylm@ece.neu.edu