Outline Problem Definition Summary of Work Done Space Filling - - PowerPoint PPT Presentation



SLIDE 1

SLIDE 2

Outline

• Problem Definition
• Summary of Work Done
• Space Filling Curves
• Bottom-Up Octree Construction on the GPU
• Timing Results
• The Fast Multipole Method
• Timing and Quality Results
• Conclusion

SLIDE 3

Problem Definition

To provide an efficient, parallel, GPU-based Global Illumination solution for point models that is many times faster than the corresponding CPU implementation.

INPUT: A 3-D point model with attributes such as 3-D coordinates, default surface diffuse color, emissivity, and surface normals.

OUTPUT: A fast parallel Global Illumination solution showing effects like color bleeding and soft shadows.

SLIDE 4

• The Global Illumination problem is an N-body problem, since each particle is affected by the presence of all other particles (quadratic in nature).
• The input data (point models in our case) is very large in size: more than 10^5 particles.
• Direct computations (on the GPU) are not possible because of the high memory requirements needed to exploit the available parallelism.
• The FMM solves the quadratic N-body problem efficiently in linear time by:
  - approximating the solution to a user-defined accuracy
  - using a hierarchical data structure (in our case, the octree)

FMM for Global Illumination on GPU?

SLIDE 5

Octree on GPU

Non Adaptive:
• Very fast but memory inefficient
• Parent-child relations calculated using direct SFC indexing
• Published as a poster in I3D, 2008

Adaptive (Top Down / Bottom Up):
• Fast
• Memory efficient as compared to the non-adaptive version
• Supports queries: post-order traversal, location of the leaf cell containing a queried point, least common ancestor of two cells, etc.
• Intend to submit as a paper for consideration

Contributions:
• FMM on GPU
• View-independent visibility on GPU: a fast method to calculate visibility between all point pairs in parallel, required for correct global illumination. Submitted to ICVGIP, 2008 as an oral paper.
• Fast parallel global illumination for point models. Intend to submit as a paper for consideration.

Acknowledgements (Rhushabh Goradia):
• Prof. Srinivas Aluru
SLIDE 6

Space Filling Curves

• Consider the recursive bisection of a 2D area into non-overlapping cells of equal size.
• A space filling curve is a mapping of these cells to a one-dimensional linear ordering.
• We consider the z-sfc, or Morton ordering.

[Figure: Z-SFC for k = 2 — index of the cell with given coordinates]
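The z-sfc index of a cell can be obtained by interleaving the bits of its coordinates. A minimal sketch of the idea (the function name `z_index` is my own, not from the slides):

```python
def z_index(x, y, k):
    """Morton (z-sfc) index of the cell with integer coordinates
    (x, y) in a 2^k x 2^k grid: interleave the bits of y and x."""
    idx = 0
    for bit in range(k):
        idx |= ((x >> bit) & 1) << (2 * bit)      # x bits at even positions
        idx |= ((y >> bit) & 1) << (2 * bit + 1)  # y bits at odd positions
    return idx

# For k = 2, visiting the 4x4 cells in increasing z_index traces the
# familiar "Z" pattern over the grid.
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: z_index(c[0], c[1], 2))
```

The first four cells visited form the top-left "Z": (0,0), (1,0), (0,1), (1,1).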

SLIDE 7

Octrees and SFCs

  • 1. Octrees can be viewed as multiple SFCs at various resolutions.
  • 2. A parent's SFC index can be generated from a child's.
  • 3. To establish a total order on the cells of an octree, given 2 cells:

a) if one is contained in the other, the subcell is taken to precede the supercell
b) if disjoint, order according to the order of the immediate subcells of the smallest supercell enclosing them

The resulting linearization is identical to a post-order traversal.

SLIDE 8

Octrees

[Figure: an octree with cells numbered 1–10 and their linearized order 7 1 3 2 9 4 8 5 6 10; note the chains of single-child internal nodes]

SLIDE 9

Compressed Octrees

[Figure: the same cells (numbered 1–10, linearized as 7 1 3 2 9 4 8 5 6 10) as a compressed octree]

• Each node in a compressed octree is either a leaf or has at least 2 children.

SLIDE 10

Memory Efficient Bottom-Up Octree on the GPU

SLIDE 11

Intuitions

INPUT: n points (say, a bunny) belonging to some 3-D domain.
OUTPUT: An octree represented in post-order, with parent-child relationships established.

BOTTOM-UP TRAVERSAL

Since every internal node in an octree has leaves in its subtree, given the leaves we can decode this hierarchical inheritance information and generate the internal nodes.

PARALLEL STRATEGY

• Each internal node can be considered as the LCA of some particular leaf pair (in a compressed octree).
• Given the leaves, generation of internal nodes can be parallelized, since each of them can be generated independently from a leaf pair.
• Many leaf pairs might have the same LCA node, resulting in duplicates, which can be easily detected and removed.
• Parent-child relationships can be established, and the octree can be generated from the compressed octree using SFC indices across multiple levels.
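The strategy can be sketched serially (the GPU runs the LCA step for all adjacent leaf pairs in parallel). SFC indices are represented as bit strings here, and the helper names are mine:

```python
def lca_index(a, b, d=3):
    """SFC index of the least common ancestor of two cells: the common
    prefix of their indices, truncated to a multiple of d (the dimension),
    since each tree level contributes d bits."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n - n % d]

def internal_nodes(leaf_indices, d=3):
    """Generate the internal nodes of a compressed octree bottom-up:
    sort the leaves by SFC index, take the LCA of every adjacent pair
    (each pair is independent, hence parallelizable), drop duplicates."""
    leaves = sorted(leaf_indices)
    return sorted({lca_index(leaves[i], leaves[i + 1], d)
                   for i in range(len(leaves) - 1)})
```

The empty string plays the role of the root's SFC index in this sketch.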

SLIDE 12

Results

Bunny (124531 points)

Tree level   GPU (ms)   CPU (ms)
5            1218       1101
6            1482       1692
7            2041       2621
8            2501       4291
9            3669       9645

SLIDE 13

Results

Ganesha (165646 points)

Tree level   GPU (ms)   CPU (ms)
5            1463       1200
6            1762       1981
7            2396       2965
8            2923       4691
9            4501       8945

SLIDE 14

Fast Multipole Method

SLIDE 15

Fast Multipole Method

The Fast Multipole Method is concerned with evaluating the effect of a set of sources X = {x_1, ..., x_N} on a set of evaluation points Y = {y_1, ..., y_M}. More formally, given the sources and evaluation points, we wish to evaluate the sum

f(y_j) = Σ_{i=1}^{N} q_i φ(x_i, y_j),   j = 1, ..., M

• Total complexity of direct evaluation: O(MN)
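Direct evaluation of this sum visits every (source, evaluation point) pair, which is the O(MN) cost the FMM avoids. A sketch with a toy 1-D kernel (the kernel choice is an assumption, made only so the example is runnable):

```python
def direct_eval(sources, charges, targets, phi):
    """Brute-force f(y_j) = sum_i q_i * phi(x_i, y_j): every target
    visits every source, so the cost is O(M*N)."""
    return [sum(q * phi(x, y) for x, q in zip(sources, charges))
            for y in targets]

# Toy 1-D kernel, an illustrative assumption.
phi = lambda x, y: 1.0 / (abs(x - y) + 1.0)

# One target at 1.0, two unit sources each at distance 1 from it.
values = direct_eval([0.0, 2.0], [1.0, 1.0], [1.0], phi)
```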

SLIDE 16

The FMM attempts to reduce this complexity to O(M + N).

• The two main insights that make this possible are:
  - factorization of the kernel into source and receiver terms
  - many application domains do not require the function to be calculated at high accuracy
• The FMM follows a hierarchical (tree-based) structure.
• Each node has associated expansion coefficients.

SLIDE 17

FMM: Building Interaction Lists

Each node has two kinds of interaction lists from which the transfer of energy takes place:

• Far Cell List
• Near Cell List
• There is no far cell list at level 1 and level 0, since at those levels every cell is a near neighbor of every other.
• Transfer of energy from near neighbors happens only for leaves.

SLIDE 18

FMM Algorithm

SLIDE 19

Step1: GPU implementation

PARALLELIZATION STRATEGIES

1) Multiple threads per leaf (one thread per particle)

One thread produces the multipole expansion for each particle in the leaf.
Drawbacks:
a) After generation, the expansions need to be consolidated, which necessitates data transfer to GPU global memory (expensive).
b) The thread block size on the GPU is fixed at run time, so some threads may remain idle.

2) One thread per leaf

One thread produces the full multipole expansion for the entire leaf.
Advantage: The work of each thread is completely independent, so there is no need for shared memory. Each leaf may have a different number of particles; a thread that finishes work for a given leaf simply takes care of another leaf, without waiting for or synchronizing with other threads.
Drawback: To realize the full GPU load, the number of leaves should be sufficiently large (at least 8192).
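The "one thread per leaf" strategy can be pictured as each thread independently producing its leaf's expansion. A serial sketch, where only the lowest-order terms (total charge and weighted centroid) stand in for a full multipole expansion — an illustrative assumption, not the actual expansion used:

```python
def leaf_multipole(points, charges):
    """The work of one 'thread': the multipole data for one leaf.
    Only the monopole term (total charge) and the charge-weighted
    centroid are kept here, standing in for a full expansion."""
    q = sum(charges)
    cx = sum(w * p for w, p in zip(charges, points)) / q
    return q, cx

# Leaves are fully independent, so on the GPU each would map to one
# thread; serially this is just a loop over leaves.
leaves = [([0.0, 2.0], [1.0, 1.0]), ([4.0], [2.0])]
expansions = [leaf_multipole(pts, qs) for pts, qs in leaves]
```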

SLIDE 20

FMM Algorithm

For each level l = lmax-1, ... 2

SLIDE 21

Step2: GPU implementation

PARALLELIZATION STRATEGY: Iterate from the last level up to the root (the root is at level 0). For every level, allocate one thread per parent node.

One thread produces the multipole-to-multipole translations for all of its children.
Drawback: The GPU load becomes very small at low lmax (maximum levels).
The upward pass is not highly compute intensive compared to the downward pass: on the CPU, the total time taken by the upward pass is about 1% of the time taken by the downward pass.
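A reduced stand-in for the multipole-to-multipole translation, using only total charge and charge-weighted centroid in place of real expansion coefficients (an illustrative assumption): one "parent thread" merges its children's data into the parent's.

```python
def m2m(children):
    """One 'thread per parent': merge the children's (charge, centroid)
    pairs into the parent's. The real FMM translates expansion
    coefficients; charge and centroid are a minimal stand-in."""
    q = sum(c[0] for c in children)
    cx = sum(c[0] * c[1] for c in children) / q
    return q, cx

# Upward pass for a parent with two children.
parent = m2m([(2.0, 1.0), (2.0, 4.0)])
```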

SLIDE 22

FMM Algorithm

Most expensive step of the algorithm.
PARALLELIZATION: Iterate from level 2 to the last level; compute the multipole-to-local translation for each node at the current level in parallel.

SLIDE 23

FMM Algorithm

PARALLELIZATION: Iterate from level 2 to the last level; compute the local-to-local translation for each node at the current level in parallel.

SLIDE 24

Only for leaves of the Quadtree

FMM Algorithm

PARALLELIZATION: Iterate from level 2 to the last level; if a node is a leaf, evaluate the local expansion for the leaves at the current level in parallel.
PARALLELIZATION: Iterate from level 2 to the last level; if a node is a leaf, each thread performs all near-neighbor computations for a particular leaf.

SLIDE 25

Results: Quality Comparisons

CPU GPU

Bunny (124531 points)

SLIDE 26

Results: Quality Comparisons

CPU GPU

Ganesha (165646 points)

SLIDE 27

Results: Timing Comparisons (without visibility)

Bunny (124531 points)

Points per leaf   GPU (hr)   CPU (hr)   GPU speedup
200               1.01       15.96      15.8
150               1.09       19.18      17.6
100               1.16       21.11      18.2
50                1.21       23.81      19.5
25                1.30       25.87      19.9

SLIDE 28

Results: Timing Comparisons (without visibility)

Ganpati (165646 points)

Points per leaf   GPU (hr)   CPU (hr)   GPU speedup
200               1.11       14.54      13.1
150               1.16       16.58      14.3
100               1.21       20.81      17.2
50                1.28       23.15      18.1
25                1.41       26.37      18.7

SLIDE 29

Conclusion

  • 1. Non-adaptive octree (speedups of up to 500 times)
  • 2. Adaptive octrees (speedups of up to 3 times)
  • 3. FMM on the GPU for global illumination (speedups of up to 20 times)

Future Work

All three steps above can be combined on the GPU to make a complete system.

SLIDE 30

References

  • L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73:325–348, 1987.
  • J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal of Scientific and Statistical Computing, 9:669–686, July 1988.
  • R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
  • H. Sagan. Space Filling Curves. Springer-Verlag, 1994.
  • S. Seal and S. Aluru. Spatial Domain Decomposition Methods in Parallel Scientific Computing. Book chapter.
  • N. A. Gumerov and R. Duraiswami. Fast Multipole Method on Graphics Processors. AstroGPU 2007.
  • John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80–113, 2007.
  • Nvidia CUDA Programming Guide. http://developer.nvidia.com/cuda

SLIDE 31

References contd.

  • P. Ajmera, R. Goradia, S. Chandran, and S. Aluru. Fast, Parallel, GPU-based Construction of Space Filling Curves and Octrees. In SI3D '08: Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 1–1, New York, NY, USA, 2008. ACM.
  • S. Lefebvre, S. Hornus, and F. Neyret. GPU Gems 2, chapter Octree Textures on the GPU, pages 595–614. Addison Wesley, 2005.
  • L. Nyland, M. Harris, and J. Prins. GPU Gems 3, chapter Fast N-Body Simulation with CUDA, pages 677–696. Addison Wesley, 2007.
  • A. Karapurkar and S. Chandran. FMM-based Global Illumination for Polygonal Models. Master's thesis, Indian Institute of Technology, Bombay, 2003.
  • R. Goradia, A. Kanakanti, S. Chandran, and A. Datta. Visibility Map for Global Illumination in Point Clouds. Proceedings of ACM SIGGRAPH GRAPHITE, 5th International Conference on Computer Graphics and Interactive Techniques, 2007.
  • CUDA Data Parallel Primitives Library. http://www.gpgpu.org/developer/cudpp

SLIDE 32
SLIDE 33

Various Stages of Octree Construction

Next: the first two steps

SLIDE 34

Step 1: Constructing Leaves

(a) Read the n points into the first n locations of an array A of size 2n − 1 (see Fig. (a)).
(b) Assuming one point per leaf, for every point, in parallel:
  i. Generate the 3D coordinates of the leaf cell to which it belongs.
  ii. Generate the SFC index for the leaf cell.
(c) Sort the first n elements of array A, in parallel, based on the SFC indices of the leaves (see Fig. (b)).
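Steps (b) and (c) can be sketched serially (on the GPU, each point gets its own thread; the helper names below are mine):

```python
def leaf_cell(p, D, k):
    """3-D integer coordinates of the leaf cell containing point p,
    after k bisections of a cube of side D (clamped to the last cell
    for points on the far boundary)."""
    return tuple(min(int(2 ** k * c / D), 2 ** k - 1) for c in p)

def sfc3(cell, k):
    """3-D Morton (z-sfc) index: interleave the bits of z, y and x."""
    x, y, z = cell
    idx = 0
    for bit in range(k):
        idx |= ((x >> bit) & 1) << (3 * bit)
        idx |= ((y >> bit) & 1) << (3 * bit + 1)
        idx |= ((z >> bit) & 1) << (3 * bit + 2)
    return idx

# Two points in the unit cube, one bisection: compute each leaf's SFC
# index, then sort (step (c)).
points = [(0.9, 0.1, 0.1), (0.1, 0.1, 0.1)]
sorted_leaves = sorted(sfc3(leaf_cell(p, D=1.0, k=1), k=1) for p in points)
```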

SLIDE 35

Step 2: Generating Internal Nodes & Post Order

In parallel, for every pair of adjacent leaves, find their LCA using the common bits (in multiples of 3, 3 being the dimension) in their SFC indices (see Fig. (c)).
E.g., if adjacent leaves L1 and L2 have SFC indices 100 101 110 010 and 100 101 100 001 respectively, then the LCA is the internal node with SFC index 100 101.
Note that duplicate nodes may also get generated.

??? Prove that all the internal nodes get generated ???

SLIDE 36

Step 2: Generating Internal Nodes & Post Order

Sort the generated internal nodes, in parallel, across levels based on their SFC indices, and remove duplicates in parallel from the latter half of the array (see Figs. (c) & (d)). To establish a total order on the cells across levels:
a) If one is contained in the other, the subcell is taken to precede the supercell.
b) If disjoint, order according to the order of the immediate subcells of the smallest supercell enclosing them.
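Rules (a) and (b) become a simple comparator on SFC index bit strings: if one index is a prefix of the other, the longer (deeper) cell comes first; otherwise a plain bitwise comparison already orders the two cells by their branches under the smallest common supercell. A sketch (the helper name is mine):

```python
from functools import cmp_to_key

def cell_cmp(a, b):
    """Total order on octree cells given as SFC index bit strings.
    (a) subcell precedes supercell: if one index is a prefix of the
        other, the longer one comes first;
    (b) disjoint cells: ordinary comparison orders them by their
        branches under the smallest enclosing supercell."""
    if a == b:
        return 0
    if a.startswith(b):
        return -1  # a is inside b, so a comes first
    if b.startswith(a):
        return 1
    return -1 if a < b else 1

# Two sibling subtrees and their common ancestor: the ancestor sorts last.
cells = ["100101", "100101110010", "100101100001"]
ordered = sorted(cells, key=cmp_to_key(cell_cmp))
```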

SLIDE 37

Step 2: Generating Internal Nodes & Post Order

Sort array A, in parallel, across levels based on the SFC indices to get the post order of a compressed octree.

??? What queries does this octree support ??? Query of NE, NW, SW neighbours ???
Empty nodes are not shown.

SLIDE 38

Step 3: Parent Child Relationships in Compressed Octree

INTUITION

  • 1. The tree is in post-order fashion.
  • 2. The LCA of every two adjacent nodes will definitely be the parent of the first node in the pair considered.

SLIDE 39

Step 3: Parent Child Relationships in Compressed Octree

SLIDE 40

Step 3: Parent Child Relationships in Compressed Octree

SLIDE 41

Step 4: Compressed Octree to Octree

SLIDE 42

Step 4: Compressed Octree to Octree

SLIDE 43

Queries Supported

IS NODE C1 CONTAINED IN NODE C2?
C1 is contained in C2 if and only if the SFC value of C2 is a prefix of the SFC value of C1.

GIVEN C2 AS A DESCENDANT OF C1, RETURN THE CHILD OF C1 CONTAINING C2
For dimension d and level l, dl is the number of bits representing C1. The required child is given by the first d(l+1) bits of C2.

LEAST COMMON ANCESTOR OF NODES C1 AND C2
The longest common prefix of the SFC values of C1 and C2 that is a multiple of the dimension d gives the least common ancestor.

PARENT-CHILD RELATIONSHIP
For dimension d and level l, if dl is the number of bits in the SFC index representing child C1, then the parent is given directly by its first d(l − 1) bits.

PARALLEL POST-ORDER TRAVERSAL
Since our output is in post-ordered form, this query is implicitly answered.

GIVEN A POINT (Px, Py, Pz), FIND THE NODE IT BELONGS TO
The coordinates of the desired node are (⌊2^k Px/D⌋, ⌊2^k Py/D⌋, ⌊2^k Pz/D⌋), where k is the number of times the space has been bisected and D is the side length of the space enclosing all points in the model.
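The prefix-based queries translate directly to operations on SFC index bit strings. A minimal sketch (the helper names are mine):

```python
def contains(c1, c2):
    """Is C1 contained in C2? True iff C2's SFC index is a prefix
    of C1's SFC index."""
    return c1.startswith(c2)

def parent(c1, d=3):
    """Parent of child C1 at level l: its first d*(l-1) bits,
    i.e. drop the last d bits."""
    return c1[:-d]

def child_toward(c1, c2, d=3):
    """Child of ancestor C1 on the path to descendant C2: the first
    d*(l+1) bits of C2, where d*l = len(C1)."""
    return c2[:len(c1) + d]
```

Each query is constant-time bit manipulation, which is why they parallelize so well on the GPU.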

SLIDE 44

GPU Optimizations

• Loop unrolling
• Optimal thread and block size
• Optimal octree depths

SLIDE 45

GPU Hardware Model

[Diagram: N multiprocessors, each with M processors, per-processor registers, shared memory, constant cache, texture cache, and an instruction unit, all connected to device memory]

• 16 multiprocessors, 8 processors per multiprocessor
• 320 MB of device memory (slow)
• 8192 registers per multiprocessor (very fast)
• 64 KB of constant memory in total (very fast)
• 16 KB of shared memory within each multiprocessor (very fast)

SLIDE 46

CUDA Programming Model

[Diagram: a kernel launched as a grid of thread blocks, each block a 2-D array of threads]

• Block of threads
• Grids of thread blocks
• Any computation that is done independently on different data many times can be isolated into a function called a kernel, which is executed on the GPU as many different threads.
• A GPU may run all the blocks of a grid sequentially if it has few parallel capabilities, or in parallel if it has many.
• Limitations:
  - No dynamic memory allocation
  - No recursion supported on the GPU