Outline
- Problem Definition
- Summary of Work Done
- Space Filling Curves
- Bottom-Up Octree Construction on the GPU
- Timing Results
- The Fast Multipole Method
- Timing and Quality Results
- Conclusion
Problem Definition
To provide an efficient, parallel, GPU-based global illumination solution for point models that is many times faster than the corresponding CPU implementation.
INPUT: a 3-D point model with attributes such as 3-D coordinates, default surface diffuse color, emissivity, and surface normals.
OUTPUT: a fast parallel global illumination solution showing effects such as color bleeding and soft shadows.
- Global illumination is an N-body problem, since each particle is affected by the presence of all other particles (quadratic in nature).
- The input data (point models in our case) is very large: more than 10^5 particles.
- Direct computation (on the GPU) is not possible because of the high memory requirements needed to exploit the available parallelism.
The FMM solves the quadratic N-body problem efficiently, in linear time, by approximating the solution to a user-defined accuracy and by using a hierarchical data structure (in our case, the octree).
FMM for Global Illumination on GPU?
Octree on GPU
- Non-adaptive: very fast but memory inefficient; parent-child relations are computed by direct SFC indexing. Published as a poster at I3D 2008.
- Adaptive (top-down and bottom-up): fast, and memory efficient compared to the non-adaptive version; we intend to submit this as a paper. Supports post-order traversal, locating the leaf cell containing a queried point, the least common ancestor of two cells, etc.
Contributions
- FMM on the GPU.
- View-independent visibility on the GPU: a fast method to calculate visibility between all point pairs in parallel, required for correct global illumination. Submitted to ICVGIP 2008 as an oral paper.
- Fast parallel global illumination for point models; we intend to submit this as a paper.
Acknowledgements: Rhushabh Goradia, Prof. Srinivas Aluru
Space Filling Curves
Consider the recursive bisection of a 2-D area into non-overlapping cells of equal size. A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering. We consider the z-SFC, or Morton ordering.
Z-SFC for k = 2: the index of the cell with coordinates (x, y) is obtained by interleaving the bits of x and y.
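The bit interleaving above can be sketched in a few lines of Python (the function name `z_index` and the convention of placing y's bits above x's are illustrative assumptions, not part of the slides):

```python
def z_index(x, y, k):
    """Morton (z-SFC) index of cell (x, y) on a 2^k x 2^k grid:
    interleave the k bits of x and y, with y's bits in the odd positions."""
    idx = 0
    for b in range(k):
        idx |= ((x >> b) & 1) << (2 * b)       # x bit -> even position
        idx |= ((y >> b) & 1) << (2 * b + 1)   # y bit -> odd position
    return idx
```

For k = 2 this enumerates the 16 cells in the familiar recursive "Z" pattern.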
Octrees and SFCs
- 1. Octrees can be viewed as multiple SFCs at various resolutions.
- 2. A parent's SFC index can be generated from a child's SFC index.
- 3. To establish a total order on the cells of an octree, given two cells:
a) if one is contained in the other, the subcell is taken to precede the supercell; b) if disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them.
The resulting linearization is identical to a post-order traversal.
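The two ordering rules can be sketched as a comparison on (SFC index, level) pairs; `precedes` is a hypothetical helper, shown here for dimension d = 2, where a cell at level l has d*l index bits:

```python
def precedes(c1, c2, d=2):
    """True if cell c1 comes before c2 in the total order.
    A cell is a pair (sfc_index, level) with d*level index bits."""
    i1, l1 = c1
    i2, l2 = c2
    l = min(l1, l2)
    p1 = i1 >> (d * (l1 - l))   # prefix of c1 at the shallower level
    p2 = i2 >> (d * (l2 - l))
    if p1 == p2:
        # one cell contains the other: the subcell precedes the supercell
        return l1 > l2
    # disjoint: order by the subcells of the smallest enclosing supercell
    return p1 < p2
```

Note that a child preceding its parent is exactly the post-order property mentioned above.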
Octrees
[Figure: an octree with numbered nodes; chains of single-child internal nodes are highlighted.]
Compressed Octrees
[Figure: the corresponding compressed octree, with the chains removed.]
Each node in a compressed octree is either a leaf or has at least two children.
Memory Efficient Bottom-Up Octree on the GPU
Intuitions
INPUT: n points (say, a bunny) belonging to some 3-D domain.
OUTPUT: an octree represented in post order, with parent-child relationships established.
BOTTOM-UP TRAVERSAL
Since every internal node of an octree has leaves in its subtree, given the leaves we can decode this hierarchical inheritance information and generate the internal nodes.
PARALLEL STRATEGY
- Each internal node can be considered the LCA of some particular leaf pair (in a compressed octree).
- Given the leaves, generation of the internal nodes can be parallelized, since each of them can be generated independently from a leaf pair.
- Many leaf pairs may have the same LCA, producing duplicates, which can easily be detected and removed.
- Parent-child relationships can be established, and the octree generated from the compressed octree, using SFC indices across multiple levels.
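A serial Python sketch of this strategy (on the GPU, each adjacent leaf pair would be handled by its own thread; function names are illustrative):

```python
def lca_index(a, b, bits, d=3):
    """LCA of two leaf cells: the longest common prefix of their SFC
    indices whose length is a multiple of d bits. Returns (index, level)."""
    levels = bits // d
    for l in range(levels, 0, -1):
        shift = bits - d * l
        if a >> shift == b >> shift:
            return a >> shift, l
    return 0, 0  # no common prefix: the root

def internal_nodes(sorted_leaf_sfc, bits, d=3):
    """Internal nodes of a compressed octree: the LCA of every pair of
    adjacent leaves (in SFC order), with duplicates removed via a set."""
    nodes = {lca_index(x, y, bits, d)
             for x, y in zip(sorted_leaf_sfc, sorted_leaf_sfc[1:])}
    return sorted(nodes)
```

With the example from the slides (leaves 100 101 110 010 and 100 101 100 001), the LCA is the node with SFC index 100 101 at level 2.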
Results: Bunny (124,531 points)

Tree level   GPU (ms)   CPU (ms)
5            1218       1101
6            1482       1692
7            2041       2621
8            2501       4291
9            3669       9645
Results: Ganesha (165,646 points)

Tree level   GPU (ms)   CPU (ms)
5            1463       1200
6            1762       1981
7            2396       2965
8            2923       4691
9            4501       8945
Fast Multipole Method
The Fast Multipole Method is concerned with evaluating the effect of a "set of sources" x_1, ..., x_N on a set of "evaluation points" y_1, ..., y_M. More formally, given source strengths u_i, we wish to evaluate the sum

f(y_j) = Σ_{i=1}^{N} u_i φ(x_i, y_j),   j = 1, ..., M

Total complexity: O(NM). The FMM attempts to reduce this complexity to O(N + M).
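The quadratic baseline that the FMM approximates can be written down directly (the function name and generic `kernel` parameter are illustrative):

```python
def direct_sum(sources, strengths, targets, kernel):
    """Direct evaluation of f(y_j) = sum_i u_i * kernel(x_i, y_j).
    N sources and M targets cost O(N*M) kernel evaluations; this is the
    quadratic computation the FMM reduces to O(N + M)."""
    return [sum(u * kernel(x, y) for x, u in zip(sources, strengths))
            for y in targets]
```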
The two main insights that make this possible are:
- factorization of the kernel into source and receiver terms;
- many application domains do not require the function to be calculated at high accuracy.
The FMM follows a hierarchical, octree-based evaluation. Each node has associated multipole and local expansions.
Each node has two kinds of interaction lists through which the transfer of energy takes place:
- Far cell list
- Near cell list
There is no far cell list at levels 0 and 1, since at those levels every cell is a near neighbour of every other. Transfer of energy from near neighbours happens only at the leaves.
FMM: Building Interaction Lists
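A minimal sketch of splitting same-level cells into near and far lists using integer cell coordinates (in the full FMM the far list is further restricted to children of the parent's near neighbours; the helper name is an assumption):

```python
def classify_same_level(cell, others):
    """Split cells at the same octree level into near (touching the given
    cell, Chebyshev distance <= 1) and far lists."""
    near, far = [], []
    cx, cy, cz = cell
    for ox, oy, oz in others:
        if max(abs(cx - ox), abs(cy - oy), abs(cz - oz)) <= 1:
            near.append((ox, oy, oz))
        else:
            far.append((ox, oy, oz))
    return near, far
```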
FMM Algorithm
Step 1: GPU implementation
PARALLELIZATION STRATEGIES
1) Multiple threads per leaf (one thread per particle): one thread produces the multipole expansion for each particle in the leaf.
Drawbacks: a) after generation, the expansions must be consolidated, which necessitates data transfers to GPU global memory (expensive); b) the thread block size on the GPU is fixed at run time, so some threads may remain idle.
2) One thread per leaf: one thread produces the full multipole expansion for the entire leaf.
Advantages: the work of each thread is completely independent, so there is no need for shared memory. Leaves may contain different numbers of particles; a thread that finishes one leaf simply takes up another, without waiting for or synchronizing with other threads.
Drawback: to realize the full GPU load, the number of leaves should be sufficiently large (at least 8192).
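As a simplified stand-in for what each "one thread per leaf" computes, the sketch below accumulates monomial moments of a leaf's particles about the cell centre (a 1-D analogue of a multipole expansion; the names and the 1-D setting are assumptions, not the actual expansion used):

```python
def leaf_moments(points, strengths, center, order=2):
    """Monomial moments m_a = sum_i u_i * (x_i - c)^a for a = 0..order,
    computed for one leaf; each GPU thread would run this loop
    independently for its own leaf."""
    return [sum(u * (x - center) ** a for x, u in zip(points, strengths))
            for a in range(order + 1)]
```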
FMM Algorithm
Step 2: GPU implementation
For each level l = lmax − 1, ..., 2:
PARALLELIZATION STRATEGY: iterate from the last level up towards the root (the root is at level 0). At every level, allocate one thread per parent node: each thread produces the multipole-to-multipole translations for all of that node's children.
Drawback: the GPU load becomes very small at low lmax (maximum number of levels). The upward pass is not highly compute-intensive compared to the downward pass: on the CPU it takes only about 1% of the time of the downward pass.
FMM Algorithm
The most expensive step of the algorithm. PARALLELIZATION: iterate from level 2 to the last level; compute the multipole-to-local translation for each node at the current level in parallel.
FMM Algorithm
PARALLELIZATION: iterate from level 2 to the last level; compute the local-to-local translation for each node at the current level in parallel. The final evaluation is performed only at the leaves of the octree.
FMM Algorithm
PARALLELIZATION: iterate from level 2 to the last level; if a node is a leaf, evaluate its local expansion in parallel with the other leaves at the current level. PARALLELIZATION: iterate from level 2 to the last level; if a node is a leaf, one thread performs all near-neighbour computations for that leaf.
Results: Quality Comparisons
[Figure: CPU and GPU renderings of the Bunny (124,531 points).]
Results: Quality Comparisons
[Figure: CPU and GPU renderings of the Ganesha (165,646 points).]
Results: Timing Comparisons (without visibility) — Bunny (124,531 points)

Points per leaf   GPU (hr)   CPU (hr)   GPU speedup
200               1.01       15.96      15.8
150               1.09       19.18      17.6
100               1.16       21.11      18.2
50                1.21       23.81      19.5
25                1.30       25.87      19.9
Results: Timing Comparisons (without visibility) — Ganesha (165,646 points)

Points per leaf   GPU (hr)   CPU (hr)   GPU speedup
200               1.11       14.54      13.1
150               1.16       16.58      14.3
100               1.21       20.81      17.2
50                1.28       23.15      18.1
25                1.41       26.37      18.7
Conclusion
- 1. Non-adaptive octree construction (speedups of up to 500 times)
- 2. Adaptive octrees (speedups of up to 3 times)
- 3. FMM on the GPU for global illumination (speedups of up to 20 times)
Future Work
All three steps above can be performed on the GPU to build a complete system.
References
- L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.
- J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal of Scientific and Statistical Computing, 9:669–686, July 1988.
- R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
- H. Sagan. Space Filling Curves. Springer-Verlag, 1994.
- S. Seal and S. Aluru. Spatial Domain Decomposition Methods in Parallel Scientific Computing. Book chapter.
- N. A. Gumerov and R. Duraiswami. Fast Multipole Method on Graphics Processors. AstroGPU 2007.
- J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.
- Nvidia CUDA Programming Guide. http://developer.nvidia.com/cuda
- P. Ajmera, R. Goradia, S. Chandran, and S. Aluru. Fast, parallel, GPU-based construction of space filling curves and octrees. In SI3D '08: Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 1–1, New York, NY, USA, 2008. ACM.
- S. Lefebvre, S. Hornus, and F. Neyret. Octree Textures on the GPU. In GPU Gems 2, pages 595–614. Addison Wesley, 2005.
- L. Nyland, M. Harris, and J. Prins. Fast N-Body Simulation with CUDA. In GPU Gems 3, pages 677–696. Addison Wesley, 2007.
- A. Karapurkar and S. Chandran. FMM-based global illumination for polygonal models. Master's thesis, Indian Institute of Technology, Bombay, 2003.
- R. Goradia, A. Kanakanti, S. Chandran, and A. Datta. Visibility map for global illumination in point clouds. In Proceedings of ACM SIGGRAPH GRAPHITE, 5th International Conference on Computer Graphics and Interactive Techniques, 2007.
- CUDA Data Parallel Primitives Library. http://www.gpgpu.org/developer/cudpp
Various Stages of Octree Construction
Next: the first two steps
Step 1: Constructing Leaves
(a) Read the n points into the first n locations of an array A of size 2n − 1 (see Fig. (a)).
(b) Assuming one point per leaf, for every point, in parallel:
- i. generate the 3-D coordinates of the leaf cell to which it belongs;
- ii. generate the SFC index for that leaf cell.
(c) Sort the first n elements of array A, in parallel, based on the SFC indices of the leaves (see Fig. (b)).
Step 2: Generating Internal Nodes & Post Order
In parallel, for every pair of adjacent leaves, find their LCA using the common bits (a multiple of 3, 3 being the dimension) in their SFC indices (see Fig. (c)).
- E.g., if adjacent leaves L1 and L2 have SFC indices 100 101 110 010 and 100 101 100 001 respectively, then their LCA is the internal node with SFC index 100 101. Note that duplicate nodes may also be generated.
Question: prove that all the internal nodes get generated.
Step 2: Generating Internal Nodes & Post Order
Sort the generated internal nodes, in parallel, across levels based on their SFC indices, and remove duplicates in parallel from the latter half of the array (see Figs. (c) & (d)). To establish a total order on the cells across levels: a) if one is contained in the other, the subcell is taken to precede the supercell; b) if disjoint, order according to the order of the immediate subcells of the smallest supercell enclosing them.
Step 2: Generating Internal Nodes & Post Order
Sort array A, in parallel, across levels based on the SFC indices to obtain the post order of a compressed octree.
Questions: what queries does this octree support? Queries for NE, NW, SW neighbours? (Empty nodes are not shown.)
Step 3: Parent Child Relationships in Compressed Octree
INTUITION
- 1. The tree is in post-order fashion.
- 2. The LCA of any two adjacent nodes is necessarily the parent of the first node of the pair.
Step 4: Compressed Octree to Octree
Queries Supported
IS NODE C1 CONTAINED IN NODE C2? C1 is contained in C2 if and only if the SFC value of C2 is a prefix of the SFC value of C1.
GIVEN C2 AS A DESCENDANT OF C1, RETURN THE CHILD OF C1 CONTAINING C2. For dimension d and level l, dl is the number of bits representing C1; the required child is given by the first d(l+1) bits of C2.
LEAST COMMON ANCESTOR OF NODES C1 AND C2. The longest common prefix of the SFC values of C1 and C2 that is a multiple of the dimension d gives the least common ancestor.
PARENT-CHILD RELATIONSHIP. For dimension d and level l, if dl is the number of bits in the SFC index of child C1, then the parent is given directly by its first d(l − 1) bits.
PARALLEL POST-ORDER TRAVERSAL. Since our output is in post-ordered form, this query is answered implicitly.
GIVEN A POINT (Px, Py, Pz), FIND THE NODE IT BELONGS TO. The coordinates of the desired node are (⌊2^k Px/D⌋, ⌊2^k Py/D⌋, ⌊2^k Pz/D⌋), where k is the number of times the space has been bisected and D is the side length of the space enclosing all points in the model.
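These prefix-based queries can be sketched directly on integer SFC indices (function names are illustrative; a cell at level l has d*l index bits):

```python
def contains(sfc1, l1, sfc2, l2, d=3):
    """C1 = (sfc1, l1) is contained in C2 = (sfc2, l2) iff C2's SFC index
    is a prefix of C1's (and C2 is at a shallower or equal level)."""
    return l2 <= l1 and (sfc1 >> (d * (l1 - l2))) == sfc2

def child_containing(sfc2, l2, l1, d=3):
    """Child of the level-l1 ancestor that contains descendant C2:
    the first d*(l1 + 1) bits of C2's SFC index."""
    return sfc2 >> (d * (l2 - (l1 + 1)))

def parent(sfc, level, d=3):
    """Parent of a cell: drop the last d bits of the child's SFC index."""
    return sfc >> d, level - 1

def locate(p, D, k):
    """Leaf-cell coordinates of point p = (px, py, pz): floor(2^k * c / D)
    per axis, for k bisections of a cube of side length D."""
    return tuple(int((2 ** k) * c / D) for c in p)
```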
GPU Optimizations
- Loop unrolling
- Optimal thread and block size
- Optimal octree depths
GPU Hardware Model
[Figure: GPU hardware model — N multiprocessors, each with M processors, per-processor registers, shared memory, an instruction unit, and constant/texture caches, all backed by device memory.]
- 16 multiprocessors, 8 processors per multiprocessor
- 320 MB of device memory (slow)
- 8192 registers per multiprocessor (very fast)
- 64 KB of constant memory in total (very fast)
- 16 KB of shared memory within each multiprocessor (very fast)
CUDA Programming Model
[Figure: a kernel launched over a grid of thread blocks, each block a 2-D array of threads.]
Any computation that is performed independently on different data many times can be isolated into a function, called a kernel, that is executed on the GPU as many different threads. A GPU with few parallel resources may run all the blocks of a grid sequentially, while one with many parallel resources runs them in parallel.
Limitations: no dynamic memory allocation and no recursion are supported on the GPU.