WO2009063319A2 - Hiérarchies de volumes englobants superficielles pour un lancer de rayon accéléré - Google Patents

Hiérarchies de volumes englobants superficielles pour un lancer de rayon accéléré Download PDF

Info

Publication number
WO2009063319A2
WO2009063319A2 PCT/IB2008/003402 IB2008003402W WO2009063319A2 WO 2009063319 A2 WO2009063319 A2 WO 2009063319A2 IB 2008003402 W IB2008003402 W IB 2008003402W WO 2009063319 A2 WO2009063319 A2 WO 2009063319A2
Authority
WO
WIPO (PCT)
Prior art keywords
computer
ray
scene
further improvement
tracing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/IB2008/003402
Other languages
English (en)
Other versions
WO2009063319A3 (fr
Inventor
Holger Dammertz
Alexander Keller
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia ARC GmbH
Original Assignee
Mental Images GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mental Images GmbH filed Critical Mental Images GmbH
Publication of WO2009063319A2 publication Critical patent/WO2009063319A2/fr
Publication of WO2009063319A3 publication Critical patent/WO2009063319A3/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three-dimensional [3D] modelling for computer graphics
    • G06T17/005Tree description, e.g. octree, quadtree
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00Three-dimensional [3D] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00Three-dimensional [3D] image rendering
    • G06T15/06Ray-tracing

Definitions

  • the present invention relates to systems, methods, devices, architectures, apparatus and computer software/program code products for computer graphics.
  • Photorealistic image synthesis typically requires the simulation of global illumination, which can typically only be approximated in a consistent manner by using ray tracing. Contrary to rasterization, ray tracing allows one to follow arbitrary transport paths of light.
  • Algorithms that perform physically correct simulations of light transport tend to shoot rays as wide-spread as possible in order to increase efficiency. As a result, most rays that account for global illumination effects are incoherent.
  • WaId, L "Realtime Ray Tracing and Interactive Global Illumination,” Ph.D. Thesis, Saarland University (2004). WaId, I.; Benthin, C; Wagner, M.; and Slusallck, P., “Interactive Rendering with Coherent Ray Tracing,” EUROGRAPHICS 2001 , 20(3): 153-164 (2001).
  • SIMD instructions arc typically used in a conventional manner by using multiple objects simultaneously, e.g., 4 triangles at a time, 4 bounding boxes at a time, and several split planes at a time. Compared to tracing packets, this has the major disadvantage that memory bandwidth is not reduced.
  • the invention provides systems, devices, methods and computer program code (software) products for, among other aspects and possible applications, enabling computer graphics systems to accurately and efficiently render images.
  • Systems, devices, methods and computer program code (software) products in accordance with the invention are suitable for implementation or execution in, or in conjunction with, a computer graphics system including a computer for rendering images for storage or for display, such as on a display element, wherein the rendering of an image comprises utilizing the computer and/or other elements of the computer graphics system to generate pixel values corresponding to pixels in an image representation.
  • Systems, devices, methods and computer program code (software) products in accordance with the present invention are suitable for implementation or execution in, or in conjunction with, a wide range of commercially available computer graphics systems or software environments, such as those available from MENTAL IMAGES GMBH of Berlin, Germany.
  • One aspect of the present invention relates to a computer graphics system comprising a computer and a display element, the display element being operable to display a human- perceptible image in response to a display-controlling electrical output from the computer, the computer being operable to generate the display-controlling electrical output based on calculations of pixel values for pixels in the image, respective pixel values being representative of at least one point in a scene as recorded on an image plane of a simulated camera, the computer being operable to generate pixel values for an image using a ray-tracing methodology, the ray- tracing methodology comprising the simulating of at least one ray shot from the pixel into a scene, or from the scene to a pixel, along a selected direction, the ray-tracing methodology further comprising the calculating of the intersections of rays and surfaces of objects in the scene and the simulating of trajectories of rays illuminating objects in the scene.
  • the invention comprises methods, systems, devices and computer program code (software) for accelerating ray tracing by using acceleration data structures with high arity to enable processing of nodes using streaming SIMD (Single Instruction, Multiple Data) instructions with reduced memory requirements.
  • SIMD Single Instruction, Multiple Data
  • the invention comprises methods, systems, devices and computer program code (software) employing shallow bounding volume hierarchies (BVH) to enable rapid tracing of incoherent rays using streaming SIMD (Single Instruction, Multiple Data) instructions.
  • VH shallow bounding volume hierarchies
  • SIMD instructions are used in construction of bounding volume hierarchies, tree traversal and leaf intersection.
  • streaming SIMD instructions is enabled by increasing the arity of acceleration structures, which also reduces memory requirements.
  • caching can be utilized to accelerate processing of shadow rays in a global illumination process without increasing memory requirements.
  • sorted traversal can be utilized.
  • a further aspect of the invention includes methods, devices, systems or computer software code products (software) applying streaming SIMD instruction processing to leaf intersection, using a cache storing image-related triangles in SIMD-compatible form.
  • the SIMD-compatible form can include a vertex- index representation of triangles or other compressed representation.
  • the invention can also comprise sorting an additional index array.
  • flattening binary trees comprises collapsing a classical binary tree.
  • Collapsing a classical binary tree can comprise keeping each k-th level of a given tree, and discarding intermediate levels.
  • Another embodiment or practice of the invention can comprise utilizing a selected stack- based tree traversal technique.
  • Yet a further aspect of the invention includes methods, devices, systems or computer software code products (software) utilizing any of (or a combination of) triangle intersection processing, SIMD caching and a vertex-index data structure or other compressed representation.
  • aspects of the invention can include techniques for accelerating processing of shadow rays for point-light-based global illumination, which can comprise any of (or a combination of): quitting traversal on first intersection found, omitting sorting of nodes pushed on the stack, not computing intersection distances, and starting traversal deeper in the tree instead of from a root node.
  • the techniques for accelerating processing of shadow rays can also include utilizing a backtracking technique that enables a plurality of nodes of the tree to be used as an entry point that yields correct results.
  • a further aspect of the invention includes methods, devices, systems or computer software code products (software) for executing a numerically robust triangle subdivision technique that subdivides triangles only when required, and wherein a degree of subdivision is automatically selected.
  • the triangle subdivision technique can comprise an edge volume technique that evaluates the tightness of the bounding box of each triangle edge and subdivides the triangle until a selected threshold value is met.
  • the edge volume technique can further include determining, for each edge of a given triangle, its axis-aligned bounding box.
  • the edge volume technique can also include comparing, for a given triangle, the volume of the largest of the three boxes to a volume threshold, and if the volume is larger than the threshold, subdividing the triangle in the middle of the corresponding edge, and repeating the procedure for the two new triangles until the process is complete.
  • the edge volume threshold is determined as a fraction of the volume of the scene bounding box, thus controlling the number of triangles generated by subdivision.
  • the procedures described herein can be applied either as a pre-process before ray tracing, or for on-demand subdivision during ray tracing.
  • Another aspect of the invention can include reducing overlap of bounding volumes of geometry of a scene. Reducing overlap can comprise caching hierarchies for selected small portions of image components.
  • the invention can include methods, devices, systems or computer software code products (software) for generating high arity data structures by collapsing classic data structure hierarchies by merging levels of the hierarchies.
  • acceleration data structures are cached, and may be employed to construct a top-level acceleration data structure.
  • the techniques and procedures described herein of employing shallow BVH and using streaming SIMD instructions, can be applied to at least one spatial partitioning scheme, which can include any of, or a combination of, ⁇ d-trees and BSP-trees.
  • the invention can also include applying streaming SIMD instructions to at least one object list partitioning scheme.
  • the object list partitioning scheme can include bounding volume hierarchies.
  • Another aspect of the invention can also include, in addition to one or a number of the techniques described above, executing at least one operation that traces packets of rays.
  • Still a further aspect of the invention comprises processing a non-axis aligned geometry using a partitioning heuristic.
  • the partitioning heuristic can be applied in any tessellation in order to provide crack free tessellation.
  • FIGS. 1-5 are a series of diagrams illustrating the splitting of a bounding volume according to a BVH technique.
  • FIGS. 6 and 7 are a pair of side-by-side diagrams illustrating the difference between a binary BVH scheme and a QBVH scheme.
  • FIG. 9 is a diagram illustrating how, using split plane indices stored during tree construction, the sign of the respective component of the ray direction vector determines whether or not to swap the two corresponding children.
  • FIG. 10 is a table setting forth a comparison of axes-based and distance-based sorting.
  • FIG. 1 1 is a table showing measured rendering performance from invocation of the renderer to the final image for three exemplary scenes.
  • FIG. 12 is a table providing statistics about tree quality and ray tracing behavior of different acceleration kernels.
  • FIG. 13 is a table providing a general idea of the effectiveness of the use of SIMD ins tractions.
  • FIG. 14 is a table showing the performance of acceleration structure with a priori bounded memory.
  • FIG. 15 is a table comparing rendering times and memory usage of a flat triangle layout with 36 bytes per triangle and a vertex-index based representation.
  • FIG. 16 is a diagram of a portion of a scene illustrating an aspect of the invention in the setting of a point- light-source-based global illumination renderer.
  • FIG. 17 is a table setting forth the number of box intersections and total time to image for a normal QBVH version and an entry point cache version.
  • FIGS. 18A-D are a series of diagrams illustrating recursive subdivision of a triangle along the longest edges.
  • FIG. 19 is a table setting forth factors of best, average, and worst surface area reduction resulting from applying the edge volume heuristic for triangle subdivision.
  • FIG. 20 shows graphs illustrating the relative increase in the number of triangles and corresponding rendering time for a range of threshold parameters t.
  • FIG. 21 shows graphs illustrating performance figures and number of generated triangle references over animation for several threshold parameters /.
  • FIGS. 22-25 are a series of exemplary code listings for various aspects of the invention.
  • FIGS. 26, 27A and 27B are a series of schematic diagrams illustrating a suitable digital processing environment for use in practice various described aspects of the present invention.
  • FIG. 28 is a diagram illustrating an overall method in accordance with a first aspect of the present invention.
  • FIG. 29 is diagram of a ray tracing procedure, illustrating the problem of self-intersection.
  • FIG. 30 shows a diagram, in elevation view, of a partitioned axis-aligned bounding box that is used as an acceleration data structure in accordance with a further aspect of the invention.
  • FIGS. 31-33 are a series of diagrams, in isometric view, of the axis-aligned bounding box shown in FIG. 30, illustrating the partitioning of the bounding box with L- and R-planes.
  • FIGS. 34 and 35 are flowcharts of ray tracing methods providing efficient construction of data acceleration structures.
  • FIG. 36 is a diagram of a general processing system according to various described aspects of the invention.
  • FIG. 37 is a flowchart of a general technique according to various described aspects of the invention.
  • Photorealistic image synthesis is a computationally demanding task that relies on ray tracing for the evaluation of integrals. Rendering time is dominated by tracing long paths that are very incoherent by construction.
  • the present invention therefore provides systems and techniques in which SIMD (Single Instruction, Multiple Data) instructions are used to accelerate incoherent rays.
  • SIMD Single Instruction, Multiple Data
  • SIMD is used in hierarchy construction, tree traversal and leaf intersection. This is achieved by increasing the arity of acceleration structures, which also reduces memory requirements. As shown below, the resulting hierarchies can be built quickly and are smaller than known acceleration structures, while at the same time outperforming them for incoherent rays.
  • the new acceleration structure speeds up ray tracing by a factor of 1.6 to 2.0 compared to a highly optimized bounding interval hierarchy implementation, and 1.3 to 1.6 compared to an efficient /cd-tree. At the same time, the memory requirements are reduced by 10-50%. Additionally, there is shown how a caching mechanism in conjunction with this memory efficient hierarchy can be used to speed up shadow rays in a global illumination algorithm without increasing the memory footprint. This optimization decreased the number of traversal steps up to 50%.
  • Bounding volume hierarchies belong to the simplest and most efficient acceleration schemes for ray tracing. Memory requirements and memory latency have been recognized as bottlenecks and fast tree construction has been investigated. Significant efforts have been expended with respect to the exploitation of SIMD instructions in modern processor environments for coherent packets of rays.
  • a technique according to the present invention increases the arity of the acceleration hierarchy, which saves memory for nodes and pointers.
  • the presently described techniques are applied in the context of efficient SIMD processing, and without the drawbacks of limited precision and median split tree construction.
  • a further aspect of the invention shows how to use the memory layout of the new acceleration structure to contain additional data used to transparently speed up coherent rays.
  • the described technique uses a form of backtracking without additional memory requirements and with entry points at every node.
  • the present invention further teaches how to construct the acceleration structure in this case while still minimizing the cost function.
  • the presently described data structure supports sorted traversal, which has been found to be crucial for good performance on current CPU architectures, especially for intersection rays.
  • Bandwidth problems are an issue for efficient streaming processing.
  • the present invention reduces memory consumption by flattening the hierarchy and favoring larger leaves. This approach also allows for efficient streaming intersection of primitives. Axis-aligned bounding boxes, which are stored as a minimum and a maximum corner point, are used for bounding volumes. A detailed description of the memory layout is given below in Section 1.2.1.
  • n 4
  • QBVH Quad-BVH
  • FIGS. 1-5 are a series of diagrams illustrating the splitting of a bounding volume 20 according to a BVH technique. As discussed further below, the same splitting technique may be used both in a binary BVH technique and the presently described QBVH technique.
  • rectangle x represents an axis-aligned bounding box containing a portion of a scene that includes four primitives, i.e., triangles A, B, C, and D.
  • line 22 represents a proposed axis-aligned splitting plane candidate that is used to separate the primitives contained in box x into two groups, i.e., a first group comprising triangles A and C, and a second group comprising triangles B and D.
  • rectangles vl andj2 represent axis-aligned bounding boxes that contain, respectively, the first and second groups of primitives.
  • lines 24 and 26 represent proposed "second generation" axis-aligned splitting plane candidates that are used to separate the respective groups of primitives in axis-aligned bounding boxes vl and yl into subgroups.
  • the primitives in bounding box j'l are separated into two subgroups, one containing the single primitive A, and the other containing the single primitive C.
  • the primitives in bounding box v2 are separated into two subgroups, one containing the single primitive B, and the other containing the single primitive D.
  • rectangles a, b, c and d represent axis-aligned bounding boxes that contain, respectively, individual primitives A, B, C and D.
  • nodes are created after each split.
  • inner nodes i.e., "branch nodes”
  • leaf nodes would be created for boxes ⁇ , b, c and d.
  • FIGS. 6 and 7 are a pair of side-by-side diagrams illustrating the difference between a binary BVH scheme and the presently described QBVH scheme.
  • a node is created for bounding box x.
  • splitting bounding box x yields bounding volumes vl and yl.
  • Internal branch nodes are created for each of these bounding boxes, which are then split again, resulting in bounding volumes a, b, c, and d.
  • splitting terminates at this point, and leaf nodes arc created for each of bounding volumes a, b, c and d.
  • each node contains the four bounding volumes of its children.
  • the top-level node of the tree already contains four bounding boxes, which can be processed using SIMD.
  • the four child pointers and the three axes of the proposed split planes are stored for each inner node. These split axes arc used for an efficient, sorted traversal similar to binary BVHs as described in Section 1.2.3.
  • Section 1.2.4 the paradigm of streaming processing and low memory consumption is applied to leaf intersection. This is achieved using a small cache storing the triangles in a SIMD- friendly form. This also allows more memory-efficient, but slower, representations of the triangle data without a major speed impact.
  • the triangle data is often sorted directly during tree construction for more efficient memory access in the leaf intersection.
  • this is not an option in production rendering systems, as other parts of the Tenderer may depend on the original order. For this reason, an additional index array is sorted instead.
  • the following data layout was chosen for the tree nodes:
  • the result is a large BVH node with a size of 128 bytes, which perfectly matches the typical cache size of modern computer hardware.
  • the four bounding boxes are stored in structure-of-arrays (SoA) layout for direct processing in SIMD registers.
  • SoA structure-of-arrays
  • the axes could be packed into a single value.
  • the integer (fill) is used to store additional data for optimizations later on.
  • the tree data is kept in a linear array of memory locations, so the child pointer can be stored as integer indices instead of using platform-dependent pointers. It is important to note that for a leaf, no additional (SIMD_BVH_Node) is created.
  • the leaf bounding box is already stored in the parent, and the leaf data can be encoded directly into the corresponding (child) integer.
  • the sign of the child index is used to encode whether a node is a leaf or an inner node.
  • the most memory-efficient approach is to directly encode the leaf information into the remaining 31 bits. Without loss of generality, 4 bits are chosen for the number of triangles in the leaf, and the remaining 27 bits are chosen for the start index in the triangle array. Since the triangle intersection is performed using SIMD, the number is a multiple of 4. So up to 64 triangles per leaf can be stored and up to 2 27 triangles can be indexed. Empty leaves are encoded in a special value (INT_MIN). Thus, the 4 bits can be used for the full 64 (i.e., 16> ⁇ 4) triangles. Note that when using bounding volume hierarchies the number of triangles per leaf is easily bounded, because forcing another split guarantees a reduction in the number of triangles.
  • Classical build methods are used to create a binary tree. The tree is then collapsed by keeping each fc-th level of the tree, and discarding the intermediate levels. This results in 2* bounding volumes per node.
  • Min-max binning with 8 bins was used to approximate the surface area heuristic (SAH).
  • SAH surface area heuristic
  • This implementation is vectorized by the compiler and efficiently uses SIMD instructions. Only a fraction of the primitives is used for the SAH approximation in higher levels of the tree. Further speedup is achieved by precomputing the bounding boxes of the triangles in a SIMD layout, overwriting the triangle data. This way, the implementation can benefit from SIMD operations during construction without using additional memory.
  • the triangle data is filled in. (If stored on disk, the triangle data is reloaded from disk.)
  • the construction is additionally sped up by allowing more primitives per leaf, resulting in flatter trees. In all measurements, a leaf is created whenever the primitive count drops below 17. Using this implementation, it is possible to get very fast tree construction times that are even comparable to the BIH.
  • FIG. 8 illustrates that the resulting tree consumes only a fraction of the memory needed by the original tree, at the expense of larger nodes. But this fits modern streaming SIMD architectures well. A large chunk of aligned data can be loaded and worked on in parallel.
  • the stack-based traversal algorithm starts by simultaneously intersecting the ray with the four bounding boxes contained in the root (SIMD_BVH_Node) using SIMD instructions.
  • the pointers of the children with a non-empty bounding box intersection are sorted and then pushed onto the stack.
  • the routine is repeated by popping the next element as long as there is one.
  • the ray is prepared prior to traversal by replicating the values for the maximum ray distance (tf ar), the origin, and reciprocal of the direction across a SIMD register. Additionally the signs of the components of the ray direction are stored as integers to allow easy indexing of the near and far bounding box sides.
  • the nodes are pushed onto the stack according to the order determined by the ray direction.
  • the sign of the respective component of the ray direction vector determines whether or not to swap the two corresponding children, as illustrated in FIG. 9. This is similar to the order a Ad-tree would define using the same split plane proposals.
  • the four child pointers are loaded into a SIMD register.
  • an SIMD mask is constructed for each of the three axes stored in the BVH node based on the sign of the ray direction of that component.
  • this mask is used to select either the original value or a shuffled value resulting in a correctly ordered SIMD register.
  • the contents of the SIMD child register are pushed onto the stack in a scalar way. This is done by pushing a child pointer and then decrementing the stack top pointer by the corresponding entry of the reordered result mask from the bounding box intersection. For this purpose, this mask has to be ordered in the same way as the child pointers to push the correct children onto the stack. The decrement works because the result mask contains all ones in case of an intersection, which equals -1 as an integer, and all O's else.
  • an implementation using cascaded branches based on the ray direction signs and split axes is equally fast on an Intel Core 2 processor in the general case and even slightly faster when only primary rays are traced. This may be due to the sophisticated branch prediction of this particular processor.
  • the implementation is also much simpler than the SIMD sorting.
  • the first decision based on (axisO ) chooses which two of the four children should be intersected first. In FIG. 9, this corresponds to the decision whether to visit (A, C) or (B, D) first.
  • the second and third decision sort each of the child pairs again, resulting in (A, Q or (C, A) and (B, D) or (D, B).
  • FIG. 10 is a table 100 setting forth a comparison of the two methods.
  • Table 100 shows the number of bounding box intersections, in millions, using ( 1 ) axes-based sorting and (2) distance to box entry point, measured using path tracing (512 x 512 with 16 samples per pixel).
  • the triangle data in the leaves could be precomputed and stored in an SIMD-friendly layout. Since most of the time this is not practical, the original triangle layout is left untouched.
  • 36 or 48 bytes per triangle are used.
  • the more compact vertex-index layout stores all vertices in an array and only three indices per triangle referencing these vertices. This is a widely used layout in 3D modeling applications. Usually this introduces performance penalties due to an indirection in memory access.
  • the described exemplary implementation supports the 36-bytes layout as well as the vertex-index form. Four primitives are efficiently intersected at once using a branch-free SIMD version of the test described in the art.
  • the triangle data has to be collected and swizzled to a state-of-the-art SIMD layout.
  • This caching mechanism also allows the use of vertex-index representation for the triangle data without significant performance loss, as shown in FIG. 15, discussed below.
  • the described BVH has been integrated into two different rendering systems.
  • the first one contains highly optimized implementations of a BIH and a memory-bounded ⁇ d-tree, and uses different unbiased rendering techniques. This allows for a direct comparison of rendering performance on the same hardware and using the same rendering algorithm.
  • the second system is based on instant radiosity and is used for the entry point caching comparison in Section 1.4. All measurements were performed using a single thread on an Intel Core 2 CPU at 2.33GHz. Note that all measurements were performed in a full rendering system where sampling, shading and texture look-up are a significant part of the rendering time and are not sped up by faster ray tracing. Section 1.3.1 and 1.3.2 use the first rendering system to compare the QBVH to the other acceleration structures.
  • This system only supports the 36-byte triangle layout.
  • Two versions of the QBVH are compared. The first one does not employ the SIMD triangle caching from Section 1.2.4 and is called ncQBVH. It uses leaves from 1 to 8 triangles. The second version uses a SIMD cache size of 128 entries with a leaf size of 16 triangles.
  • Section 1.3.3 the second renderer is used to evaluate the speed impact of using vertex- index based triangle representation.
  • FIG. 11 is a table 110 showing the measured rendering performance from invocation of the renderer to the final image for the conference scene, interior scene and power plant scene.
  • the numbers in parentheses are the number of triangles, in millions.
  • MEM is the amount of memory needed by the acceleration structure; ACC is the type of acceleration structure used
  • M-tree (M-tree, BIH, non-cached QBVH. and cached QBVH);
  • ACT is the tree construction time;
  • TTI is the total time to image;
  • RTT is the pure ray tracing time, i.e., the amount of time spent in the ray tracing kernel.
  • FIG. 12 is a table 120 providing detailed statistics about the tree quality and ray tracing behavior of the different acceleration kernels (ACC).
  • Table 120 shows, for the respective kernel and scene, in this order: the total number of inner nodes, the total number of non-empty leaf nodes, the average number of triangles per leaf node, the average number of intersected triangles per ray, the average number of traversed inner nodes per ray, and the average number of intersected leaf nodes per ray.
  • the measurements have been made under the same conditions as in Table 1 10 shown in FIG. 1 1.
  • FIG. 13 is a table 130 that provides a general idea of how effective the use of SIMD instructions is, by comparing packet traversal to the QBVH in the setting for which it has not been designed, i.e., casting coherent primary rays only.
  • table 130 shows a comparison of frames per second for 2x2 SSE packets in a BIH vs. mono ray BIH vs. mono ray QBVH on an Intel Core 2 Duo (480 x 320).
  • flat shading and ray creation is also done using SIMD.
  • Most of the performance benefit of the packets is due to coherent memory access, which is already lost for Interior 2, where QBVH and packet traversal perform about the same. So the QBVH is able to make effective use of SIMD instructions, even without exploiting high coherence for memory access.
  • FIG. 14 is a table 140 showing the performance of each acceleration structure with a prion bounded memory (MEM).
  • MEM prion bounded memory
  • ACC is the type of acceleration structure used.
  • the timings are given for the acceleration structure construction time (ACT), the total time to image (TTI) and the time spend in the ray tracing core (RTT).
  • the /cd-tree has serious problems with tight memory bounds (note that a build of the conference room with 5 MB was impossible to render in acceptable time) while the QBVH reacts especially robustly to limiting the memory.
  • the second Tenderer supports switching between the flat 36-byte triangle representation and a more memory-efficient index-based representation.
  • the indexed representation is more time consuming for ray tracing because an additional indirection has to be performed prior to triangle intersection. This overhead is reduced when the leaf-caching-based QBVH implementation is used.
  • FIG. 15 is a table 150 that compares the impact on rendering times and memory usage resulting from the use of the flat triangle layout with 36 bytes per triangle (36 TTI) and a vertex- index (VI) based representation.
  • TTI is the total time to image
  • MEM is the memory consumption of the triangle data.
  • the resulting QBVH trees are the same as in Section 1.3.1 and both versions use the SIMD triangle layout caching.
  • Visibility rays are an essential part in many rendering algorithms with deterministic connections of points, e.g., in bidirectional path tracing or instant radiosity. They are merely used to determine the mutual visibility of two given points. Any intersection of an interval on a ray with an object is sufficient to determine occlusion. Therefore, three optimizations are: (1) quitting the traversal on the first intersection found; (2) omitting the sorting of the nodes pushed on the stack; and (3) not computing intersection distances, which can simplify object intersection routines. Another important general optimization is to start the traversal deeper in the tree instead of from the root node. If a suitable entry point is known, this can save major parts of the process of traversing down to the first leaf.
  • the presently described approach uses a backtracking method that allows every node of the tree to be used as an entry point and will give correct results.
  • the parent is stored with each node.
  • the SIMD BVH Node still provides some room to store additional information, it is possible to keep the index of the parent node without additional memory requirements.
  • FIG. 16 is a diagram of a portion of a scene 160 illustrating the presently described approach in the setting of a point-light-source-based global illumination renderer.
  • FIG. 16 illustrates spatial coherence of occluders when using point-light-based global illumination.
  • Scene 160 includes a plurality of point light sources, represented by stars 161. Rays 162 connect vertices of interaction along paths that connect light sources 161 and cameras 163. Some of these rays are obstructed by an occluding object 164, comprising a plurality of tessellated triangle primitives. Occlusions 165 occur at the intersections of rays 162 with a visible surface of object 164.
  • the entry point cache is able to quickly find occlusions, since the visibility rays to the same point light source (depicted with a star) exhibit some coherence.
  • the big leaves of the QBVH make a hit even in the same leaf more likely.
  • the performance gain is achieved in a way which is completely transparent to the underlying renderer, no additional complexity is introduced to existing systems (as would be the case for ray packets).
  • Consecutive visibility queries result in occlusion by different but spatially nearby triangles. This fact can be exploited by using the node of the last occlusion as the entry point for the next shadow ray.
  • the tree is then searched for intersecting geometry using a backtracking algorithm: the regular traversal stack is initialized with the children of the entry point node and intersected the usual way. If no occlusion has been found so far, the entry point node is the root node, so there is no special code for this case. Next, if the stack is empty, all children of the parent of the entry point node are pushed to the stack, except the one that has just been processed. For the next iteration, the entry point node is set to its parent. The algorithm terminates when the parent of the root node is about to be processed. So the regular traversal is only extended by one variable and one additional loop. If a leaf node is found as an occluder, its parent node is cached for this point light source.
  • the rendering system uses occlusion queries given by two points, where the first one is the path's hit point and the second one the light source position.
  • occlusion queries given by two points, where the first one is the path's hit point and the second one the light source position.
  • we use a simple unbounded (i.e., hashed) voxel grid and the light position as hash. For each position the last found occlusion entry point is recorded. If no occlusion occurred, the root node is used.
  • a hash table with 4096 entries was used, the scene and divided the scene into (10cm) 3 voxels (assuming the scenes are modeled in the correct scale).
  • This optimization can be transparently incorporated into any existing rendering system which spawns shadow rays with two points to be checked for visibility.
  • FIG.17 is a table 170 of results.
  • Table 170 sets forth a comparison of the number of box intersections and total time to image between the normal QBVH version (N#BBox, NTTI) and the entry point cache version (C#BBox, CTTI).
  • N#BBox, NTTI normal QBVH version
  • C#BBox, CTTI entry point cache version
  • the images at the beginning of this paper show the test scenes used. For the measurement they were rendered only at a resolution of 800x600 pixels with 16 passes and an average of 22 point light sources per pass.
  • Conference 2 is the same conference scene but with the camera placed partly under the table.
  • the caching method could also be used for other acceleration structures. But compared to acceleration structures like the /cd-tree, the BIH or binary BVHs which use many, very densely packed, small nodes, this method works especially well for the QBVH. The additional memory requirement of one more parent index per node would result in 50% more memory for standard 8 bytes M-tree nodes.
  • the approach can also be used for non-shadow rays. Then the procedure cannot benefit from the early-out of shadow rays, i.e., processing must always continue up to the root node. Still it can be beneficial, if the rays expose some kind of coherence, for example primary rays from the eye or coherent reflections. This allows the ray tracing core to transparently exploit implicit coherence by caching previous intersection entry points without changing the interface and without performance loss when incoherent rays are used.
  • some kind of coherence for example primary rays from the eye or coherent reflections. This allows the ray tracing core to transparently exploit implicit coherence by caching previous intersection entry points without changing the interface and without performance loss when incoherent rays are used.
  • the approach may be generalized to higher arity.
  • a multiple of 4 perfectly supports current processor's streaming SIMD extensions (AltiVec, SSE).
  • AltiVec streaming SIMD extensions
  • the implementation on novel architectures like the CELL and GPUs with CTM or CUDA is simplified; the 16-way SIMD path of Intel's Larrabee processor is an obvious candidate for our algorithm and the large node size suits modern cache architectures and memory layouts very well.
  • the tree construction could also be improved. While using the standard binary tree construction for building n-ary BVHs is a simple and practical approach, faster build times could be achieved if an equally good heuristic could be found for multiple split planes at once.
  • the entry point cache could be extended by a better hashing mechanism to accelerate bidirectional light transport algorithms equally well as point light source based ones.
  • axis-aligned bounding boxes are basic techniques to accelerate geometric algorithms as for example ray tracing. It is a known problem that efficiency suffers, if the axis- aligned bounding volume contains major parts of empty space, which, in the case of ray tracing, causes more ray-object-intersection tests than required. The impact of this problem can be reduced by subdividing triangles at the cost of a larger memory footprint. We present a subdivision algorithm that is designed to generate only very few additional triangle references. Compared to previous approaches the algorithm is numerically robust, and simpler to implement and use. For formerly problematic scenes a speedup of up to a factor of 10 could be achieved, while the number of triangle references increased only by 16%.
  • An advantage of a BVH is the small memory footprint compared to a standard ftd-tree, because each object is only referenced once.
  • This new technique measures the tightness of the bounding box of each triangle edge and subdivides the triangle until a certain threshold £ v (see Section 2.2.2) is met.
  • a subdivision technique determines its axis-aligned bounding box. The volume of the largest of the three boxes is compared to a volume threshold £ v . If the volume is larger than the threshold the triangle is subdivided in the middle of this edge, which is easily implemented in a numerically robust manner. The procedure is repeated for the two new triangles until it terminates.
  • FIGS. 18A-D are a series of diagrams illustrating recursive subdivision of a triangle along the longest edges.
  • triangle 180 has three edges 18a, 18b, 18c.
  • An axis-aligned bounding box is determined for each of these edges.
  • the volume of the largest of these bounding boxes is compared with the volume threshold £ v .
  • the threshold has been exceeded with respect to edge 18c.
  • the triangle 180 is then subdivided along broken line 181 by connecting the midpoint of edge 18c to the opposite triangle vertex. The subdivision of triangle 180 results in two triangles, which are subject to the same edge threshold test.
  • the two triangles are further subdivided along broken lines 182 and 183.
  • these two triangles are divided one more time along broken lines 184 and 185. Now, since none of the triangle edges exceed threshold S v , the subdivision process terminates.
  • edge volume heuristic is economical as it only subdivides triangles with very inefficient bounding boxes.
  • the threshold is determined as a fraction of the volume V of the scene bounding box and thus controls the number of triangles generated by subdivision.
  • Section 2.3 the impact of varying t is quantified for different bounding volume hierarchy unfriendly scenes.
  • This simple threshold selection is of course not limited to the edge volume heuristic but can be used (with a different scale) for example for the early split clipping approach.
  • the procedure can be applied either as a pre-process before ray tracing or for on-demand subdivision at the beginning of the tree construction.
  • the first variant is especially useful as it can be used to upgrade any existing ray tracing module without modifying the internals.
  • the second variant is transparent for the user and just produces a different tree during construction.
  • the algorithm For each triangle the algorithm performs the recursive subdivision procedure.
  • the bounding volume hierarchy construction only uses the bounding boxes during construction, it is memory-efficient to output the bounding boxes with the original triangle reference instead and in place of the subdivided triangles.
  • BVH contains more than one triangle per leaf.
  • An additional optimization during tree construction is to remove references to the same triangle in each leaf. In our experiments this resulted in a speedup of about 10%.
  • the scan over the triangles is so efficient, that it even pays off, to have a pass that only counts the number of generated triangles, to allocate memory accordingly, and to scan the data again to generate the bounding boxes.
  • one solution is as follows. Instead of caching the hierarchies for the rigid components, the hierarchies are cached for small parts of the components. Building the top-level bounding volume hierarchy is then slower, however, this is amortized, because the overlap is dramatically reduced resulting in a higher performance of the hierarchy.
  • a Space Ship scene was chosen to represent a kind of worst-case scenario for classic bounding volume hierarchies: It consists of long thin triangles for the outer hull and many small triangles for details. Additionally this object is rotated by 45° in space.
  • a Kitchen scene one frame from the BART animation repository, was selected as a scene where, instead of moving the camera, the triangles are transformed. 4.
  • the Sponza atrium scene is a typical architectural model, where many triangles are parallel to the canonical planes. While classic heuristics like the surface area heuristic can build efficient bounding volume hierarchies for such scenes, performance drops dramatically, if geometry is rotated and bounding boxes increase in volume.
  • FIG. 19 is a table 190 setting forth factors of best, average, and worst surface area reduction resulting from applying the edge volume heuristic for triangle subdivision. The theoretical maximum of 0.25 is achieved in some cases. Even though the factor of the worst area reduction is quite large, the average area reduction shows the effectiveness of the heuristic.
  • a higher threshold parameter t improves performance, but also increases the memory footprint.
  • Both numbers are related in FIG. 20, where graphs 200, 202 and 204 show the relative increase in the number of triangles and corresponding rendering time for a range of threshold parameters /. As expected, render time improves until it asymptotically reaches saturation and the number of triangles increases with an asymptotically exponential behavior. The consistent behavior over quite different scenes clearly shows dramatic performance improvements at already very moderate increase in the number of triangles. A clearly consistent improvement over the test scenes can be observed and it is especially interesting that major performance improvements are obtained at already a moderate increase of the number of triangles.
  • the second test consists of rotating the well known Sponza atrium scene to illustrate the adaptivity of the edge volume heuristic.
  • FIG. 21 is a pair of graphs 210 and 212 illustrating an application of the presently described technique to the Sponza atrium under rotation.
  • graphs 210 and 212 show the performance figures and the number of generated triangle references over the animation for several threshold parameters t.
  • the more the architectural model becomes rotated, the more bounding boxes of previously axis-aligned geometry become inefficient, which is reliably avoided by the subdivision heuristic (threshold parameter t 14).
  • the graphs in the bottom row show how the frame time is improved and how many triangles are added by the subdivision heuristic for the unsubdivided geometry (base) and three subdivision thresholds over the 64 frames of the animation. Again the heuristic proves to be reliable: In simple cases, no triangles are added. When bounding boxes become inefficient a moderate increase in the number of triangle references avoids the dramatic performance drop.
  • Section 2 introduces an economical heuristic to subdivide triangles such that the amount of empty space in bounding boxes is efficiently reduced.
  • the technique is numerically robust, and can be used as a topology unaware preprocess to any renderer. Significant performance improvements already result from very moderate additional memory requirements.
  • FIGS. 22-25 are a series of exemplary code listings for various aspects of the invention described above.
  • a newer version of the acceleration data structure is set forth above in Section 1.2.1 - Memory Layout.
  • FIG. 23 is an exemplary code listing 230 of a tree traversal technique in accordance with an aspect of the present invention, in which one ray is intersected with four bounding boxes.
  • FIG. 24 is an exemplary code listing 240 of a cascaded "if statement in an implementation using cascaded branches based on the ray direction signs and split axes, as described above in Section 1.2.3 - Tree Traversal.
  • FIG. 25 is an exemplary code listing 250 of a technique, in accordance with an aspect of the present invention, for accelerating shadow rays. 4. Digital Processing Environment; Efficient Construction of Acceleration
  • FIGS. 26 and 27A-27B The following is a discussion, to be read in connection with FIGS. 26 and 27A-27B, of typical, relatively conventional digital processing structures and environments in which the above-described invention may be implemented and practiced.
  • the present invention provides methods, systems, devices and computer program products that enable the creation of the appearance of rounded comers and edges and other activities in computer graphics systems, whose output is typically a human-perceptible (or digitally stored and/or transmitted) image or series of images that can comprise, for example, an animated motion picture, computer aided design representation, or other typical computer graphics output.
  • the present invention can thus be implemented as part of the computer software or computer hardware of a computer that forms part of a computer graphics system, along with a display, user interface elements such as a keyboard, tablet and/or mouse, memory, storage, and other conventional computer graphics system components. While conventional components of such kind are well known to those skilled in the art, and thus need not be described in great detail herein, the following overview indicates how the present invention can be implemented in conjunction with such components in a computer graphics system.
  • the present invention can be utilized in the generation and synthesis of images, such as for display in a motion picture or other dynamic display.
  • the techniques described herein can be practiced as part of a computer graphics system, in which a pixel value is generated for pixels in an image.
  • the pixel value is representative of a point in a scene as recorded on an image plane of a simulated camera.
  • the underlying computer graphics system can be configured to generate the pixel value for an image using a selected methodology, such as that of the present invention.
  • FIG. 26 attached hereto depicts an illustrative computer system 300 that can carry out such computer graphics processes.
  • the computer system 300 in one embodiment includes a processor module 301 and operator interface elements comprising operator input components such as a keyboard 302A and/or a mouse 302B (or digitizing tablet or other analogous element(s), generally identified as operator input element(s) 302) and an operator output element such as a video display device 303.
  • the illustrative computer system 300 can be of a conventional stored-program computer architecture.
  • the processor module 301 can include, for example, one or more processor, memory and mass storage devices, such as disk and/or tape storage elements (not separately shown), which perform processing and storage operations in connection with digital data provided thereto.
  • the operator input element(s) 302 can be provided to permit an operator to input information for processing.
  • the video display device 303 can be provided to display output information generated by the processor module 301 on a screen 304 to the operator, including data that the operator may input for processing, information that the operator may input to control processing, as well as information generated during processing.
  • the processor module 301 can generate information for display by the video display device 303 using a so-called “graphical user interface” ("GUI”), in which information for various applications programs is displayed using various "windows.”
  • GUI graphical user interface
  • memory can encompass any computer readable medium, such as a computer hard disk, computer floppy disk, computer-readable flash drive, computer-readable RAM or ROM element or any other known means of encoding digital information.
  • applications programs can encompass any computer program product consisting of computer-readable programs instructions encoded and/or stored on a computer readable medium, whether that medium is fixed or removable, permanent or erasable, or otherwise. As noted, for example, in block 322 of the schematic block diagram of FIG.
  • applications and data can be stored on a disk, in RAM, ROM, on other removable or fixed storage, whether internal or external, and can be downloaded or uploaded, in accordance with practices and techniques well known in the art.
  • the present invention can take the form of software or a computer program product stored on a computer- readable medium, or it can be in the form of computer program code that can be uploaded or downloaded, or fixed in an FPGA, ROM or other electronic structure, or it can take the form of a method or a system for carrying out such a method.
  • the invention is operable to enable a computer or computer system to calculate a pixel value for pixels in an image or scene, and the pixel value can be used by other elements of a computer graphics system, which can be conventional elements such as graphics cards, display controllers, or display elements such as LCDs and/or CRTs, to generate a display-controlling electrical or electronic output, and ultimately to enable the display of an image in a human-perceptible form, and/or the storage of such an image (or data specifying such an image) for later display and/or processing.
  • a computer graphics system which can be conventional elements such as graphics cards, display controllers, or display elements such as LCDs and/or CRTs, to generate a display-controlling electrical or electronic output, and ultimately to enable the display of an image in a human-perceptible form, and/or the storage of such an image (or data specifying such an image) for later display and/or processing.
  • the computer system 300 is shown as comprising particular components, such as the keyboard 302a and mouse 302b for receiving input information from an operator, and a video display device 303 for displaying output information to the operator, it will be appreciated that the computer system 300 may include a variety of components in addition to or instead of those specifically set forth herein.
  • the processor module 301 can include one or more network ports, generally identified by reference numeral 305, which are connected to communication links which connect the computer system 300 in a computer network.
  • the network ports enable the computer system 300 to transmit information to, and receive information from, other computer systems and other devices in the network.
  • certain computer systems in the network are designated as servers, which store data and programs (generally, "information") for processing by the other, client computer systems, thereby to enable the client computer systems to conveniently share the information.
  • a client computer system which needs access to information maintained by a particular server will enable the server to download the information to it over the network. After processing the data, the client computer system may also return the processed data to the server for storage.
  • a network may also include, for example, printers and facsimile devices, digital audio or video storage and distribution devices, and the like, which may be shared among the various computer systems connected in the network.
  • the communication links interconnecting the computer systems in the network may, as is conventional, comprise any convenient information-carrying medium, including wires, optical fibers or other media for carrying signals among the computer systems.
  • Computer systems transfer information over the network by means of messages transferred over the communication links, with each message including information and an identifier identifying the device to receive the message.
  • FIGS. 27A and 27B e.g., network system 300
  • a software application configured in accordance with the invention can operate within, e.g.. a PC 302 like that shown in FIGS. 27A-27B, in which program instructions can be read from ROM or CD- ROM 316 (FIG.
  • FIG. 27B magnetic disk or other storage 320 and loaded into RAM 314 for execution by CPU 318.
  • Data can be input into the system via any known device or means, including a conventional keyboard, scanner, mouse, digitizing tablet, or other elements 303.
  • the depicted storage 320 includes removable storage.
  • applications and data 322 can be located on some or all of fixed or removable storage or ROM, or downloaded.
  • ASICs or other conventional integrated circuit or semiconductor elements can be implemented in such a manner, using the teachings of the present invention as described in greater detail herein, to carry out the methods of the present invention, as discussed herein.
  • method aspects of the present invention can be carried out within commercially available digital processing systems, such as workstations and personal computers (PCs), operating under the collective command of the workstation or PC's operating system and a computer program product configured in accordance with the present invention.
  • computer program product can encompass any set of computer-readable programs instructions encoded on a computer readable medium.
  • a computer readable medium can encompass any form of computer readable element, including, but not limited to, a computer hard disk, computer floppy disk, computer-readable flash drive, computer-readable RAM or ROM element, or any other known means of encoding, storing or providing digital information, whether local to or remote from the workstation, PC or other digital processing device or system.
  • Various forms of computer readable elements and media are well known in the computing arts, and their selection is left to the implementer.
  • the invention is operable to enable a computer system to calculate a pixel value, and the pixel value can be used by hardware elements in the computer system, which can be conventional elements such as graphics cards or display controllers, to generate a display-controlling electronic output.
  • Conventional graphics cards and display controllers are well known in the computing arts are not necessarily part of the present invention, and their selection can be left to the implementer.
  • FIG. 28 is a diagram depicting an overall method 400 in accordance with an aspect of the present invention.
  • the method is practiced in the context of a computer graphics system, in which a pixel value is generated for each pixel in an image. Each generated pixel value is representative of a point in a scene as recorded on an image plane of a simulated camera.
  • the computer graphics system is configured to generate the pixel value for an image using a selected ray-tracing methodology.
  • the selected ray-tracing methodology includes the use of a ray tree that includes at least one ray shot from the pixel into a scene along a selected direction (or from the scene into the pixel), and further includes calculations of the intersections of rays and objects (and/or surfaces of objects) in the scene.
  • bounding volume hierarchies are used to calculate the intersections of rays and surfaces in the scene.
  • a bounding box of a scene is computed.
  • the termination criterion is defined as a condition at which the bounding box coordinates differ only in one unit of resolution from a floating point representation of the ray/surface intersection point.
  • the scope of the present invention extends to other termination criteria.
  • bounding volume hierarchies as an acceleration structure is advantageous for a number of reasons.
  • the memory requirements for bounding volume hierarchies can be linearly bounded in the number of objects to be ray traced.
  • bounding volume hierarchies can be constructed much more efficiently than 3D-trees, which makes them very suitable for an amortized analysis, as required for fully animated scenes.
  • FIG. 29 is a diagram illustrating the "self-intersection" problem.
  • FIG. 29 shows a ray tracing procedure 500, including an image surface 502, an observation point 504, and a light source 506.
  • a series of computations are performed in order to locate rays extending between the observation point 504 and the surface 502:
  • FIG. 29 shows one such ray 508.
  • the calculated ray/surface intersection point 512 due to floating point arithmetic computations on computers, it is sometimes possible for the calculated ray/surface intersection point 512 to be different from the actual intersection point 510. Further, as illustrated in FIG. 29, it is possible for the calculated point 512 to be located on the "wrong" side of the surface 502. In that case, when computations are performed to locate a secondary ray 514 extending from the calculated ray/surface intersection point 512 to the light source 506, these computations indicate that the secondary ray 514 hits the surface 502 at a second intersection point 516 rather than extending directly to the light source 506. thus resulting in an imaging error.
  • An aspect of the invention provides a more precise alternative. After arriving at a calculated ray/surface intersection point 512, the calculated point 512 and the direction of the ray 508 are then used to re-compute an intersection point that is closer to the actual intersection point
  • intersection point This re-computation of the intersection point is incorporated into the ray tracing technique as an iteration that increases precision. If the iteratively computed intersection point turns out to be on the "wrong" side of the surface 502, it is moved to the "correct" side of the surface 502.
  • the iteratively computed intersection point can be moved along the surface normal, or along the axis determined by the longest component of the normal. Instead of using a global floating point e, the point is moved by an integer e to the last bits of the floating point mantissas.
  • the described procedure avoids computations in double precision and has the advantage that it implicitly adapts to the scale of the floating point number, which is determined by its exponent.
  • all secondary rays directly start from these modified points making an eoffset unnecessary.
  • intersection computation it can therefore be assumed that the ray interval of validity to begin at 0 rather than some offset (excluding 0 from the interval, as explained hereinbelow).
  • Modifying the integer representation of the mantissa also avoids numerical problems when intersecting a triangle and a plane in order to decide which points are on what side.
  • Exploiting the convex hull property of convex combinations, intersections of rays and freeform surfaces can be found by refining an axis-aligned bounding box, which contains the point of intersection nearest to the ray origin. This refinement can be continued until the resolution of floating point numbers is reached, i.e., until the bounding box coordinates differ only in one unit of resolution from the floating point representation.
  • the self-intersection problem then is avoided by selecting the bounding box corner that is closest to the surface normal in the center of the bounding box. This comer point then is used to start the secondary ray. This "ray object intersection test" is very efficient and benefits from the avoidance of the self-intersection problem.
  • the triangles are transformed in-place.
  • the new representation encodes degenerate triangles so that the intersection test can handle them without extra effort. It of course is also possible to just prevent degenerate triangles to enter the graphics pipeline.
  • the test first determines the intersection of the ray and the plane of the triangle and then excludes intersections outside the valid interval ]0, result.tfar] on the ray. This is achieved by only one integer test. Note that the +0 is excluded from the valid interval. This is important if denormalized floating point numbers are not supported. If this first determination is successful, the test proceeds by computing the Barycentric coordinates of the intersection. Note that again only an integer test, i.e., more specifically only testing two bits, is required to perform the complete inclusion test. Thus the number of branches is minimal. In order to enable this efficient test, the edges and the normal of the triangle are scaled appropriately in the transformation step.
  • the precision of the test is sufficient to avoid wrong or missed ray intersections. However, during traversal situations may occur in which it is appropriate to extend the triangles for a robust intersection test. This can be done before transforming the triangles. Since the triangles are projected along the axis identified by the longest component of their normal, this projection case has to be stored. This is achieved by counters in the leaf nodes of the acceleration data structure: The triangle references are sorted by the projection case and a leaf contains a byte for the number of triangles in each class.
  • a further aspect of the present invention provides an improved approach for constructing acceleration data structures for ray tracing. Compared with prior software implementations that follow a number of different optimizations, the approach described herein yields significantly flatter trees with superior ray tracing performance.
  • Candidates for splitting planes are given by the coordinates of the triangle vertices inside the axis-aligned bounding box to be partitioned. Note that this includes vertices that actually lie outside the bounding box, but have at least one coordinate that lies in one of the three intervals defined by the bounding box. Out of these candidates, there is selected the plane closest to middle of the longest side of the current axis-aligned bounding box. A further optimization selects only coordinates of triangles whose longest component of the surface normal matches the normal of the potential splitting plane. This procedure yields much flatter trees, since placing splitting planes through the triangle vertices implicitly reduces the number of triangles split by splitting planes. In addition, the surface is approximated tightly and empty space is maximized. If the number of triangles is higher than a specified threshold and there are no more candidates for splitting planes, the box is split in the middle along its longest side. This avoids inefficiencies of other approaches, including the use, for example, of long diagonal objects.
  • the recursive procedure of deciding which triangles belong to the left and right child of a node in the hierarchy has typically required extensive bookkeeping and memory allocation. There is a much simpler approach that only fails in exceptional cases. Only two arrays of references to the objects to be ray traced are allocated. The first array is initialized with the object references. During recursive space partition, a stack of the elements on the left is grown from the beginning of the array, while the elements, which are classified right, are kept on a stack growing from the end of the array towards the middle. In order to be able to quickly restore the elements that are intersecting a split plane, i.e., are both left and right, the second array keeps a stack of them. Thus backtracking is efficient and simple.
  • tree depth is pruned by approximately left-balancing the binary space partition starting from a fixed depth.
  • a global fixed depth parameter can be specified across a vast variety of scenes. This can be understood by observing that after a certain amount of binary space partitions usually there remain connected components that are relatively flat in space.
  • each object to be ray traced is referenced exactly once.
  • no mailbox mechanisms are required to prevent the multiple intersection of an object with a ray during the traversal of the hierarchy.
  • This is a significant advantage from the viewpoint of system performance and makes implementations on a shared memory system much simpler.
  • a second important consequence is that there cannot be more inner nodes in the tree of a bounding volume hierarchy than the total number of objects to be ray-traced.
  • the memory footprint of the acceleration data structure can be linearly bounded in the number of objects before construction. Such an a priori bound is not available for the construction of a 3D-tree, where the memory complexity is expected to increase quadratically with the number of objects to be ray-traced.
  • bounding volume hierarchies that are significantly faster than current 3D-tree ray tracing techniques, and in which the memory requirements grow linearly, rather than expected quadratically, with the number of objects to be ray-traced.
  • the core concept that allows bounding volume hierarchies to outperform 3D-trees is to focus on how space can be partitioned, rather than focusing on the bounding volumes themselves.
  • FIG. 30 is a diagram illustrating the principal data structure 400.
  • FIG. 30 shows an axis-aligned bounding box 600, in elevation view.
  • the left bounding box extends from the left wall 606 of the original bounding box 600 to the L-plane 602.
  • the right bounding box extends from the R-plane 604 to the right wall 608 of the original bounding box 402.
  • the traversal of ray 610 is determined by the positions of intersection with the L- and R-planes 602 and 604 relative to the interval of validity [N, F] 612 of the ray 610.
  • the L- and R-planes 602 and 604 are positioned with respect to each other to partition the set of objects contained within the original bounding box 600. rather than the space contained within the bounding box 600.
  • having two planes offers the possibility of maximizing the empty space between the two planes. Consequently the boundary of the scene can be approximated much faster.
  • FIGS. 31-33 are a series of three-dimensional diagrams further illustrating data structure.
  • FIG. 31 shows a diagram of bounding box 600.
  • virtual objects within bounding box 600 are depicted as abstract circles 614.
  • L-plane 602 and R-plane 604 are then used to partition bounding box 600 into a left bounding box 600a and a right bounding box 600b.
  • the L- and R-planes are selected such that the empty space between them is maximized.
  • Each virtual object 614 ends up in either the left bounding box 600a or the right bounding box 600b.
  • the virtual objects 614 are partitioned into "left" objects 614a and "right” objects 614b.
  • Each of the resulting bounding boxes 600a and 600b are themselves partitioned, and so on, until a termination criterion has been satisfied.
  • FIG. 34 is a flowchart of the described method 700.
  • a bounding box of a scene is computed.
  • parallel L- and R-planes are used to partition the axis-aligned bounding box left and right axis-aligned bounding boxes, which may overlap.
  • the left and right bounding boxes are used to partition the set of virtual objects contained with the original axis-aligned bounding box into a set of left objects and a set of right objects.
  • the left and right objects are processed recursively until a termination criterion is met.
  • two split parameters are stored within a node. Since the number of nodes is linearly bounded by the number of objects to be ray traced, an array of all nodes can be allocated once. Thus, the costly memory management of 3D-trees during construction becomes unnecessary.
  • the construction technique is much simpler than the analog for 3D-tree construction and is easily implemented in a recursive way, or by using an iterative version and a stack.
  • Given a list of objects and an axis-aligned bounding box the L- and R-planes are determined, and the set of objects is determined accordingly.
  • the left and right objects are then processed recursively until some termination criterion is met. Since the number of inner nodes is bounded, it is safe to rely on termination when there is only one object left. It should be noted that the partition only relies on sorting objects along planes that are perpendicular to the x-. y-, and z-axes, which is very efficient and numerically absolutely stable.
  • one technique is to determine a plane M 616 using the 3D-tree construction technique described above and partition the objects such that the overlap of the resulting L-plane and R-plane of the new axis-aligned bounding boxes minimally overlaps the suggested splitting plane M 616.
  • the resulting tree is very similar to the corresponding 3D-tree, however, since the object sets are partitioned rather than space, the resulting tree is much flatter.
  • Another approach is to select the R-plane and L-plane in such a way that the overlap of child boxes is minimal and the empty space is maximized if possible. It should be noted that for some objects axis-aligned bounding boxes are inefficient. An example of such a situation is a long cylinder with small radius on the diagonal of an axis-aligned bounding box.
  • FIG. 35 is a flowchart of a method 800 according to this aspect of the invention.
  • a bounding box of a scene is computed.
  • a 3D-tree construction is executed to determine a splitting plane M.
  • parallel L- and R-planes are used to partition the axis- aligned bounding box into left and right axis-aligned bounding boxes that minimally overlap the splitting plane M.
  • the left and right bounding boxes are used to partition the set of virtual objects contained within the original axis-aligned bounding box into a set of left objects and a set of right objects.
  • the left and right objects are processed recursively until a termination criterion is met.
  • the method 800 illustrated in FIG. 35, as well as the method 400 illustrated in FIG. 28, may be combined with other techniques described herein, including techniques relating to 3D-tree construction, real-time processing, bucket sorting, self- intersection, and the like.
  • 3D-tree the spatial subdivision is continued so as to cut off the empty portions of the space around the object.
  • partitioning such objects into smaller ones results in a similar behavior.
  • a maximum bounding box size is defined. All objects with an extent that exceeds the maximum bounding box size are split into smaller portions to meet the requirement. The maximum allowed size can be found by scanning the data set for the minimal extent among all objects.
  • the data structure described herein allows the transfer of the principles of fast 3D-tree traversal to bounding volume hierarchies.
  • the cases of traversal are similar: (1) only the left child; (2) only the right child; (3) the left child and then the right child; (4) the right child and then the left child; or (5) the ray is between split planes (i.e.. empty space). Since one node in the described technique is split by two parallel planes, the order of how to traverse the boxes is determined by the ray direction.
  • Previous bounding volume hierarchy techniques could not efficiently determine the order of how to traverse the child nodes or required additional effort, such as updating a heap data structure.
  • a whole bounding volume had to be loaded and tested against the ray, while the present approach only requires the two plane distances.
  • Checking the ray against the two planes in software seems to be more expensive, however.
  • the traversal is the bottle neck in 3D- trees, and doing some more computation here better hides the latencies of memory access.
  • the bounding volume hierarchy trees tend to be much smaller than corresponding 3D-trees of same performance.
  • the described bounding volume hierarchy also can be applied to efficiently find ray freeform surface intersections by subdividing the freeform surface. Doing so allows the intersection of a freeform surface with a convex hull property and a subdivision algorithm efficiently to be computed up to floating point precision, depending on the actual floating point arithmetic.
  • a subdivision step is performed, for example, for polynomial surfaces, rational surfaces, and approximating subdivision surfaces. For each axis in space the possibly overlapping bounding boxes are determined as discussed above. In case of a binary subdivision, the intersection of the L-boxes and the intersection of the R-boxes for new bounding boxes of the new meshes. Now the above-described traversal can be efficiently performed, since the spatial order of the boxes is known.
  • regular grids as an acceleration data structure in ray tracing is simple, but efficiency suffers from a lack of spatial adaptivity and the subsequent traversal of many empty grid cells.
  • Hierarchical regular grids can improve on the situation, but still are inferior as compared to bounding volume hierarchies and 3D-trees.
  • regular grids can be used to improve on the construction speed of acceleration data structures.
  • the technique for constructing the acceleration data structures are similar to quick sorting and are expected to run in 0 ⁇ n log «). An improvement can be obtained by applying bucket sorting, which runs in linear time. Therefore the axis-aligned bounding box of the objects is partitioned into n x x n y x n- axis-aligned boxes.
  • Each object then is sorted into exactly one of these boxes by one selected point, e.g.. the center of gravity or the first vertex of each triangle could be used. Then the actual axis-aligned bounding box of the objects in each grid cell is determined. These axis-aligned bounding boxes are used instead of the objects they contain as long as the box does not intersect one of the division planes. In that case the box is unpacked and instead the objects in the box will be used directly. This procedure saves a lot of comparisons and memory accesses, noticeably improves the constant of the order of the construction techniques, and also can be applied recursively. The above technique is especially appealing to hardware implementations, since it can be realized by processing a stream of objects.
  • the acceleration data structures can be built on demand, i.e., at the time when a ray is traversing a specific axis-aligned bounding box with its objects. Then on the one hand the acceleration data structure never becomes refined in regions of space, which are invisible to the rays, and caches are not polluted by data that is never touched. On the other hand after refinement the objects possibly intersected by a ray are already in the caches.
  • the present invention addresses long known issues in ray tracing and provides techniques for ray tracing having improved precision, overall speed and memory footprint of the acceleration data structures.
  • the improvements in numerical precision transfer to other number systems as well as, for example, to the logarithmic number system used in the hardware of the ART ray tracing chips.
  • the specific implementation of the IEEE floating point standard on a processor or a dedicated hardware can severely influence performance. For example, on a Pentium 4 chip denormalized numbers can degrade performance by a factor of 100 and more. As discussed above, an implementation of the invention avoids these exceptions.
  • the view of bounding volume hierarchies described herein makes them suited for realtime ray tracing.
  • the described techniques outperform the previous state of the art, thus allowing more precise techniques to be used, for example, for computing motion blur in fully animated scene, as in a production setting or the like. It will be apparent from the above discussion that the described bounding volume hierarchies have significant advantages when compared with 3D-trees and other techniques, particularly in hardware implementations and for huge scenes. In an amortized analysis, the described bounding volume hierarchies outperform current 3D-trees by at least a factor of two. In addition, the memory footprint can be determined beforehand and is linear in the number of objects. 5. Overall Techniques
  • FIG. 36 is a diagram of an exemplary processing system 1000 including a plurality of processing modules 1001-1005 in accordance with various aspects of the invention described above:
  • Module 1001 Module to subdivide (e.g., tessellate) geometry or bounding boxes in a numerically robust way.
  • Module 1002 Module to build "classic" acceleration data structure for ray tracing from previous module 1001.
  • Module 1003 Module to optionally/potentially use precomputed parts of a hierarchy to speed up the process for animated geometry.
  • Module 1004 Module to increase arity of data structure by merging subsequent levels of the hierarchy.
  • Module 1005 Module to traverse nodes using SIMD instructions for either rays or packets of rays.
  • FIG. 37 is a flowchart of an exemplary general technique 1100 according to various aspects of the invention described above:
  • Box 1101 Subdivide (e.g., tessellate) geometry or bounding boxes in a numerically robust way.
  • Box 1102 Build "classic" acceleration data structure for ray tracing from previous box 1 101.
  • Box 1103 Optionally/potentially use precomputed parts of a hierarchy to speed up the process for animated geometry.
  • Box 1104 Increase arity of data structure by merging subsequent levels of the hierarchy.
  • Box 1105 Traverse nodes using SIMD instructions for either rays or packets of rays. It should be noted that FIGS. 36 and 37 are intended to be exemplary, rather than limiting.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Generation (AREA)
  • Image Analysis (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

La présente invention concerne des procédés, des systèmes, des dispositifs et des produits de codes de programmes informatiques (logiciels) qui permettent une accélération du lancer de rayon à l'aide de structures de données d'accélération avec haute arité afin de permettre le traitement de nœuds à l'aide d'instructions SIMD en continu (instructions unique, données multiples) avec un encombrement mémoire réduit.
PCT/IB2008/003402 2007-11-15 2008-11-17 Hiérarchies de volumes englobants superficielles pour un lancer de rayon accéléré Ceased WO2009063319A2 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US98833707P 2007-11-15 2007-11-15
US60/988,337 2007-11-15
US11252908P 2008-11-07 2008-11-07
US61/112,529 2008-11-07

Publications (2)

Publication Number Publication Date
WO2009063319A2 true WO2009063319A2 (fr) 2009-05-22
WO2009063319A3 WO2009063319A3 (fr) 2009-08-06

Family

ID=40639240

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/003402 Ceased WO2009063319A2 (fr) 2007-11-15 2008-11-17 Hiérarchies de volumes englobants superficielles pour un lancer de rayon accéléré

Country Status (1)

Country Link
WO (1) WO2009063319A2 (fr)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012076778A1 (fr) 2010-12-10 2012-06-14 Real Fusio France Procédé de rendu d'images à partir d'une scène virtuelle en trois dimensions
WO2014068400A3 (fr) * 2012-11-02 2014-10-30 Imagination Technologies, Ltd. Géométrie à la demande et création de structure d'accélération
US9430863B1 (en) * 2011-10-27 2016-08-30 Nvidia Corporation System, method, and computer program product for constructing a hierarchical acceleration data structure that supports ray tracing of motion blur
WO2017139500A1 (fr) * 2016-02-11 2017-08-17 Hue As Système et procédé pour manipuler des structures d'accélération
CN109087384A (zh) * 2017-06-14 2018-12-25 想象技术有限公司 光线跟踪系统中经压缩的光线方向数据
CN109215106A (zh) * 2018-08-30 2019-01-15 东北大学 一种基于动态场景的实时光线追踪加速结构的方法
US10699370B1 (en) 2018-12-28 2020-06-30 Intel Corporation Apparatus and method for a compressed stack representation for hierarchical acceleration structures of arbitrary widths
US10719974B1 (en) 2018-12-28 2020-07-21 Intel Corporation Apparatus and method for efficiently storing ray traversal data
CN111788608A (zh) * 2018-02-22 2020-10-16 微软技术许可有限责任公司 用于建模光反射的混合射线跟踪方法
CN113192176A (zh) * 2021-04-14 2021-07-30 西安理工大学 一种变密度3d打印填充路径的生成方法
US11170254B2 (en) 2017-09-07 2021-11-09 Aurora Innovation, Inc. Method for image analysis
WO2022093583A1 (fr) 2020-10-29 2022-05-05 Advanced Micro Devices, Inc. Génération de hiérarchie de volumes englobants
US11334762B1 (en) 2017-09-07 2022-05-17 Aurora Operations, Inc. Method for image analysis
EP4345757A3 (fr) * 2022-09-27 2024-04-17 Imagination Technologies Limited Lancer de rayon
GB2617219B (en) * 2022-09-27 2025-01-01 Imagination Tech Ltd Ray tracing
CN119375862A (zh) * 2024-10-10 2025-01-28 北京理工大学 一种基于gpu和bvh结构的脉冲激光近程探测仿真方法
GB2632614A (en) * 2022-09-27 2025-02-12 Imagination Tech Ltd Ray tracing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2008249B1 (fr) * 2006-04-19 2013-07-17 NVIDIA Corporation Maquette de rayons instantanes

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE112011104332T5 (de) 2010-12-10 2013-09-12 Real Fusio France Verfahren zur dreidimensionalen Bildwiedergabe einer virtuellen Szene
WO2012076778A1 (fr) 2010-12-10 2012-06-14 Real Fusio France Procédé de rendu d'images à partir d'une scène virtuelle en trois dimensions
US9430863B1 (en) * 2011-10-27 2016-08-30 Nvidia Corporation System, method, and computer program product for constructing a hierarchical acceleration data structure that supports ray tracing of motion blur
CN104885123B (zh) * 2012-11-02 2018-01-09 想象技术有限公司 用于图形渲染的几何图形处理方法和图形渲染系统
CN104885123A (zh) * 2012-11-02 2015-09-02 想象技术有限公司 按需的几何图形和加速结构形成
US10943386B2 (en) 2012-11-02 2021-03-09 Imagination Technologies Limited On demand geometry and acceleration structure creation with tile object lists
US12211136B2 (en) 2012-11-02 2025-01-28 Imagination Technologies Limited On demand geometry and acceleration structure creation with tile object lists
US10186070B2 (en) 2012-11-02 2019-01-22 Imagination Technologies Limited On demand geometry and acceleration structure creation
US10242487B2 (en) 2012-11-02 2019-03-26 Imagination Technologies Limited On demand geometry and acceleration structure creation
US10339696B2 (en) 2012-11-02 2019-07-02 Imagination Technologies Limited On demand geometry and acceleration structure creation with discrete production scheduling
US11568592B2 (en) 2012-11-02 2023-01-31 Imagination Technologies Limited On demand geometry and acceleration structure creation with tile object lists
WO2014068400A3 (fr) * 2012-11-02 2014-10-30 Imagination Technologies, Ltd. Géométrie à la demande et création de structure d'accélération
WO2017139500A1 (fr) * 2016-02-11 2017-08-17 Hue As Système et procédé pour manipuler des structures d'accélération
CN109087384A (zh) * 2017-06-14 2018-12-25 想象技术有限公司 光线跟踪系统中经压缩的光线方向数据
CN109087384B (zh) * 2017-06-14 2023-10-24 想象技术有限公司 光线跟踪系统和方法以及光线压缩方法和模块
US11170254B2 (en) 2017-09-07 2021-11-09 Aurora Innovation, Inc. Method for image analysis
US12056209B2 (en) 2017-09-07 2024-08-06 Aurora Operations, Inc Method for image analysis
US11748446B2 (en) 2017-09-07 2023-09-05 Aurora Operations, Inc. Method for image analysis
US11334762B1 (en) 2017-09-07 2022-05-17 Aurora Operations, Inc. Method for image analysis
CN111788608A (zh) * 2018-02-22 2020-10-16 微软技术许可有限责任公司 用于建模光反射的混合射线跟踪方法
CN109215106B (zh) * 2018-08-30 2023-01-03 东北大学 一种基于动态场景的实时光线追踪加速结构的方法
CN109215106A (zh) * 2018-08-30 2019-01-15 东北大学 一种基于动态场景的实时光线追踪加速结构的方法
US11189076B2 (en) 2018-12-28 2021-11-30 Intel Corporation Apparatus and method for efficiently storing ray traversal data
EP3675057A1 (fr) * 2018-12-28 2020-07-01 INTEL Corporation Appareil et procédé de représentation d'un empilement comprimé pour des structures d'accélération hiérarchique de largeurs arbitraires
US10699370B1 (en) 2018-12-28 2020-06-30 Intel Corporation Apparatus and method for a compressed stack representation for hierarchical acceleration structures of arbitrary widths
US10719974B1 (en) 2018-12-28 2020-07-21 Intel Corporation Apparatus and method for efficiently storing ray traversal data
US11887243B2 (en) 2018-12-28 2024-01-30 Intel Corporation Apparatus and method for efficiently storing ray traversal data
WO2022093583A1 (fr) 2020-10-29 2022-05-05 Advanced Micro Devices, Inc. Génération de hiérarchie de volumes englobants
EP4238060A4 (fr) * 2020-10-29 2024-09-25 Advanced Micro Devices, Inc. Génération de hiérarchie de volumes englobants
CN113192176B (zh) * 2021-04-14 2023-11-28 西安理工大学 一种变密度3d打印填充路径的生成方法
CN113192176A (zh) * 2021-04-14 2021-07-30 西安理工大学 一种变密度3d打印填充路径的生成方法
EP4345757A3 (fr) * 2022-09-27 2024-04-17 Imagination Technologies Limited Lancer de rayon
GB2617219B (en) * 2022-09-27 2025-01-01 Imagination Tech Ltd Ray tracing
GB2632614A (en) * 2022-09-27 2025-02-12 Imagination Tech Ltd Ray tracing
US12555304B2 (en) 2022-09-27 2026-02-17 Imagination Technologies Limited Ray tracing using indications of re-entry points in a hierarchical acceleration structure
CN119375862A (zh) * 2024-10-10 2025-01-28 北京理工大学 一种基于gpu和bvh结构的脉冲激光近程探测仿真方法

Also Published As

Publication number Publication date
WO2009063319A3 (fr) 2009-08-06

Similar Documents

Publication Publication Date Title
US8188997B2 (en) Accelerated ray tracing using shallow bounding volume hierarchies
US8188996B2 (en) Shallow bounding volume hierarchies for accelerated ray tracing
WO2009063319A2 (fr) Hiérarchies de volumes englobants superficielles pour un lancer de rayon accéléré
US8411088B2 (en) Accelerated ray tracing
US7495664B2 (en) Instant ray tracing
US7659894B2 (en) Terminating spatial partition hierarchies by a priori bounding memory
US7952583B2 (en) Quasi-monte carlo light transport simulation by efficient ray tracing
US20250285358A1 (en) Hardware acceleration for ray tracing primitives that share vertices
EP2008249B1 (fr) Maquette de rayons instantanes
Woop et al. B-kd trees for hardware accelerated ray tracing of dynamic scenes
US11069124B2 (en) Systems and methods for reducing rendering latency
US11138782B2 (en) Systems and methods for rendering optical distortion effects
WO2023043993A1 (fr) Micro-maillages déplacés pour lancer de rayons
CN103765481B (zh) 用于3-d场景加速结构创建和更新的系统和方法
US10699467B2 (en) Computer-graphics based on hierarchical ray casting
Ernst et al. Early split clipping for bounding volume hierarchies
WO2009044282A2 (fr) Simulation de la propagation de la lumière par la méthode quasi-monte carlo utilisant un tracé des rayons efficace
US10553012B2 (en) Systems and methods for rendering foveated effects
US20260045025A1 (en) The efficiency of ray-box tests
WO2008091958A2 (fr) Terminaison de hiérarchies de partitions spatiales par délimitation a priori de mémoire
CN120129927A (zh) 使用场景分割渲染视频图形的方法和系统
Ji et al. View-dependent refinement of multiresolution meshes using programmable graphics hardware
Schneider et al. GPU-based euclidean distance transforms and their application to volume rendering
Marmitt et al. Recent advancements in ray-tracing based volume rendering techniques
Trapp et al. Geometry Batching using Texture-arrays.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08849903

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08849903

Country of ref document: EP

Kind code of ref document: A2