Search code examples
c++algorithmoptimizationmeshnormals

How to optimize mesh normals calculation?


I am making an application for real time mesh deformation and I need to calculate normals a huge number of times. Now the problem is I found with some profiling this bit of code is taking largest cpu time so how can I possibly optimize this?

void Mesh::RecalculateNormals()
{
        for (int i = 0; i < indexCount; i += 3)
        {
            const int ia = indices[i];
            const int ib = indices[i + 1];
            const int ic = indices[i + 2];

            const glm::vec3 e1 = glm::vec3(vert[ia].position) - glm::vec3(vert[ib].position);
            const glm::vec3 e2 = glm::vec3(vert[ic].position) - glm::vec3(vert[ib].position);
            const glm::vec3 no = cross(e1, e2);

            vert[ia].normal += glm::vec4(no, 0.0);
            vert[ib].normal += glm::vec4(no, 0.0);
            vert[ic].normal += glm::vec4(no, 0.0);
        }

        for (int i = 0; i < vertexCount; i++)
            vert[i].normal = glm::vec4(glm::normalize(glm::vec3(vert[i].normal)), 0.0f);

}

Also before this function is called I must loop through all vertices and clewr the previous normals by setting normals to vec3(0).

How can I speed this up? Could there be a better algorithm? Or is GLM slow?


Solution

  • There is three main ways to optimize this code: using SIMD instructions, using multiple threads and working on the memory access pattern. None of them is a silver-bullet.

    Using SIMD instructions is not trivial in your case due to the indices-based indirect data-dependent read in memory in the first loop. That being said, recent SIMD instruction sets like AVX-2/AVX-512 on x86-64 processors and SVE on ARM ones provides instructions to perform gather loads. Once the values are loaded in SIMD registers, you can compute the cross product very quickly. The thing is the last indices-based indirect data-dependent stores in the first loop require scatter store instruction which are only available on very recent processors supporting AVX-512 (x86-64) and SVE (ARM). You can extract the values from SIMD registers so to store them serially but this will certainly quite inefficient. Hopefully, the second loop can be much more easily vectorized but you certainly need to reorder the normal data structure in a way that is more SIMD-friendly (see AoS vs SoA and Data-oriented design). In the end, I expect the SIMD implementation not much faster for the first loop as long as scater instructions are not used, and certainly significantly faster for the second one. Actually, I expect the SIMD implementation of the first loop not to be a lot faster even with gather/scatter instructions since such instructions tend to be quite inefficiently implemented and also because I expect the first loop of your code to cause a lot of cache misses (see the next sections).

    Using multiple thread is also not trivial. Indeed, while the first part of the first loop can be trivially parallelized, the second part performing the accumulation of the normal cannot. One solution is to use atomic adds. Another solution is to use a per-thread array of normal vectors. A faster solution is to use partitioning methods so to split the mesh in independent parts (each threads can safely perform the accumulation, except for possibly-shared vect items). Note that efficiently partitioning a mesh so to balance the work between many threads is far from being simple and AFAIK it has been the subject of many past research papers. The best solution is very dependent of the number of threads, the size of the overall vert buffer in memory and your performance/complexity requirements. The second loop can be trivially parallelized. The simplest solution to parallelize the loops is to use OpenMP (few pragma are enough to parallelize the loops although an efficient implementation can be quite complex).

    Regarding the input data, the first loop can be very inefficient. Indeed, ia, ib and ic can refer to very different indices causing predictable vert[...] loads/stores in memory. If the structure is big, the loads/stores will cause cache misses. Such cache misses can be very slow is the data structure does not fit in RAM due to the huge latency of the RAM. The best solution to fix this problem is to improve the locality of the memory accesses. Reordering vert items and/or indices can help a lot to improve locality and so performance. Again, partitioning methods can be used to do that. A naive first start is to sort vert so that ia is sorted while updating ib and ic indices so they are still valid. This can be computed using key-value arg-sort. If the mesh is dynamic, the problem becomes very complex to solve efficiently. Octrees and k-D trees could help to improve the locality of the computation without introducing a (too) big overhead.

    Note that you may not need to recompute all the normals regarding the previous operations. If so, you can track the one that needs to be recomputed and only perform an incremental update.

    Finally, please check compilers optimizations are enabled. Moreover, note that you can use the (unsafe) fash-math flags so to improve the auto-vectorization of your code. You should also check the SIMD instruction set used and that the glm calls are inlined by the compiler.