Order matters

Well, we've decided to be pragmatic, so let's take the bull by the horns. Where should optimization begin? I know, all of us like to dive into assembly and do all those geeky things. Nevertheless, there is plenty to do before we get there.

I want to share the steps I walk through while doing optimization, regardless of the problem size. Whether it is a full pipeline or a single computational kernel, you can use the proposed approach. These steps scale quite well.

1. Understand the code

Before starting optimization you should understand the code. What is the input? What is the output? What operation or operations are performed? Most of the "magic" in programming starts with the phrase: "Hm... This code works... somehow." Keep calm and understand it. Optimizing code you don't understand is like walking through a minefield: every change looks dangerous, and the results of the optimizations seem unpredictable.

2. Use an appropriate algorithm

Once you know the code well enough, it is time to think about the algorithm itself. Is it an optimal solution for the problem your code should solve? Check whether there are better algorithms with lower computational complexity.

The classical example here is sorting. There are many ways to sort things, from comparison sorts such as quicksort to histogram-based approaches such as radix sort. They differ in the number of operations per data element. Depending on the nature of the algorithm, the data, and the data distribution, your algorithmic choices will vary.
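
To make the difference concrete, here is a minimal sketch (my own toy example, not taken from any particular library) of a histogram-based sort for 8-bit keys. It touches every element a constant number of times, while a comparison sort needs on the order of n log n operations.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Counting sort for 8-bit keys: O(n + 256) work instead of O(n log n).
    void counting_sort_u8(std::vector<uint8_t>& data)
    {
        std::array<size_t, 256> hist{};
        for (uint8_t v : data)
            ++hist[v];

        size_t out = 0;
        for (size_t v = 0; v < 256; ++v)
            for (size_t c = 0; c < hist[v]; ++c)
                data[out++] = static_cast<uint8_t>(v);
    }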

When choosing an algorithm you should consider many trade-offs. Notably, one of the most important factors to think about is optimization opportunities. In particular, if more than one core is available, you should consider the possibility of parallelization. Furthermore, if SIMD optimization is planned, prefer algorithms whose critical loops can be expressed as simple, vectorizable code, as illustrated below.
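
As an illustration (a toy example of mine, not taken from any real kernel), the first loop below has independent iterations and maps directly onto SIMD lanes or multiple cores, while the second carries a dependency through the accumulator from one iteration to the next and resists straightforward vectorization.

    #include <cstddef>

    // Independent iterations: easy to vectorize and to split across cores.
    void add_arrays(const float* src0, const float* src1, float* dst, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src0[i] + src1[i];
    }

    // Loop-carried dependency: every iteration needs the previous result,
    // so neither SIMD nor extra cores help it directly.
    float first_order_filter(const float* src, size_t n, float k)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i)
            acc = acc * k + src[i];
        return acc;
    }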

3. Optimize memory access patterns

It is all about memory! Interestingly enough, almost all algorithms are memory bound. When it comes to a function that is more complicated than a streaming kernel, we may find that its access pattern is far from ideal, whereas other algorithms simply need a huge amount of data to process.

I practice multiple optimization technologies, and one of them is CUDA (which is used to optimize general-purpose code on GPUs). GPUs have a very interesting architecture: they are absolutely unbalanced in terms of memory latency relative to arithmetic latency. You know what, I cannot stand technical reports and papers with phrases like "almost theoretical throughput is achieved" and not a single word about how many of those bytes are actually needed. Such a statement about throughput tells me only that the application burns power pushing bytes over the bus.

You should be aware of your memory access patterns. It is crucial to load only the data that is really needed for the computation. Usually this means some data restructuring, loop reorganization, and so on.
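
A simple illustration (again my own toy example): the same reduction written with two different loop orders over a row-major image. The first version jumps by a whole row between consecutive accesses and touches a new cache line almost every iteration; the second streams through memory in order.

    #include <cstddef>

    // Column-by-column traversal of a row-major image: a stride of `width`
    // floats between consecutive accesses.
    float sum_column_order(const float* img, size_t width, size_t height)
    {
        float sum = 0.0f;
        for (size_t x = 0; x < width; ++x)
            for (size_t y = 0; y < height; ++y)
                sum += img[y * width + x];
        return sum;
    }

    // Row-by-row traversal: contiguous, cache-friendly accesses.
    float sum_row_order(const float* img, size_t width, size_t height)
    {
        float sum = 0.0f;
        for (size_t y = 0; y < height; ++y)
            for (size_t x = 0; x < width; ++x)
                sum += img[y * width + x];
        return sum;
    }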

4. Minimize the number of operations

Quite frequently, at this step you first need to sort out problems that come from compiler limitations. For instance, if the compiler fails at register allocation, you get extra stack operations or register moves. Clearly these operations are the first candidates for removal, since the algorithm does not require them. Removing them usually means you need to "rephrase" the ideas expressed in your code.

In particular, there are two types of vector registers in NEON (the SIMD extension of ARM processors): 64-bit D registers and 128-bit Q registers. Many, but not all, operations can be performed on both types of registers. For example, you can add two D registers with the instruction vadd.u8 d0, d1, d2 as well as two Q registers with the same instruction, vadd.u8 q0, q1, q2. All the instructions are wrapped by C/C++ intrinsics, where operations on short and long vectors are distinguished by the postfix q, as in vadd versus vaddq. The set of intrinsics is complete, so there are intrinsics that do not correspond to a single assembly instruction, such as vst3q for the ARMv7 ISA. This operation performs a 3-way register interleaving before storing the result, which is extremely useful when decoding into RGB images. Recently I've done a NEON optimization for color decoding from packed YUV to the RGB color space using this intrinsic. Here is an assembly snippet generated from a single vst3q intrinsic, followed by a snippet generated from two D-register vst3 intrinsics.


    vmov    d18, d4  @ v8qi
    vmov    d17, d30  @ v8qi
    vmov    d19, d2  @ v8qi
    vmov    d21, d22  @ v8qi
    vst3.8  {d16, d18, d20}, [r1]
    vst3.8  {d17, d19, d21}, [r0]

    vst3.8  {d16-d18}, [r1]
    vst3.8  {d10-d12}, [r0]

As you can see, in both cases two D-register variants of the store instruction are generated, but in the first case the compiler struggled with additional dependencies between registers, because the long (Q-register) instruction variant requires a stricter register placement.
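
For reference, here is a minimal sketch of how such an intrinsic is typically used (this is not the original YUV-to-RGB kernel, just an illustrative fragment of mine): three separate planes are interleaved into a packed buffer, 16 pixels at a time.

    #include <arm_neon.h>

    // Interleave separate R, G, B planes into packed RGB, 16 pixels per step.
    void interleave_rgb(const uint8_t* r, const uint8_t* g, const uint8_t* b,
                        uint8_t* rgb, int pixels)
    {
        for (int i = 0; i + 16 <= pixels; i += 16)
        {
            uint8x16x3_t v;
            v.val[0] = vld1q_u8(r + i);
            v.val[1] = vld1q_u8(g + i);
            v.val[2] = vld1q_u8(b + i);
            vst3q_u8(rgb + i * 3, v);   // 3-way interleaving store
        }
    }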

After all extra operations are eliminated, you can move on to code restructuring in order to minimize actual arithmetic.
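
A trivial example of what that restructuring can look like (my own toy example): a loop-invariant division is hoisted out of the loop and replaced with a multiplication by the reciprocal, leaving one cheap operation per element instead of one expensive one (note that this slightly changes the rounding behaviour).

    #include <cstddef>

    // Before: one division per element.
    void scale_naive(const float* src, float* dst, size_t n, float scale)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i] / scale;
    }

    // After: one division in total, one multiplication per element.
    void scale_hoisted(const float* src, float* dst, size_t n, float scale)
    {
        const float inv_scale = 1.0f / scale;
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i] * inv_scale;
    }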

Over and above, reducing the number of operations is one of the reasons why loop unrolling pays off on modern out-of-order architectures. In essence, you shrink the number of operations per data element. Of course, it's not about minimizing the number of loop counter increments. It's about:

5. Shrink the critical path

This step is connected to the previous one. As a matter of fact, fewer operations take less time to process. However, this is not a reliable rule of thumb on most modern architectures, because they are typically deeply pipelined, so multiple operations may be executed on different pipes. Moreover, the depth varies from pipe to pipe, so different operations have different execution latencies. As a result, this step is more about optimizing the critical path (the longest chain of dependent instructions) than about plain code reduction.
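
A classic toy illustration of the difference (my own example): both reductions below perform essentially the same number of additions, but the first forms a single chain in which every addition waits for the previous result, while the second keeps four independent chains that a pipelined execution unit can overlap.

    #include <cstddef>

    // One long dependency chain: every add waits for the previous one.
    float sum_single_chain(const float* src, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i)
            acc += src[i];
        return acc;
    }

    // Four independent chains: the critical path is roughly four times shorter.
    float sum_four_chains(const float* src, size_t n)
    {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
        {
            a0 += src[i + 0];
            a1 += src[i + 1];
            a2 += src[i + 2];
            a3 += src[i + 3];
        }
        for (; i < n; ++i)
            a0 += src[i];
        return (a0 + a1) + (a2 + a3);
    }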

Finding the critical path is a tough task. As a specific example, I've seen many attempts to optimize pointer increments in SIMD-optimized loops. Let's look at the NEON ISA again. There are load and store operations that support post-increment (incrementing the pointer passed to the instruction by the number of bytes read or written). In the ARMv8 ISA, such a load operation looks like this: ld1 {v0.16b - v3.16b}, [%[sptr0]], #64. This feature looks very attractive, since you can shrink the looping overhead.

However, most ARM compilers I have seen generate post-incremented memory operations only in a very limited number of cases, so people try to force the compiler to emit them and end up writing assembly. Yet in most cases the execution time of the loop remains unchanged. This is because for most NEON-optimized loops scalar arithmetic is not a bottleneck: it is not part of the critical path. Pointer arithmetic, along with other scalar operations, can be executed in parallel with the vector operations. It runs on a dedicated integer pipe and is among the lowest-latency instructions. On top of that, its results can be forwarded directly to the load/store pipe, so the latency is reduced even further. Post-increment may help only for loops where pointer arithmetic dominates over SIMD instructions, for example some data repackaging loops where a bunch of pointers are read or written. A sketch of the common case follows.
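
Here is a minimal sketch (my own example, not one of the loops mentioned above) of the typical situation: the per-iteration pointer and counter arithmetic is plain scalar code, yet it does not lengthen the critical path, because the integer additions run in parallel with the vector loads, adds, and stores.

    #include <arm_neon.h>
    #include <stddef.h>

    // Element-wise add of two byte arrays, 16 elements per iteration.
    // The scalar `i += 16` and address computations overlap with the vector work.
    void add_bytes(const uint8_t* a, const uint8_t* b, uint8_t* dst, size_t n)
    {
        for (size_t i = 0; i + 16 <= n; i += 16)
        {
            uint8x16_t va = vld1q_u8(a + i);
            uint8x16_t vb = vld1q_u8(b + i);
            vst1q_u8(dst + i, vaddq_u8(va, vb));
        }
    }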

6. Perform hardware-specific optimizations

This is not about using x >> 1 instead of x / 2; a compiler can certainly figure that out. I am talking about optimizations that require a deep understanding of the hardware as well as of the supported instruction set.

One example here again comes from CUDA. GPUs support atomic operations in hardware. They are performed in the L2 cache (which is shared between the separate units), just like normal writes, and one atomic operation per cycle can be scheduled by each unit (I won't go into CUDA terminology, to keep the example understandable to a wide audience). The performance of atomics is data-dependent: everything is fast as long as different cache lines are accessed. To cut a long story short, an atomic is often the better way to implement an operation like the following.


    ptr[index] += value;

The reason is simple. This line of code contains three operations: load, addition, and store, each depending on the previous one. When we issue an atomic instead, the data is needed only in L2, not in registers, so the dependency path is shorter. This trick gave me a 15% improvement over the code in the listing for a streaming kernel on a Kepler GPU.
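
In code, the change boils down to the line below (CUDA C++; a hedged sketch of the idea rather than the exact kernel discussed above):

    // Inside a CUDA kernel: replace the explicit load-add-store with an atomic.
    // The read-modify-write is resolved in L2, off the register-level dependency chain.
    atomicAdd(&ptr[index], value);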

7. Dive into assembly

Yep, you've done everything you could to accelerate your code. Before signing a contract with the devil, let's dive into assembly. Personally, I tend to read the disassembly during steps 3 to 5. As for writing assembly, the most common scenarios where it pays off are working around compiler bugs and using instructions that are not exposed at a higher level.

For example, in NEON there is a 256-bit load instruction that is available only in raw assembly: vld1 {d0, d1, d2, d3}. Using this instruction will not make your code twice as fast (its latency is higher as well), but it can help. Consequently, when it helps, I use assembly. I have never seen a compiler generate it, even for code like the following.


    uint8_t* ptr = /* ... */;
    uint8x16_t val0 = vld1q_u8(ptr +  0);
    uint8x16_t val1 = vld1q_u8(ptr + 16);

It is clear that the likelihood of these pointers aliasing is zero, but the compiler does not merge the two loads into a single instruction.

When writing assembly you should always consider many trade-offs, like code portability, readability, and so on. So, think twice.

Final words

As you can see, this approach works top-down: it goes from the high level to the low level. It works because the higher the level at which we apply an optimization, the more effect it brings. An improperly chosen algorithm may never let you reach the target performance, useless loads are still useless even if you perform them with wide transactions, and so on. So let's be a bit more consistent and be pragmatic optimizers.


