Essential Optimization Rules

Recently, I've started coaching others in optimization. My role is to shorten the path from journeyman to master. Interestingly enough, in my experience the most important quality is pragmatism. A matter-of-fact way of solving problems is crucial. On the one hand, it eliminates the "magic" that appears when someone does not understand the causes of what they observe. On the other hand, it prevents aimless digging through low-level code.

What does it mean "to be a pragmatic optimizer"? To me, a pragmatic optimizer is someone who is objective, specific, and certain about what they do to make software run faster. I decided to express the idea of pragmatic optimization as several pieces of advice. For me, this is the set of rules I follow in my own work. It is simple but essential, and it is meant to help you do your optimization work optimally.

1. Say NO to premature optimization

Yes, I know, this rule is well known. Many programming books cover it. However, I feel the need to consider it from a pragmatic point of view as well.

Imagine you are developing a cutting-edge algorithm. The algorithm is not yet well defined and requires some experimental work. You think about doing it optimally and try your best to ensure good performance throughout the whole development process. But look, there are several concerns about this approach: the algorithm is still changing, so carefully tuned code may simply be thrown away, and tuned code is harder to modify while you are still experimenting.

Don't get me wrong: I'm not trying to dissuade you from optimizing. I just urge you to choose the right moment for it.

2. Time and profile, then optimize

Consider pipeline runtime. Never start optimizing before you have timings. Do not start with a function that you merely suspect to be a bottleneck. You need proof! Blind optimization is no better than premature optimization; in general, it leads to the same problems. If your guess is wrong, you spend time without any improvement. If your guess is right, you got lucky. It is pointless to rely on luck when it comes to the fate of your project, right?
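Getting a timing is cheap. As a minimal sketch, here is one way to wrap a piece of code with a wall-clock measurement in C++; the `work` function is a made-up stand-in for whatever you actually want to time:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical workload stub; replace with the real function under test.
static std::uint64_t work(std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        acc += i * i;
    return acc;
}

// Time a callable and return the elapsed microseconds.
// This runs it once; a real measurement should repeat the call
// many times and report a median to smooth out noise.
template <typename F>
long long time_us(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

A few numbers from a harness like this are already enough to compare two versions of a function honestly instead of guessing.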

3. Optimize the most consuming function first

This rule is closely connected to the previous one. Sometimes, even when an application profile is available, it is ignored. Be objective. Do not pick the easiest, the most fun, or simply the first item on the profile list. Bear in mind that 80% of the time is typically spent in 20% of the code. Your job is to find those 20%.

In most cases, it is simple to follow this rule: collect your profile, find the bottleneck, optimize it, and move on to the next one.

4. Identify performance limiters

What limits the performance of your function? Is it the number of memory transactions? A critical path in arithmetic? A shortage of computational resources of a particular type? Your optimization strategy will vary depending on the kind of limiter. If you don't identify the limiters, you will spend ages on a problem that does not exist.

5. Compute only what is necessary

Obviously, an empty function is extremely fast; the problem is that it does nothing useful. First of all, understand the function and define the set of operations it really requires. Get rid of everything outside this set. A surprisingly large portion of speedups is achieved exactly this way.

At the level of a single loop, this means getting rid of all superfluous loads, arithmetic, and type conversions. At the level of a whole pipeline, it implies removing redundant memory allocations, copies, and function calls, and may also imply restructuring data. Keep only the required subset of operations at each level of your code.
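The single-loop case can be sketched in a few lines. The functions and values below are made up for illustration; the point is hoisting a loop-invariant computation (here, `sqrt(base) / 2`) out of the loop body:

```cpp
#include <cmath>
#include <cstddef>

// Naive version: recomputes the loop-invariant factor sqrt(base)/2
// on every single iteration.
float scaled_sum_naive(std::size_t n, float base) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += static_cast<float>(i) * std::sqrt(base) / 2.0f;
    return s;
}

// Lean version: the invariant is hoisted out of the loop, so each
// iteration does exactly one multiply and one add.
float scaled_sum_lean(std::size_t n, float base) {
    const float k = std::sqrt(base) / 2.0f;
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += static_cast<float>(i) * k;
    return s;
}
```

A good compiler will often hoist this particular invariant on its own, but in real code the redundant work is usually hidden behind function calls or type conversions that the compiler cannot prove safe to move.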

6. Load only what is necessary

It is well known that most code is memory bound. Many things have been done to improve memory performance: multilevel cache hierarchies, write-back policies, and so on. But processor performance is still a bit ahead. You should be aware of the memory organization of the system you are currently targeting. Moreover, you should understand the function you are optimizing well enough to minimize its memory footprint where possible. That way, you will not waste time pushing useless bytes through the whole memory subsystem.

Another aspect is choosing proper data types and data structures.
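One classic illustration of this is the array-of-structures versus structure-of-arrays choice. The `Particle` layouts below are hypothetical, but the arithmetic is real: a loop that reads only one field pays for the whole record in the AoS layout.

```cpp
#include <cstdint>

// Array-of-structures: a loop that reads only x still drags y, z,
// and flags through the cache -- 32 bytes of traffic per element.
struct ParticleAoS {
    double x, y, z;
    std::uint64_t flags;
};

// Structure-of-arrays sketch: a kernel touching only x streams
// 8 bytes per element instead of 32, a 4x reduction in memory
// traffic for that particular loop.
struct ParticlesSoA {
    double* x;
    double* y;
    double* z;
    std::uint64_t* flags;
};
```

Which layout wins depends on the access pattern: SoA shines when kernels touch a few fields across many elements, while AoS is fine when every field of an element is used together.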

7. Understand the reason of speedup

It is cool that you got some speedup, but ask yourself: am I sure about its causes? Have I eliminated pipeline stalls? Have I reused data in caches more effectively? This understanding is extremely helpful for generalizing your knowledge and reusing approaches in the future. This function is not going to be the last code you optimize, right?

8. Keep it simple

Optimization is always a trade-off between code simplicity and peak performance. Always consider this fact when submitting new optimizations. To be honest, I love all this geeky stuff very much: writing assembly, peephole optimization. In most cases, the last 10% of runtime is cut out by exactly that kind of thing.

And finally, think about the person who will maintain your code after you. If you don't want to be that person, step back and try to achieve the same performance level with cleaner code.

9. Check one approach at a time

Do not try to apply all your ideas at once. If you do, it will be hard to tell how much each one contributes. Besides, some approaches conflict with each other and cannot be applied together. Be systematic: apply one approach, measure how it performs, then select the next one to apply. Do not forget to discard the approaches that do not work.

10. Use scientific approach

Think of profiling as a scientific experiment. Profilers are not going to do your work for you; what a profiler does is give you precise data against which to check your hypothesis. So, first of all, state the hypothesis. Otherwise, you will have plenty of data and no idea what to look for. Use a profiler to identify bottlenecks, to verify memory access patterns, to collect statistics on data-dependent workloads, or to highlight hardware properties.

11. Avoid blind SIMDization

By "blind SIMDization" I mean replacing scalar code with exactly the same code in vector form. Come on! Anyone who can write code and has a list of supported instructions in front of them is able to do this. Furthermore, most compilers can do it as well. You would probably be surprised by how complex a function can be and still get auto-vectorized. Moreover, automatic vectorization is preferable to hand-written vectorization because it costs nothing in terms of maintenance and portability to new revisions of the platform: maintaining your optimizations becomes the compiler writers' job. I bring up auto-vectorization because it is a good example of blind SIMDization. The compiler knows nothing about the semantics of your code, so all it can do is replace the scalar code with a vector analogue that produces exactly the same result. Be more creative. Use your knowledge of the algorithm and the hardware to perform trickier optimizations. It is not uncommon for the optimal instruction chains, and even the data types, to differ between scalar and vector code.

Let's look at it from a different angle: a speedup exceeding the number of SIMD lanes is theoretically impossible. Unless your code benefits from something special, such as hardware support for a particular instruction, it will be bounded by that limit. Specific instructions can help; a couple of examples are type conversions and native support for some transcendental operations. In other cases, overcoming that limit is up to you.
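A small sketch of what "help the vectorizer" can look like in practice. The functions and data below are invented for illustration: both compute the same clamped sum, but the branchy form often blocks auto-vectorization, while the branchless `min`/`max` form maps directly onto SIMD min/max instructions that most compilers will vectorize on their own:

```cpp
#include <algorithm>
#include <cstddef>

// Branchy scalar code: the data-dependent if/else chain often
// prevents the compiler from auto-vectorizing the loop.
// Pseudo-data (i * 7 % 16) is generated internally so the sketch
// is self-contained.
int clamp_sum_branchy(std::size_t n, int lo, int hi) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int x = static_cast<int>((i * 7) % 16);
        if (x < lo) x = lo;
        else if (x > hi) x = hi;
        s += x;
    }
    return s;
}

// Branchless form: std::min/std::max map onto SIMD min/max
// instructions, so most compilers vectorize this loop without
// any intrinsics at all.
int clamp_sum_branchless(std::size_t n, int lo, int hi) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int x = static_cast<int>((i * 7) % 16);
        s += std::max(lo, std::min(hi, x));
    }
    return s;
}
```

The restructuring itself is scalar C++; the speedup comes from giving the compiler a shape it knows how to vectorize, which is exactly the opposite of blindly transcribing the branchy loop into intrinsics.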


And one last thing: pragmatism comes from experience. So practice, in order to learn the hardware's limits and the winning solutions. I promise, you will have lots of fun.




Maintained by cuda-geek Hosted on GitHub Pages