https://wiki.c2.com/?ProfileBeforeOptimizing
An unceasing temptation is to “optimize as you go”, by writing things at a lower level than you really should (e.g. reaching directly into arrays, referring to an instance variable directly instead of through its getter method, etc.) or adding lots of execution shortcuts for special cases.
This almost never works.
Human beings, even experienced programmers, are very poor at predicting (guessing) where a computation will bog down.
Therefore:
Write code according to constraints besides performance (clarity, flexibility, brevity). Then, after the code is actually written:
- See if you actually need to speed it up.
- Profile the code to see where it’s actually spending its time.
- Focus on the few high-payoff areas and leave the rest alone.
There are lots of ways to improve performance once you know where to do it: use a data structure that better fits your needs (lots of inserts vs. lots of deletes, lots of space vs. lots of time, etc.), make your algorithm more clever (cache results, take advantage of order, traverse only what you need to, etc.), switch to a lower-level language, or even implement the high-payoff area in hardware.
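For example, “cache results” can be as simple as memoizing an expensive call behind a map. A minimal sketch in Java (expensiveLookup is a made-up stand-in for whatever the profile says is hot):

  import java.util.HashMap;
  import java.util.Map;

  class LookupCache {
      private final Map<String, Integer> cache = new HashMap<>();

      // Hypothetical expensive operation; stands in for whatever the profile flagged.
      private int expensiveLookup(String key) {
          return key.length() * 31;  // placeholder computation
      }

      // "Cache results": pay for the expensive call only once per distinct key.
      int lookup(String key) {
          return cache.computeIfAbsent(key, this::expensiveLookup);
      }
  }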
But if you start haphazardly trying to optimize before you actually know where things are bogging down, you’re guaranteed to be pessimizing your development efficiency.
When it comes time to optimize a piece of software, always get profiling information first, so you can tell where you need to spend your time making improvements. Without profiling data, there is no way of knowing for certain whether any “improvements” have indeed improved the code (very similar to using UnitTests to determine when a project is finished).
Usually, “profiling” means getting a profile of time spent in various routines or subsystems, which allows for optimizing for speed. Profiling for [memory] space, cache misses, or whatever else you care about is also possible, though some of these take a little wizardry to get good profile data.
Optimizations need not be tiny tweaks, either. They could be wholesale replacement of an O(N*N*N) algorithm with an O(N*N) one, or outright elimination in the most grievous cases.
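On a smaller scale, the same kind of wholesale replacement might look like swapping a nested-loop duplicate check (O(N*N)) for a single pass over a hash set (roughly O(N)). An illustrative sketch, not taken from the discussion:

  import java.util.HashSet;
  import java.util.Set;

  class Duplicates {
      // O(N*N): compares every pair of elements.
      static boolean hasDuplicateSlow(int[] values) {
          for (int i = 0; i < values.length; i++) {
              for (int j = i + 1; j < values.length; j++) {
                  if (values[i] == values[j]) return true;
              }
          }
          return false;
      }

      // Roughly O(N): one pass, letting a hash set do the lookups.
      static boolean hasDuplicateFast(int[] values) {
          Set<Integer> seen = new HashSet<>();
          for (int v : values) {
              if (!seen.add(v)) return true;  // add() returns false if already present
          }
          return false;
      }
  }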
See: OptimizeLater (Maybe this should be merged?), UniformlySlowCode, and PrematureOptimization
ThreadMode discussion:
[[ Someone please link in examples, eg Kent’s war-story regarding three 12K strings or date-range objects (perhaps somewhere out in NullObject space?) ]]
Does somebody have experience with profiling in Java? I had problems because the JustInTimeCompiler changes the speed behaviour of the program, making it difficult to profile. It seems that the JustInTimeCompiler is turned off when getting profiling information.
Hm… any ProfilerTool which measures speed of a program requires clock cycles and memory, I suppose. HeisenbergUncertaintyPrinciple, eh? Observing something changes it. ^_^ So fudge a few tenths of a second?
See JavaProfilers
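One crude way to keep the JustInTimeCompiler in the picture when timing something yourself is to warm the code up before measuring. A hand-rolled sketch (the work() method is just a stand-in; a real profiler or benchmark harness does this much more carefully):

  class CrudeTimer {
      // Stand-in for the code under measurement.
      static long work() {
          long sum = 0;
          for (int i = 0; i < 1_000_000; i++) sum += i;
          return sum;
      }

      public static void main(String[] args) {
          // Warm-up: give the JustInTimeCompiler a chance to compile the hot path.
          for (int i = 0; i < 10_000; i++) work();

          // Then measure.
          long start = System.nanoTime();
          for (int i = 0; i < 10_000; i++) work();
          long elapsed = System.nanoTime() - start;
          System.out.println("avg ns per call: " + elapsed / 10_000);
      }
  }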
I generally agree with this – but how does this idea interact with some of the standard performance idioms you use as a matter of course? For example, in C++ I frequently pass parameters by const reference, not value, so as to avoid the overhead of frequent copy constructors. Yes, in a sense that is an optimisation, but it doesn’t (IMHO) impact maintainability, and can dramatically impact performance. It’s also tedious to put in after the fact since it involves lots of edits. — BurkhardKloss
Ferinstance, consider k db (from http://www.kx.com, or see http://www.aplusdev.org for its precursor): every operation performed by the machine is a transaction. If you put 50 machines in a cluster they combine to form a single-threaded state machine. But that’s very fat state, such as multi-gigabyte database columns being “single objects”. And the result is meaningfully fast, with real time performance guarantees. Anyways, the specific relevance here is that k semantics are call by value, but the underlying implementation is call by reference with copy on write. The more general relevance is that here is an example of a very-high-level design and implementation which, at least in its target domain, beats the socks off of analogous implementations which focus on lower-level optimizations. Or, even more generally, the fact that you focus on const parameters as an issue means you’re likely missing the big picture.
It’s interesting to look at the native java drivers for k db. There’s a trivial database interface which is much faster than jdbc — they can be freely downloaded, last I checked. Connection pooling, and other “optimizations” aren’t implemented because they slow the system down. jdbc is layered on top of that, for people who can live with the speed penalty that the jdbc design imposes.
Yeah, I write in Java, and I always pass parameters by const reference 🙂 .
For things like this, the question is, “Why wouldn’t you do this?” If, as you point out, there’s no reason not to — including maintainability — go ahead and do it: it’s free.
If there is any impact on clarity or any other aspect of maintainability, though, then you’re making a purchase, and you should put off making it until you know whether you need to. The principle is stated extremely because most of us, most of the time, think of things as free that aren’t (e.g. external iteration code). —GeorgePaci
I believe this is merely an example of using a language as it is intended, not an optimization. In C++, you need to be aware of the differences between references, pointers, and values and use them appropriately. —WayneMack
When it comes down to it, passing by const reference simply to avoid a copy is an optimization, and more importantly one the compiler can do for you (and yes, I have tested this (with gcc 4.7)). One of the nice things about C++ specifically is that it goes to great lengths to promote so-called “value semantics,” the homoiconity between fundamental types and complex user-defined data types that allows you to Say What You Mean. Limiting the use of const reference parameters to things like copy constructors and assignment operators, things for which crefs are absolutely the norm (to the extent that the compiler will provide their definitions for you!), I find makes code appreciably, though not overwhelmingly, more readable. I don’t have to worry about whether I actually can make my functions const or whether I have to call by value, because having proper value semantics ensures that things Just Work. In other words, don’t be so quick to give up the generalizability (read: future extensibility) and possibly clarity of your code to do something which not only may be unnecessary, but may in fact be done for you once you crank your compiler up to -O11! In (yet) other words, ProfileBeforeOptimizing.
BurkhardKloss writes: how does this idea interact with some of the standard performance idioms you use as a matter of course?
The example he gives turns out not to illustrate his point well, but there are others. In Java, it is common to use a StringBuffer rather than the string concatenation operator (+). And some recommend declaring local variables outside loops rather than in them. Both of these performance idioms reduce readability (unless you are very familiar with them, at which point they are transparent).
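For reference, the two forms of the concatenation idiom look like this (a minimal sketch):

  class Concat {
      // Readable form: concatenation with +.
      static String joinWithPlus(String[] parts) {
          String result = "";
          for (String part : parts) {
              result = result + part;   // may allocate a new String each time around
          }
          return result;
      }

      // The "performance idiom": an explicit buffer.
      static String joinWithBuffer(String[] parts) {
          StringBuffer buffer = new StringBuffer();
          for (String part : parts) {
              buffer.append(part);
          }
          return buffer.toString();
      }
  }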
Although these tricks apparently helped in Java 1.0, they are both approximately neutral in 1.2, and can be harmful to performance in 1.3. Why? Because the JVMs and compilers have improved substantially and are now clever enough to optimize away the performance issues that the idioms fix. But they aren’t clever enough to recognize what the idioms are doing, so they can’t apply the same optimizations.
So a former performance idiom has become a liability. And for many programmers who were taught these as rituals rather than optimization tricks, they have become OldRulesWithForgottenReasons. Yuck! So the general rule of ProfileBeforeOptimizing still holds.
Probably first published by BrianKernighan and PjPlauger. They have some other maxims in this area:
- Make it clear before you make it fast
- Make it correct before you make it fast
The bottleneck isn’t where you think it is.
I can certainly recall examples from my own and colleagues’ code where I (or they) did not know where the bottleneck was. However, I can also recall many examples where, after a moment’s thought, I did know where it was.
The very fact that you know which were which shows that you tested your intuitions, rather than just assuming they were always right, which illustrates the thesis of this rule. The same is true of the fact that your intuitions weren’t correct 100% of the time… that too illustrates the value of the rule. Obviously some people will never have correct insight, but they should only be following the first of the RulesOfOptimization, not the third.
The ProfilerTool is still useful as insurance if nothing else. Your time is valuable. Even if you’re right, you need objective evidence to justify spending time fixing the performance problem. So I’m not disputing the advice to ProfileBeforeOptimizing. However, I don’t recall ever seeing any evidence that the rule itself has ever been tested scientifically. Is it just FolkWisdom? Or is my concern MetaWisdom gone awry?
Even if you know exactly where the bottleneck is, it’s often handy to measure the effect of your optimizations, if only so you know when to stop optimizing.
Also, it’s nice to have concrete numbers to satisfy your curiosity and to put in your status report.
This is how I profile: I launch the program, then stop it manually a few times. Each time I stop it, I have complete access to all context information – local vars, the stack trace, and everything. You don’t need that many samples to know that e.g. a certain SQL execution, or some string copying, or whatnot, is eating a lot of CPU cycles.
At least, this method tells me where the wall clock time goes, which is often the most important thing. A situation where my method is indispensable (compared to automatic profiling tools) is in interpretive environments. Sure, the program spends its time handling some large data structure, but _where_ in that data structure? So what my method gives me is not just “which function uses the time”, but, just as important, “what are all those data that it is called with, all the time”.
The method is also excellent in a distributed system: “Oh, so it is waiting for that server most of the time”, “let’s see what it was asking for – could we maybe bulk process that?”.
Also, my profiling doesn’t interfere with the normal execution – or, well, it requires debug information, which rules out certain optimizations, but aside from that…
— BjarkeDahlEbert, Nov 2003
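A minimal sketch of the same pause-and-look idea in Java, done in-process by dumping a worker thread’s stack from a sampler loop (class, thread, and method names here are illustrative, not from the discussion above):

  import java.util.Map;

  class StackSampler {
      public static void main(String[] args) throws InterruptedException {
          Thread worker = new Thread(StackSampler::busyWork, "worker");
          worker.start();

          // Take a handful of "samples": whatever shows up repeatedly is where the time goes.
          for (int sample = 0; sample < 5; sample++) {
              Thread.sleep(1000);
              for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                  if (!e.getKey().getName().equals("worker")) continue;
                  System.out.println("--- sample " + sample + " ---");
                  for (StackTraceElement frame : e.getValue()) {
                      System.out.println("    " + frame);
                  }
              }
          }
          worker.interrupt();
      }

      // Stand-in workload so the sampler has something to catch.
      static void busyWork() {
          while (!Thread.currentThread().isInterrupted()) {
              StringBuilder sb = new StringBuilder();
              for (int i = 0; i < 10_000; i++) sb.append(i);
          }
      }
  }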
A good ProfilerTool should be able to capture key values for you as it profiles, recording these against the timestamps it collects. You shouldn’t need to stop the program manually; that approach is much less likely to be accurate. For instance, the break you’re providing by stopping the program could allow asynchronous database reads to progress outside your measured execution time, skewing the results.
MikeDunlavey replying to the above paragraph: Assume the program is not optimal – it’s spending some wall time it doesn’t need to, and you want to find that.

If there is a small routine doing a bubble sort of a large integer array, then any CPU profiler showing self % should find it. If it is sorting strings, calling strcmp, then only profilers showing inclusive (total) % would show it. If the routine is not small, but is big as many are, you are left hunting in it for the guilty code, possibly focusing on the wrong statements. If the profiler gives you a graph with the inclusive %, maybe you can find it if strcmp is not called in too many places.

Now suppose the problem is not a long bubble sort, but a few routines each being called one time too many from somewhere. If this is happening simultaneously at different levels of the call stack, those factors can multiply together, but no CPU profiler will really show where the problem is. Suppose the problem is various different class objects being created and their constructors being run when they could in fact be re-used. CPU profilers will have a hard time with this because it is not clear how to summarize the results in a way that draws attention to the problem.

Now suppose the call tree is fairly deep, going down into library, system, or routines written by others that just happen to do some IO, like logging something you’re not aware of, pinging out for an internet address, looking for a resource on a dll, shelling out some command-line function, etc. If the profiler is a CPU-profiler, it is blind to that, so there is no way it can possibly tell you a big % of wall time is going into it.

The long and short of all this is: typically you need to FIND THE PROBLEM, not MEASURE STUFF. Accuracy of measurement is not the goal. If you can find out what’s really taking a lot of time that you can fix, do you really care if you only know the amount of time is somewhere between 40% and 60%? If you fix it, you can easily find out how much time you saved, to any accuracy, by just timing it. But if you can’t find the problem, you can’t fix it.

There is a widespread assumption, which is not a theorem, that any of these problems will be found if you only had the right profiler, but that’s a giant IF. Bjarke’s method can find any of these problems effectively, because it applies all the intelligence of the programmer to understand what’s going on, rather than hoping some automatic summarizer will draw attention to it. If you halt it and examine the state 10 times, then any problem whose fixing would save 50% of time will be precisely in evidence on roughly 50% of the samples. On the other hand, if you want to be told precise measurements, and be told there is no evident problem (though there may be one), then use a profiler. Self-deception always feels good.
See ProfilerTool for discussion and comparison of specific profilers.
I have been collecting some basic optimization rules for years and so far have come up with this list:
- It is faster to do something once than to do it 10,000 times.
- Fastest I/O is no I/O (I think this one is from BrianKernighan)
- Only optimize code that is slow
Of course these rules should be applied only if the above rules have been considered.
JustOneOptimization – fixing the worst of your problems may cause the next 10 to drop off the scope. Therefore, follow the first three rules of optimization, do JustOneOptimization, then start again at rule #1.
I know this one’s widely accepted (i.e. not my idea), but it ain’t mentioned here yet and I don’t know an apposite Knuth/Kernighan/Ritchie/Beck quote to put on its page to sit along with the other three – anyone got one? — BrianEwins
I’ve seen quite the opposite. Fix just one, and most of the rest of the system crawls.
It is much better to make the code simpler. JustRefactorTheCode: Simple code has almost always good performance. If the code is simple enough, fixing all the performance problems with just one tweak should be easy. —GuillermoSchwarz
[This is simply not true in some programming domains, in particular I am thinking of numerical analysis (or other heavy simulation code). It is precisely in these sorts of domains where complicated code and algorithms may be justified because they can realize huge gains over simple (and often naive) code. The question of whether this justifies the maintenance cost is, of course, application dependent.]
ReFactoring code has nothing to do with simplicity or complexity of algorithms, merely how the algorithms are expressed. The more complicated the expression of the algorithm becomes, the more difficult it is to identify areas for possible improvement. It also increases the likelihood of making otherwise obvious mistakes. One reason simple code tends to have better performance is that it is harder to hide bad decisions.
I like:
- “More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason – including blind stupidity.” — WilliamWulf
Does anyone have a source for the observation that “Human beings, even experienced programmers, are very poor at predicting (guessing) where a computation will bog down.” (and that profilers are better)? It’s widely known and I’m not disputing it, but it would be nice to know it was more than anecdotal…. TIA, Alan.
Don’t forget what may be the simplest, and quite possibly most effective, way to find performance problems. You run the program under a debugger and, while it is doing whatever takes too long, manually halt it with a “pause” key. Make a record of the call stack. Do it again. Compare the call stacks to find common call instructions. Look for these call instructions in the source code to see if they are truly necessary, or could be avoided. If necessary, take additional samples of the call stack until you find such calls. Four or five is typical; I’ve never needed more than 20. The reason this works is that the call stack exists for the entire time your program is running, and any call instruction that is active on the call stack some percent of the time, say 30%, would save you that much if you could avoid it. Certainly you could try to speed up the routine being called, but it might be a lot easier just not to call it so much. Unfortunately, most profilers summarize at the level of whole functions / methods rather than on the statements / instructions where the calls occur. — MikeDunlavey
That seems like a wise and experience-driven approach to profiling.
Thanks, and it’s good to see that BjarkeDahlEbert above had the same idea. I think the world is profiler-happy at the moment, focussing on the tool rather than the problem. I’ve seen people using profilers be pleased to get a 20 percent improvement. With this method, whole gobs of stuff can be cut out, and 20 times is not too unusual.
The profiler says the bottleneck is network latency. The client and server are three hours apart. Now what?
Find a way to use fewer round-trip interactions between the client and the server.
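For example, instead of issuing one query per id, batch the ids into a single round trip. A sketch using plain JDBC (the table and column names are made up for illustration; assumes ids is non-empty):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.util.ArrayList;
  import java.util.List;

  class BatchedFetch {
      // One round trip per id: N network latencies.
      static List<String> fetchOneByOne(Connection conn, List<Integer> ids) throws SQLException {
          List<String> names = new ArrayList<>();
          for (int id : ids) {
              try (PreparedStatement ps = conn.prepareStatement("SELECT name FROM item WHERE id = ?")) {
                  ps.setInt(1, id);
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) names.add(rs.getString(1));
                  }
              }
          }
          return names;
      }

      // One round trip for the whole batch.
      static List<String> fetchBatched(Connection conn, List<Integer> ids) throws SQLException {
          StringBuilder placeholders = new StringBuilder();
          for (int i = 0; i < ids.size(); i++) placeholders.append(i == 0 ? "?" : ", ?");
          String sql = "SELECT name FROM item WHERE id IN (" + placeholders + ")";
          List<String> names = new ArrayList<>();
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
              for (int i = 0; i < ids.size(); i++) ps.setInt(i + 1, ids.get(i));
              try (ResultSet rs = ps.executeQuery()) {
                  while (rs.next()) names.add(rs.getString(1));
              }
          }
          return names;
      }
  }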
I’d say to go with an ISP with more direct wiring. 3 light hours (around 21.64 AU according to google) is about the distance from the Sun to Uranus (no jokes please). I am assuming you aren’t from SETI and that electrical signals travel at a significant proportion of the speed of light. — AaronRobson
CategoryOptimization
Last edit March 26, 2014