...
With respect to trusting the compiler: whole teams have spent years on just the optimizer of a modern compiler. These people know CPU architecture by heart and are usually extremely good at what they do. So, except for extreme cases or bad compilers, letting the optimizer do its job is in my opinion the correct and practical attitude, because I almost certainly won't outsmart people who do this for a living.
Too many developers take this too far, expecting the teams who optimize platforms and compilers to make up for their own inability to optimize their code. Some system limitations are inherent; good programmers need to understand those limitations and find solutions that work around them.
Optimizing the compiler or system tends to give incremental gains; optimizing your solution to the problem space gives order-of-magnitude gains. Some examples of what I mean:
As part of a web server, we had to match billions of incoming strings against whitelists and blacklists with thousands of entries each. I suggested using the Aho-Corasick algorithm. The developer implemented it in Java, and it wasn't any faster than brute-force string matching. This smelled fishy, so I profiled the code. It turned out the developer had implemented it with java.util data structures (because why not? it's up to the JVM developers to optimize their container libraries). The code spent more than 90% of its time autoboxing chars, and that was only the tip of the inefficiency iceberg. The developer didn't know that (a) Java containers can only hold objects, not primitives, (b) objects are very expensive, especially ephemeral ones, and (c) autoboxing adds even more unnecessary operations. I rewrote the same algorithm without objects (Java arrays can hold primitives, which eliminated all the object/string/autoboxing overhead), and with no other changes it ran
20 times faster in the short term, and
50 times faster in the long term, because when garbage collection kicked in, it didn't need to clean up hundreds of millions of tiny objects.
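To make the boxing cost concrete, here is a minimal sketch (not the original matcher; the class and method names are mine) contrasting a boxed container with a primitive array. Both compute the same result, but the first allocates one Character object per element while the second allocates nothing per element:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    // Boxed version: every char becomes a java.lang.Character object on the heap,
    // and every comparison unboxes it again.
    static int countBoxed(String s, char target) {
        List<Character> chars = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            chars.add(s.charAt(i)); // autoboxing: char -> Character allocation
        }
        int count = 0;
        for (Character c : chars) { // unboxing on every comparison
            if (c == target) count++;
        }
        return count;
    }

    // Primitive version: one char[] array, zero per-element objects.
    static int countPrimitive(String s, char target) {
        int count = 0;
        for (char c : s.toCharArray()) {
            if (c == target) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBoxed("abracadabra", 'a'));     // 5
        System.out.println(countPrimitive("abracadabra", 'a')); // 5
    }
}
```

Run this over billions of strings and the boxed version's per-element allocations become exactly the garbage the collector later has to sweep up.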
We needed to read input from about 20 different sockets and merge the data. The developer, working in C, implemented a multithreaded approach with one thread per socket. So far, so good; it was an improvement on the single-threaded version. But he didn't understand how expensive heap operations can be under certain conditions. I'm talking malloc/free, not new/delete. Changing the code so it cycled through a ring of buffers that were allocated once at startup, eliminating all subsequent malloc/free calls, made it more than
10 times faster.
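The original code was C, but the preallocate-once idea translates directly; here is a minimal sketch in Java (the BufferRing class and its method names are mine, not the original code). All allocation happens in the constructor; reader threads then recycle buffers instead of allocating per read:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a ring of reusable buffers, allocated once at startup.
// In the C original this replaced per-read malloc/free; in Java the same
// pattern avoids per-read allocation and garbage-collection pressure.
public class BufferRing {
    private final BlockingQueue<byte[]> free;

    public BufferRing(int count, int size) {
        free = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) {
            free.add(new byte[size]); // all allocation happens here, once
        }
    }

    // A reader thread takes a buffer instead of allocating a new one.
    // Returns null if every buffer is currently in use (caller must handle).
    public byte[] acquire() {
        return free.poll();
    }

    // When the merging stage is done with the data, the buffer is returned
    // to the ring for reuse.
    public void release(byte[] buf) {
        free.add(buf);
    }
}
```

The ring also acts as natural backpressure: if the merging stage falls behind, readers run out of free buffers and stall instead of piling up unbounded allocations.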