01-10-2012, 04:00 PM
Wolfgang.Draxinger
Re: Performance of hand-optimised assembly

On Fri, 23 Dec 2011 18:43:32 +0000
Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

> I'm starting a new thread, but it is prompted by this remark from
> 88888 Dihedral <dihedral88888@googlemail.com> about a short binary
> search function I posted:
>
> | Do you consider to convert 5 to 10 lines of C to assembly?
> | I did that 20 years ago on X86 cpu.
> | It was a piece of cake to gain the 20-40% speed required if paid
> | well.
>
> I've felt (it's only a feeling) that hand coding for speed (rather
> than to access features not available in C like extra wide multiplies
> and so on) was a thing of the past. But maybe I'm wrong. Is it
> still possible to get a significant speed-up over C by hand coding?


There are a few corner cases where hand-written assembly leaves
compiled code far behind, but they are few and far between. The
problem is that modern CPUs have so much side state that it's next to
impossible to keep track of all the little details that may seem
insignificant but have a huge effect. Even the order of independent
instruction blocks can have a large impact.

Where hand-written assembler is still useful, however, is in
implementing the inner loops of complex algorithms that process (very)
large datasets. Such algorithms usually involve only shuffling numbers
around in a strict layout, so it's easy to reason about this kind of
task and find patterns that a compiler won't. It also lets you exploit
peculiarities of the instruction set in ways a compiler never could,
like the one described by Dark Shikari here:

http://stackoverflow.com/a/98251/524368
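
To make "patterns that a compiler won't find" a bit more concrete,
here is a much simpler illustration than the one linked above (it is
not Dark Shikari's trick). It is only a sketch, and the function names
sum_single and sum_dual are made up for this post. Floating-point
addition is not associative, so without something like -ffast-math a
compiler will generally keep the additions in source order. A human
who knows the data tolerates reordering can split the sum into two
independent chains by hand:

```c
#include <stddef.h>

/* One accumulator: each addition has to wait for the previous one,
   so the loop forms a single long dependency chain. */
double sum_single(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two accumulators: two independent chains that an out-of-order CPU
   can overlap. This changes the order of the additions, so the result
   may differ from sum_single in the last bits for general input. */
double sum_dual(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        s0 += x[i];

    return s0 + s1;
}
```

Whether the second version is actually faster depends on the CPU and
the compiler flags, so measure before believing it.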

> How much depends on the processor? How good are the modern
> optimising compilers that would have to be beaten?


Depends on the task at hand. If it's just about the normal execution
path of a program without many loops, the compiler will most likely
win, because it will put the code in a lot of configurations through
some simulation and count "clock cycles". There are also a lot of
heuristics in it. The other case is the aforementioned processing
loops. You, as the algorithm writer, know about the dependencies in
data access and know where pointers can and cannot point (the compiler
doesn't), which allows you to gain significant performance boosts in
such situations.
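
You don't always have to drop to assembly to hand that pointer
knowledge to the compiler. As a minimal sketch (the function names
scale_add and scale_add_restrict are hypothetical, made up for this
post), C99's restrict qualifier is one way to promise that the
pointers never overlap:

```c
#include <stddef.h>

/* Plain pointers: for all the compiler knows, dst may overlap src or
   even *scale, so it has to be careful about reordering loads and
   stores and may reload *scale on every iteration. */
void scale_add(float *dst, const float *src, const float *scale,
               size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i] * *scale;
}

/* restrict: the caller promises that the three pointers refer to
   non-overlapping objects for the duration of the call. The compiler
   is then free to keep *scale in a register and to vectorise the
   loop. Breaking the promise is undefined behaviour. */
void scale_add_restrict(float *restrict dst,
                        const float *restrict src,
                        const float *restrict scale,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i] * *scale;
}
```

Both functions compute the same thing for non-overlapping arguments.
The difference is only in what the optimiser is allowed to assume, and
how much it gains from that varies by compiler and target.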


Wolfgang
