01-05-2012, 11:16 AM
BartC
Re: Performance of hand-optimised assembly



"Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message
news:0.7163c7a5dd6c98958adf.20120105002252GMT.87boqj6lr7.fsf@bsb.me.uk...
> jacob navia <jacob@spamsink.net> writes:
>
>> Le 04/01/12 23:47, Ben Bacarisse a écrit :
>>> When given the "obvious" C, gcc -O3 produces a function with 63
>>> instructions taking up 188 bytes.

>>
>> Show please...

>
> void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
> {
>     int rest = 64 - AmountToShift;
>     vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
>     vector[1] = (vector[1] << AmountToShift) | (vector[2] >> rest);
>     vector[2] = (vector[2] << AmountToShift) | (vector[3] >> rest);
>     vector[3] = (vector[3] << AmountToShift) | (vector[4] >> rest);
>     vector[4] = (vector[4] << AmountToShift) | (vector[5] >> rest);
>     vector[5] = (vector[5] << AmountToShift) | (vector[6] >> rest);
>     vector[6] = (vector[6] << AmountToShift) | (vector[7] >> rest);
>     vector[7] = (vector[7] << AmountToShift);
> }


I've tested this in 32-bit mode.

For 100 million iterations, lcc-win32 took at least 5.5 seconds, and gcc up
to -O2 took at least 4.3 seconds.
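For anyone wanting to repeat this kind of measurement, here is a minimal sketch of a timing loop. It is not the harness used for the figures in this post; the name time_shift_vector, the starting data and the use of clock() are all assumptions made for illustration. ShiftVector is copied from the quoted code.

```c
#include <time.h>

/* Function under test, copied from the quoted post.
   Only valid for 0 < AmountToShift < 64: a shift count of 64 is undefined. */
void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
{
    int rest = 64 - AmountToShift;
    vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
    vector[1] = (vector[1] << AmountToShift) | (vector[2] >> rest);
    vector[2] = (vector[2] << AmountToShift) | (vector[3] >> rest);
    vector[3] = (vector[3] << AmountToShift) | (vector[4] >> rest);
    vector[4] = (vector[4] << AmountToShift) | (vector[5] >> rest);
    vector[5] = (vector[5] << AmountToShift) | (vector[6] >> rest);
    vector[6] = (vector[6] << AmountToShift) | (vector[7] >> rest);
    vector[7] = (vector[7] << AmountToShift);
}

/* Hypothetical harness: returns the CPU time in seconds used by
   `iterations` calls of ShiftVector with the given shift amount. */
double time_shift_vector(long iterations, int amount)
{
    unsigned long long v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    volatile unsigned long long sink;  /* keeps the result observable */

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        ShiftVector(v, amount);
    clock_t end = clock();

    sink = v[0];
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_shift_vector(100000000L, 5) would correspond to the 100 million iterations mentioned above. Bear in mind that an optimising compiler is free to inline or otherwise transform the loop, so what gets timed depends heavily on the optimisation level.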

My assembly version took 2.5 seconds for shifts of less than 32 bits, or 3.3
seconds for shifts of more than 32. (A shift of exactly 32 bits took 1.4 seconds.)

However, gcc -O3 took 1.4 to 1.6 seconds (and 0.7 seconds for exactly a
32-bit shift).
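One likely reason the exact 32-bit case is so cheap in 32-bit mode: each 64-bit element lives in two 32-bit halves, so a shift by exactly 32 needs no shift instructions at all, just a move and a clear. The sketch below models that; the u64_pair type and the function names are made up for illustration and are not taken from gcc's output or from my assembly.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit value held in two 32-bit registers. */
typedef struct {
    uint32_t lo;  /* low 32 bits  */
    uint32_t hi;  /* high 32 bits */
} u64_pair;

/* Left shift by n, valid for 0 < n < 32: both halves are shifted
   and the top n bits of the low half cross into the high half. */
u64_pair shl_small(u64_pair x, int n)
{
    u64_pair r;
    r.hi = (x.hi << n) | (x.lo >> (32 - n));
    r.lo = x.lo << n;
    return r;
}

/* Left shift by exactly 32: the low half simply becomes the high half. */
u64_pair shl_32(u64_pair x)
{
    u64_pair r = { .lo = 0, .hi = x.lo };
    return r;
}
```

With the shift amount known to be 32, the whole of ShiftVector reduces to copying 32-bit words along the array, which would fit with it running in about half the time of the general case.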

I'm still looking at how it can do that, since my assembly code is pretty
short! It's obviously inlining the call, but function call overheads only
account for 0.4 seconds.

The strange thing is that gcc's inlined copy of ShiftVector is only 80
instructions, while the standalone ShiftVector function is about 250. There
are some parameter overheads, but not 170 instructions' worth.
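One possible explanation, which I haven't confirmed: if the call site passes a literal shift amount, the inlined copy sees a compile-time constant, whereas the standalone function has to handle any amount at run time (and in 32-bit mode a variable 64-bit shift needs extra code for counts of 32 or more). A sketch for comparing the two cases follows. The function names are hypothetical, the loop form of the body is my own shorthand, and the noinline/always_inline attributes are GCC extensions.

```c
/* Shared body, forced inline so that each wrapper below gets its own copy.
   Valid for 0 < n < 64, as with the original ShiftVector. */
static inline __attribute__((always_inline))
void shift_body(unsigned long long v[8], int n)
{
    int rest = 64 - n;
    for (int i = 0; i < 7; i++)
        v[i] = (v[i] << n) | (v[i + 1] >> rest);
    v[7] <<= n;
}

/* Standalone copy: the shift amount is unknown at compile time. */
__attribute__((noinline))
void shift_variable(unsigned long long v[8], int n)
{
    shift_body(v, n);
}

/* Copy with the amount fixed at 5, which is what an inlined call
   with a literal argument would look like to the optimiser. */
__attribute__((noinline))
void shift_by_5(unsigned long long v[8])
{
    shift_body(v, 5);
}
```

Compiling this with gcc -O3 -m32 -S and counting the instructions in shift_variable and shift_by_5 should show whether the known shift amount accounts for the difference in size.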

--
Bartc
