Go Back   Rhinocerus > Newsgroup > Newsgroup comp.lang.c

Reply
 
Thread Tools Display Modes
  #91 (permalink)  
Old 01-07-2012, 12:36 PM
Stephen Sprunk
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

On 07-Jan-12 03:33, jacob navia wrote:
> Le 07/01/12 06:39, Stephen Sprunk a écrit :
>> On 25-Dec-11 14:35, jacob navia wrote:
>>> Look at this:
>>>
>>> void doADD(double *L, double *R, double *result,size_t n)
>>> {
>>> for(size_t i=0; i<n;i++) result[i]=L[i]+R[i];
>>> }
>>>
>>> Assembly will always do better than standard C in the above specific
>>> code. You can then use the ADDPD and add 2 integers at the same time
>>> (under specific loop conditions).

>>
>> Really? I don't know how to get GCC to assume the arrays above are
>> properly aligned, though I'm sure there's a way.

>
> Maybe, but that is exactly my point. Why couldn't we be able to
> DECLARE that those arrays are properly aligned and are suitable
> for SIMD processing?


In the code above, there are three pointers, which may or may not be
suitably aligned for SIMD instructions. If they were declared as arrays
(which is the slight modification I made), the compiler will know
they're aligned.

> Why should we expect that the compiler implements a HUGE and
> brittle machinery to DEDUCE that instead?


It's not huge or brittle; it's just one AND operation and conditional
branch per pointer. GCC doesn't bother to do it, for reasons I don't
understand, but it is not an unreasonable optimization given the
performance benefits of aligned access.

> Wouldn't it be simpler if we would just be able to declare those
> in the same way that we say:
> int i;
> instead of expecting the commpiler to DEDUCE that because of the
> usage of that variable it must be an int?


Huh? What I did was take L, R and result out of the parameter list and
declare them like this:

double L[256], R[256], result[256];

That is sufficient for the compiler to know they're properly aligned,
unlike a (double *).

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking
Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

  #92 (permalink)  
Old 01-07-2012, 07:09 PM
Joe keane
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

In article <je927u$g2m$1@speranza.aioe.org>,
jacob navia <jacob@jspamsink.org> wrote:
> do operation 1
> do operation 2
> ... etc


My brain is warped from reading too much compiler output, so stuff like
this makes sense:

A part 1
A part 2
B part 1
A part 4
X part 8
D part 2
Reply With Quote
  #93 (permalink)  
Old 01-07-2012, 08:03 PM
Ben Bacarisse
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Stephen Sprunk <stephen@sprunk.org> writes:
> On 07-Jan-12 03:33, jacob navia wrote:
>> Le 07/01/12 06:39, Stephen Sprunk a écrit :
>>> On 25-Dec-11 14:35, jacob navia wrote:

<snip>
>>>> void doADD(double *L, double *R, double *result,size_t n)
>>>> {
>>>> for(size_t i=0; i<n;i++) result[i]=L[i]+R[i];
>>>> }

<snip>
> In the code above, there are three pointers, which may or may not be
> suitably aligned for SIMD instructions. If they were declared as arrays
> (which is the slight modification I made), the compiler will know
> they're aligned.


My gcc (4.6.1) generates addpd instructions with only -O3, even when L,
R and result are pointers. I wonder if the is a gcc version difference
or something to do with my system (64-bit Linux).

<snip>
--
Ben.
Reply With Quote
  #94 (permalink)  
Old 01-07-2012, 09:42 PM
Stephen Sprunk
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

On 07-Jan-12 15:03, Ben Bacarisse wrote:
> Stephen Sprunk <stephen@sprunk.org> writes:
>> On 07-Jan-12 03:33, jacob navia wrote:
>>> Le 07/01/12 06:39, Stephen Sprunk a écrit :
>>>> On 25-Dec-11 14:35, jacob navia wrote:

> <snip>
>>>>> void doADD(double *L, double *R, double *result,size_t n)
>>>>> {
>>>>> for(size_t i=0; i<n;i++) result[i]=L[i]+R[i];
>>>>> }

> <snip>
>> In the code above, there are three pointers, which may or may not be
>> suitably aligned for SIMD instructions. If they were declared as arrays
>> (which is the slight modification I made), the compiler will know
>> they're aligned.

>
> My gcc (4.6.1) generates addpd instructions with only -O3, even when L,
> R and result are pointers. I wonder if the is a gcc version difference
> or something to do with my system (64-bit Linux).


Probably; my gcc is 4.2.4, which is a fairly old branch. I'll see if I
can talk my sysadmin into upgrading.

S


--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking
Reply With Quote
  #95 (permalink)  
Old 01-10-2012, 08:07 AM
88888 Dihedral
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

88888 Dihedral於 2012年1月7日星期*UTC+8下午6時15分06秒 道:
> 88888 Dihedral於 2012年1月6日星期五UTC+8上午5時39分31秒 道:
> > jacob navia於 2012年1月5日星期四UTC+8上午3時22分25秒 道:
> > > Le 04/01/12 18:59, Stephen Sprunk a écrit :
> > > > On 26-Dec-11 14:36, jacob navia wrote:
> > > >> Le 26/12/11 21:24, Stephen Sprunk a écrit :
> > > >>> On 25-Dec-11 14:35, jacob navia wrote:
> > > >>>> Assembly will always do better than standard C in the above specific
> > > >>>> code. You can then use the ADDPD and add 2 integers at the same time
> > > >>>> (under specific loop conditions).
> > > >>>
> > > >>> ... unless you have a vectorizing compiler, which will do the exact same
> > > >>> thing for you without the need for inline assembly.
> > > >>
> > > >> Sure. That was a simple example, and no, gcc doesn't do it, at least
> > > >> in the version I have
> > > >
> > > > Like many GCC sub-projects, the work isn't complete, and probably isn't
> > > > on by default:
> > > > http://gcc.gnu.org/projects/tree-ssa/vectorization.html
> > > >
> > >
> > >
> > > Note that those instructions are available since at least 10 YEARS
> > > and still mainstream compilers like MSVC or GCC do not perform any of
> > > the magic you invoke...
> > >
> > > Sure, maybe ONE DAY IN THE FAR FUTURE they will do it, but until thenI
> > > am using those instructions in my programs now...
> > >
> > >
> > > > GCC isn't the only compiler out there, either; I've heard ICC is very
> > > > good at this, for instance.
> > > >
> > >
> > > Yes, you need ALLL THE RESOURCES of Intel Corp to build such a compiler,
> > > no wonder nobody has done that besides Intel.
> > >
> > > >>> Also, when AVX support comes out, that vectorizing compiler will start
> > > >>> using AVX instructions while your assembly is stuck using SSE.
> > > >>
> > > >> Sure, you can recompile your program bu I can't change a few loop
> > > >> `conditions since I am "stuck" with my old program, poor me.
> > > >
> > > > AVX will be different instructions, different registers, etc.
> > >
> > > What?
> > >
> > > Look Stephen, you are surely knowledgable in C but in assembly...
> > > Take for instance ANDPS (AND NOT PACKED Single precision)
> > >
> > > The corresponding AVX instructions is...
> > > VANDNPS, with the same syntax same semantics. You just
> > > add a "V" before the instruction mnemonic.
> > >
> > > All you have to do is use the ymm registers with 256 bits
> > > instead of the xmm registers with 128 bits. Your loop
> > > counters must be adjusted (using only half as many loops)
> > > and that was it, maybe an hour of work for a big program.
> > >
> > >
> > > > It won't
> > > > be a complete rewrite from scratch, but it'll be a lot of work, whereas
> > > > someone with a vectorizing compiler can just change one option and
> > > > recompile.
> > > >
> > >
> > > Yes, but it will be at least 10 years before gcc or MSVC propose that..
> > > gcc doesn't even propose automatically SSE2 or SSE3 TODAY, 10 years
> > > after it was announced by Intel...
> > >
> > >
> > > >>> Furthermore, there is no guaranteed win for using SEE; early SSE chips
> > > >>> simply decoded that into two FADDs, so there was no performance benefit
> > > >>> to all that work you invested in writing SSE assembly.
> > > >>
> > >
> > > That was in 2000, when it first appeared. Pleeeeeze, we are 2012 now,
> > > that is no longer relevant at all.
> > >
> > >
> > > But let's forget about that. Look at this.
> > >
> > > problem:
> > >
> > > You have a vector of 8 long long integers and you want to shift that
> > > vector by a given amount left between 1 and 63 bits.
> > >
> > > Interface:
> > >
> > > void ShiftVector(long long *vector, int AmountToShift);
> > > ; Pointer to vector in %rdi
> > > ; Amount to shift in %rsi
> > > ; gcc calling conventions for x86-64
> > > shiftvector:
> > > movq %rsi,%rcx

> > AmountToShift -> rcx
> >
> > >
> > > movq (%rdi),%rsi

> >
> > vector[0] -> rsi
> > > movq 8(%rdi),%rax

> > vector[1] -> rax
> > > shldq %cl,%rax,%rsi

> >
> > Do the shift to the left.

>
> the (rax, rsi) bit shifted result in rsi and rax not changed
> >
> > > movq %rsi,(%rdi)

> >
> > Save the rsi part. Leave the rax part.
> > >
> > > movq 16(%rdi),%rsi

> > vector[2] -> rsi
> >
> >
> > > shldq %cl,%rsi,%rax

> > Do the shift.

>
> the (rsi, rax) bit shifted result in rax and rsi not changed
> if there's an instruction to do this directly
>
>
> > > movq %rax,8(%rdi)

> >
> > Save the rax part. Leave the rsi part.
> >

>
>
> > >
> > > movq 24(%rdi),%rax
> > > shldq %cl,%rax,%rsi
> > > movq %rsi,16(%rdi)
> > >
> > > movq 32(%rdi),%rsi
> > > shldq %cl,%rsi,%rax
> > > movq %rax,24(%rdi)
> > >
> > > movq 40(%rdi),%rax
> > > shldq %cl,%rax,%rsi
> > > movq %rsi,32(%rdi)
> > >
> > > movq 48(%rdi),%rsi
> > > shldq %cl,%rsi,%rax
> > > movq %rax,40(%rdi)
> > >
> > > movq 56(%rdi),%rax
> > > shldq %cl,%rax,%rsi
> > > movq %rsi,48(%rdi)
> > >
> > > shlq %cl,%rsi
> > > movq %rsi,56(%rdi)
> > >
> > > ret
> > >
> > > 26 instructions, 97 bytes (!!!)
> > >
> > > Write that in C and see how big your procedure is, and how slow :-)


RBI, RCI, and REI not used, thus it's possible to speed up more
by dual or quadro-cores in X86.
Reply With Quote
  #96 (permalink)  
Old 01-10-2012, 04:00 PM
Wolfgang.Draxinger
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

On Fri, 23 Dec 2011 18:43:32 +0000
Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

> I'm starting a new thread, but it is prompted by this remark from
> 88888 Dihedral <dihedral88888@googlemail.com> about a short binary
> search function I posted:
>
> | Do you consider to convert 5 to 10 lines of C to assembly?
> | I did that 20 years ago on X86 cpu.
> | It was a piece of cake to gain the 20-40% speed required if paid
> | well.
>
> I've felt (it's only a feeling) that hand coding for speed (rather
> than to access features not available in C like extra wide multiplies
> and so on) was a thing of the past. But maybe I'm wrong. Is it
> still possible to get a significant speed-up over C by hand coding?


There are a few corner cases, where hand written assembly makes
compiled code see only the rear lights. But those are few and
far between. The problem is, that modern CPUs have so much side state,
that it's next to impossible to keep track of all the little details,
that may seem insignificant, but have a huge effect. Just the order of
independent instruction blocks can have a large impact.

However where hand written assembler is still useful is implementing
the inner loops of complex algorithms processing (very) large
datasets. Such algorithms usually involve only shuffling numbers around
in a strict layout, so it's easy to reason about this kind of task and
find patterns, that a compiler won't. And it allows to exploit
peculiarities of the used instruction set, a compiler never could do.
Like the one described by Dark Shikari here:

http://stackoverflow.com/a/98251/524368

> How much depends on the processor? How good are the modern
> optimising compilers that would have to be beaten?


Depends on the task at hand. If it's just about the normal execution
path of a program without many loops the compiler will most likely win,
because will put the code in a lot of configurations through some
simulation and count "clock cycles". Also there's s lot of heuristics
in it. The other case are aforementioned processing loops. You as the
algorithm, writer know about the dependencies in data access, know
where pointers can possibly point and where not (the compiler doesn't),
which allows to gain significant performance boosts in such situations.


Wolfgang

Reply With Quote
  #97 (permalink)  
Old 01-25-2012, 11:50 PM
Tim Rentsch
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

> James Harris <james.harris.1@googlemail.com> writes:
>
>> On Dec 25, 12:07 am, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
>>> James Harris <james.harri...@googlemail.com> writes:
>>> > On Dec 23, 6:43 pm, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:

>>
>> ...
>>
>>> >> size_t binsearch(int a[], size_t length, int key)
>>>
>>> > Why use size_t instead of unsigned int or unsigned long int,
>>> > especially as you add it to sum in the test harness which is long int?
>>> > Why not use unsigned for sum in the test harness?
>>>
>>> int is better yet because the output will then be same across machines.
>>> Using *any* unsigned type means the "not found" -1 return becomes a
>>> value that can vary from implementation to implementation.

>>
>> You mean -1 may have a strange value on a 1s-complement machine? On a
>> 2s-complement machine wouldn't it always be all bits set even if
>> assigned to an unsigned variable?

>
> No, the machine representation of -1 has no bearing on the result of the
> C code[1]. What I mean is just that a return of -1 has a value of
> SIZE_MAX and that varies from implementation to implementation. It's a
> good value to return in a real binary search because a valid index can't
> ever be (quite) that large -- the largest C-supported object of SIZE_MAX
> bytes can have indexes no higher than SIZE_MAX - 1.
>
> However, it's not a good return for a portable test program like this
> because the sum of the returns will vary from machine to machine.
> Alternatively, changing the simple sum to something like this:
>
> size_t idx = binsearch(data, length, key)
> sum += idx == (size_t)-1 ? -1 : idx;
>
> could have made the output more portable. [snip rest]


Presumably you meant something different for that last
line of code, since in most cases the effect would be
the same as

sum += idx;

Reply With Quote
  #98 (permalink)  
Old 01-26-2012, 12:03 AM
Tim Rentsch
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

> [snip] There was a thread here a while back about
> whether an object can be bigger than SIZE_MAX. It probably can, so
> the compiler would not be able to make the assumption [that an
> index into an array can never exceed some threshold].


Since it is the implementation (ie, the compiler) that is making
the decision (as to whether it ever creates such objects), it is
perfectly free to assume that it made whatever choice it made!
Reply With Quote
  #99 (permalink)  
Old 01-26-2012, 12:07 AM
Tim Rentsch
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

jacob navia <jacob@spamsink.net> writes:

> Le 29/12/11 02:12, Phil Carmody a @C3{A9}crit :
>>
>> Stop right there. Why can they not have their top bits set?
>> Are you presuming I can't have an array of over 2^31 ints?

>
> The original program uses "int" to index, and "int" under windows
> linux and mac is at most 32 bits. This is not a problem of assembly
> but of the original C program
>
>> Can you justify that presumption by citing a relevant C standard?

>
> Yes. In the parafraph "Translation limits" the standard says that
> an implemetation is not required to support objects with
> more than 65535 bytes.


The depth of understanding reflected in this statement
is truly awe inspiring.
Reply With Quote
  #100 (permalink)  
Old 01-26-2012, 12:23 AM
Tim Rentsch
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

"BartC" <bc@freeuk.com> writes:

> "Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
> news:87mxacyyd8.fsf@bazspaz.fatphil.org...
>> James Harris <james.harris.1@googlemail.com> writes:

>
>>> Understood. When I came to code this in 32-bit assembler I realised
>>> that, since high and low can never have their top bits set,

>>
>> Stop right there. Why can they not have their top bits set?
>> Are you presuming I can't have an array of over 2^31 ints?
>> Can you justify that presumption by citing a relevant C standard?
>> Have you failed to notice that there are x86-based boxes with
>> 256GB RAM on the market nowadays, and 8GiB arrays shouldn't
>> be considered unexpected any more.

>
> The address calculation for int-arrays usually involves multiplying by
> 2, 4 or 8.


Conceptually, but the actual semantics isn't defined in
terms of multiplication; it only says that the resulting
pointer must point to the appropriate array element.
(Incidental note: in fact the Standard even uses the
word "conceptually" in a footnote in the section on
array indexing.)

> With an index having the top bit set, that will overflow
> the size_t range. Whatever size_t happens to be.


The Standard doesn't contain any statement about the type in
which array indexing is carried out; the defining expressions
are mathematical, not given in terms of C operators. As long as
the array object referred to is big enough, array indexing has to
work even if the index is less than PTRDIFF_MIN or greater than
PTRDIFF_MAX or SIZE_MAX.
Reply With Quote
  #101 (permalink)  
Old 01-26-2012, 08:59 PM
Ben Bacarisse
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Tim Rentsch <txr@alumni.caltech.edu> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

<snip>
>> size_t idx = binsearch(data, length, key)
>> sum += idx == (size_t)-1 ? -1 : idx;

<snip>
> Presumably you meant something different for that last
> line of code, since in most cases the effect would be
> the same as
>
> sum += idx;


Yes, I intended something like

if (idx == (size_t)-1)
sum -= 1;
else sum += idx;

but turned it into an entirely not equivalent conditional expression.

--
Ben.
Reply With Quote
  #102 (permalink)  
Old 01-26-2012, 09:08 PM
Ben Bacarisse
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Tim Rentsch <txr@alumni.caltech.edu> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>
>> [snip] There was a thread here a while back about
>> whether an object can be bigger than SIZE_MAX. It probably can, so
>> the compiler would not be able to make the assumption [that an
>> index into an array can never exceed some threshold].

>
> Since it is the implementation (ie, the compiler) that is making
> the decision (as to whether it ever creates such objects), it is
> perfectly free to assume that it made whatever choice it made!


I used the word "compiler" deliberately. Where there is a unified
implementation, then I agree that one part (the optimiser) can assume
facts about other parts (for example the C library), but that's not been
my experience. Most compilers have to work without knowing what
implementation they will end up part of.

--
Ben.
Reply With Quote
  #103 (permalink)  
Old 02-01-2012, 05:20 AM
Tim Rentsch
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

> Tim Rentsch <txr@alumni.caltech.edu> writes:
>
>> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>>
>>> [snip] There was a thread here a while back about
>>> whether an object can be bigger than SIZE_MAX. It probably can, so
>>> the compiler would not be able to make the assumption [that an
>>> index into an array can never exceed some threshold].

>>
>> Since it is the implementation (ie, the compiler) that is making
>> the decision (as to whether it ever creates such objects), it is
>> perfectly free to assume that it made whatever choice it made!

>
> I used the word "compiler" deliberately.


HA! Well aren't you the crafty old dodger?

> Where there is a unified
> implementation, then I agree that one part (the optimiser) can assume
> facts about other parts (for example the C library), but that's not been
> my experience. Most compilers have to work without knowing what
> implementation they will end up part of.


Right. I assumed you were talking about a constraint imposed by
the Standard, but actually you were talking about a constraint
imposed by a desire to be included in more than one implementation,
which is rather a different kettle of fish.
Reply With Quote
  #104 (permalink)  
Old 02-12-2012, 10:18 PM
Phil Carmody
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly

Tim Rentsch <txr@alumni.caltech.edu> writes:
> jacob navia <jacob@spamsink.net> writes:
>
> > Le 29/12/11 02:12, Phil Carmody a @C3{A9}crit :
> >>
> >> Stop right there. Why can they not have their top bits set?
> >> Are you presuming I can't have an array of over 2^31 ints?

> >
> > The original program uses "int" to index, and "int" under windows
> > linux and mac is at most 32 bits. This is not a problem of assembly
> > but of the original C program
> >
> >> Can you justify that presumption by citing a relevant C standard?

> >
> > Yes. In the parafraph "Translation limits" the standard says that
> > an implemetation is not required to support objects with
> > more than 65535 bytes.

>
> The depth of understanding reflected in this statement
> is truly awe inspiring.


If we can see depth being reflected, then we must be below
that statement.

Phil
--
Unix is simple. It just takes a genius to understand its simplicity
-- Dennis Ritchie (1941-2011), Unix Co-Creator
Reply With Quote
  #105 (permalink)  
Old 02-17-2012, 03:58 PM
io_x
Guest
 
Posts: n/a
Default Re: Performance of hand-optimised assembly


"io_x" <a@b.c.invalid> ha scritto nel messaggio
news:4ef7fd4d$0$1389$4fafbaef@reader2.news.tin.it. ..

> someone answered to that says
> that people that code *all* in asm
> has not need to speed programs


than there is a better scaling for complexity too
i would know it should be so
i leave the question why to esperts...


Reply With Quote
 
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off




All times are GMT. The time now is 09:13 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.