|
|||
|
On 07-Jan-12 03:33, jacob navia wrote:
> Le 07/01/12 06:39, Stephen Sprunk a écrit : >> On 25-Dec-11 14:35, jacob navia wrote: >>> Look at this: >>> >>> void doADD(double *L, double *R, double *result,size_t n) >>> { >>> for(size_t i=0; i<n;i++) result[i]=L[i]+R[i]; >>> } >>> >>> Assembly will always do better than standard C in the above specific >>> code. You can then use the ADDPD and add 2 integers at the same time >>> (under specific loop conditions). >> >> Really? I don't know how to get GCC to assume the arrays above are >> properly aligned, though I'm sure there's a way. > > Maybe, but that is exactly my point. Why couldn't we be able to > DECLARE that those arrays are properly aligned and are suitable > for SIMD processing? In the code above, there are three pointers, which may or may not be suitably aligned for SIMD instructions. If they were declared as arrays (which is the slight modification I made), the compiler will know they're aligned. > Why should we expect that the compiler implements a HUGE and > brittle machinery to DEDUCE that instead? It's not huge or brittle; it's just one AND operation and conditional branch per pointer. GCC doesn't bother to do it, for reasons I don't understand, but it is not an unreasonable optimization given the performance benefits of aligned access. > Wouldn't it be simpler if we would just be able to declare those > in the same way that we say: > int i; > instead of expecting the commpiler to DEDUCE that because of the > usage of that variable it must be an int? Huh? What I did was take L, R and result out of the parameter list and declare them like this: double L[256], R[256], result[256]; That is sufficient for the compiler to know they're properly aligned, unlike a (double *). S -- Stephen Sprunk "God does not play dice." --Albert Einstein CCIE #3723 "God is an inveterate gambler, and He throws the K5SSS dice at every possible opportunity." --Stephen Hawking |
|
|
||||
|
||||
|
|
|
|||
|
In article <je927u$g2m$1@speranza.aioe.org>,
jacob navia <jacob@jspamsink.org> wrote: > do operation 1 > do operation 2 > ... etc My brain is warped from reading too much compiler output, so stuff like this makes sense: A part 1 A part 2 B part 1 A part 4 X part 8 D part 2 |
|
|||
|
Stephen Sprunk <stephen@sprunk.org> writes:
> On 07-Jan-12 03:33, jacob navia wrote: >> Le 07/01/12 06:39, Stephen Sprunk a écrit : >>> On 25-Dec-11 14:35, jacob navia wrote: <snip> >>>> void doADD(double *L, double *R, double *result,size_t n) >>>> { >>>> for(size_t i=0; i<n;i++) result[i]=L[i]+R[i]; >>>> } <snip> > In the code above, there are three pointers, which may or may not be > suitably aligned for SIMD instructions. If they were declared as arrays > (which is the slight modification I made), the compiler will know > they're aligned. My gcc (4.6.1) generates addpd instructions with only -O3, even when L, R and result are pointers. I wonder if the is a gcc version difference or something to do with my system (64-bit Linux). <snip> -- Ben. |
|
|||
|
On 07-Jan-12 15:03, Ben Bacarisse wrote:
> Stephen Sprunk <stephen@sprunk.org> writes: >> On 07-Jan-12 03:33, jacob navia wrote: >>> Le 07/01/12 06:39, Stephen Sprunk a écrit : >>>> On 25-Dec-11 14:35, jacob navia wrote: > <snip> >>>>> void doADD(double *L, double *R, double *result,size_t n) >>>>> { >>>>> for(size_t i=0; i<n;i++) result[i]=L[i]+R[i]; >>>>> } > <snip> >> In the code above, there are three pointers, which may or may not be >> suitably aligned for SIMD instructions. If they were declared as arrays >> (which is the slight modification I made), the compiler will know >> they're aligned. > > My gcc (4.6.1) generates addpd instructions with only -O3, even when L, > R and result are pointers. I wonder if the is a gcc version difference > or something to do with my system (64-bit Linux). Probably; my gcc is 4.2.4, which is a fairly old branch. I'll see if I can talk my sysadmin into upgrading. S -- Stephen Sprunk "God does not play dice." --Albert Einstein CCIE #3723 "God is an inveterate gambler, and He throws the K5SSS dice at every possible opportunity." --Stephen Hawking |
|
|||
|
88888 Dihedral於 2012年1月7日星期*UTC+8下午6時15分06秒 道:
> 88888 Dihedral於 2012年1月6日星期五UTC+8上午5時39分31秒 道: > > jacob navia於 2012年1月5日星期四UTC+8上午3時22分25秒 道: > > > Le 04/01/12 18:59, Stephen Sprunk a écrit : > > > > On 26-Dec-11 14:36, jacob navia wrote: > > > >> Le 26/12/11 21:24, Stephen Sprunk a écrit : > > > >>> On 25-Dec-11 14:35, jacob navia wrote: > > > >>>> Assembly will always do better than standard C in the above specific > > > >>>> code. You can then use the ADDPD and add 2 integers at the same time > > > >>>> (under specific loop conditions). > > > >>> > > > >>> ... unless you have a vectorizing compiler, which will do the exact same > > > >>> thing for you without the need for inline assembly. > > > >> > > > >> Sure. That was a simple example, and no, gcc doesn't do it, at least > > > >> in the version I have > > > > > > > > Like many GCC sub-projects, the work isn't complete, and probably isn't > > > > on by default: > > > > http://gcc.gnu.org/projects/tree-ssa/vectorization.html > > > > > > > > > > > > > Note that those instructions are available since at least 10 YEARS > > > and still mainstream compilers like MSVC or GCC do not perform any of > > > the magic you invoke... > > > > > > Sure, maybe ONE DAY IN THE FAR FUTURE they will do it, but until thenI > > > am using those instructions in my programs now... > > > > > > > > > > GCC isn't the only compiler out there, either; I've heard ICC is very > > > > good at this, for instance. > > > > > > > > > > Yes, you need ALLL THE RESOURCES of Intel Corp to build such a compiler, > > > no wonder nobody has done that besides Intel. > > > > > > >>> Also, when AVX support comes out, that vectorizing compiler will start > > > >>> using AVX instructions while your assembly is stuck using SSE. > > > >> > > > >> Sure, you can recompile your program bu I can't change a few loop > > > >> `conditions since I am "stuck" with my old program, poor me. > > > > > > > > AVX will be different instructions, different registers, etc. > > > > > > What? > > > > > > Look Stephen, you are surely knowledgable in C but in assembly... > > > Take for instance ANDPS (AND NOT PACKED Single precision) > > > > > > The corresponding AVX instructions is... > > > VANDNPS, with the same syntax same semantics. You just > > > add a "V" before the instruction mnemonic. > > > > > > All you have to do is use the ymm registers with 256 bits > > > instead of the xmm registers with 128 bits. Your loop > > > counters must be adjusted (using only half as many loops) > > > and that was it, maybe an hour of work for a big program. > > > > > > > > > > It won't > > > > be a complete rewrite from scratch, but it'll be a lot of work, whereas > > > > someone with a vectorizing compiler can just change one option and > > > > recompile. > > > > > > > > > > Yes, but it will be at least 10 years before gcc or MSVC propose that.. > > > gcc doesn't even propose automatically SSE2 or SSE3 TODAY, 10 years > > > after it was announced by Intel... > > > > > > > > > >>> Furthermore, there is no guaranteed win for using SEE; early SSE chips > > > >>> simply decoded that into two FADDs, so there was no performance benefit > > > >>> to all that work you invested in writing SSE assembly. > > > >> > > > > > > That was in 2000, when it first appeared. Pleeeeeze, we are 2012 now, > > > that is no longer relevant at all. > > > > > > > > > But let's forget about that. Look at this. > > > > > > problem: > > > > > > You have a vector of 8 long long integers and you want to shift that > > > vector by a given amount left between 1 and 63 bits. > > > > > > Interface: > > > > > > void ShiftVector(long long *vector, int AmountToShift); > > > ; Pointer to vector in %rdi > > > ; Amount to shift in %rsi > > > ; gcc calling conventions for x86-64 > > > shiftvector: > > > movq %rsi,%rcx > > AmountToShift -> rcx > > > > > > > > movq (%rdi),%rsi > > > > vector[0] -> rsi > > > movq 8(%rdi),%rax > > vector[1] -> rax > > > shldq %cl,%rax,%rsi > > > > Do the shift to the left. > > the (rax, rsi) bit shifted result in rsi and rax not changed > > > > > movq %rsi,(%rdi) > > > > Save the rsi part. Leave the rax part. > > > > > > movq 16(%rdi),%rsi > > vector[2] -> rsi > > > > > > > shldq %cl,%rsi,%rax > > Do the shift. > > the (rsi, rax) bit shifted result in rax and rsi not changed > if there's an instruction to do this directly > > > > > movq %rax,8(%rdi) > > > > Save the rax part. Leave the rsi part. > > > > > > > > > > movq 24(%rdi),%rax > > > shldq %cl,%rax,%rsi > > > movq %rsi,16(%rdi) > > > > > > movq 32(%rdi),%rsi > > > shldq %cl,%rsi,%rax > > > movq %rax,24(%rdi) > > > > > > movq 40(%rdi),%rax > > > shldq %cl,%rax,%rsi > > > movq %rsi,32(%rdi) > > > > > > movq 48(%rdi),%rsi > > > shldq %cl,%rsi,%rax > > > movq %rax,40(%rdi) > > > > > > movq 56(%rdi),%rax > > > shldq %cl,%rax,%rsi > > > movq %rsi,48(%rdi) > > > > > > shlq %cl,%rsi > > > movq %rsi,56(%rdi) > > > > > > ret > > > > > > 26 instructions, 97 bytes (!!!) > > > > > > Write that in C and see how big your procedure is, and how slow :-) RBI, RCI, and REI not used, thus it's possible to speed up more by dual or quadro-cores in X86. |
|
|||
|
On Fri, 23 Dec 2011 18:43:32 +0000
Ben Bacarisse <ben.usenet@bsb.me.uk> wrote: > I'm starting a new thread, but it is prompted by this remark from > 88888 Dihedral <dihedral88888@googlemail.com> about a short binary > search function I posted: > > | Do you consider to convert 5 to 10 lines of C to assembly? > | I did that 20 years ago on X86 cpu. > | It was a piece of cake to gain the 20-40% speed required if paid > | well. > > I've felt (it's only a feeling) that hand coding for speed (rather > than to access features not available in C like extra wide multiplies > and so on) was a thing of the past. But maybe I'm wrong. Is it > still possible to get a significant speed-up over C by hand coding? There are a few corner cases, where hand written assembly makes compiled code see only the rear lights. But those are few and far between. The problem is, that modern CPUs have so much side state, that it's next to impossible to keep track of all the little details, that may seem insignificant, but have a huge effect. Just the order of independent instruction blocks can have a large impact. However where hand written assembler is still useful is implementing the inner loops of complex algorithms processing (very) large datasets. Such algorithms usually involve only shuffling numbers around in a strict layout, so it's easy to reason about this kind of task and find patterns, that a compiler won't. And it allows to exploit peculiarities of the used instruction set, a compiler never could do. Like the one described by Dark Shikari here: http://stackoverflow.com/a/98251/524368 > How much depends on the processor? How good are the modern > optimising compilers that would have to be beaten? Depends on the task at hand. If it's just about the normal execution path of a program without many loops the compiler will most likely win, because will put the code in a lot of configurations through some simulation and count "clock cycles". Also there's s lot of heuristics in it. The other case are aforementioned processing loops. You as the algorithm, writer know about the dependencies in data access, know where pointers can possibly point and where not (the compiler doesn't), which allows to gain significant performance boosts in such situations. Wolfgang |
|
|||
|
Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> James Harris <james.harris.1@googlemail.com> writes: > >> On Dec 25, 12:07 am, Ben Bacarisse <ben.use...@bsb.me.uk> wrote: >>> James Harris <james.harri...@googlemail.com> writes: >>> > On Dec 23, 6:43 pm, Ben Bacarisse <ben.use...@bsb.me.uk> wrote: >> >> ... >> >>> >> size_t binsearch(int a[], size_t length, int key) >>> >>> > Why use size_t instead of unsigned int or unsigned long int, >>> > especially as you add it to sum in the test harness which is long int? >>> > Why not use unsigned for sum in the test harness? >>> >>> int is better yet because the output will then be same across machines. >>> Using *any* unsigned type means the "not found" -1 return becomes a >>> value that can vary from implementation to implementation. >> >> You mean -1 may have a strange value on a 1s-complement machine? On a >> 2s-complement machine wouldn't it always be all bits set even if >> assigned to an unsigned variable? > > No, the machine representation of -1 has no bearing on the result of the > C code[1]. What I mean is just that a return of -1 has a value of > SIZE_MAX and that varies from implementation to implementation. It's a > good value to return in a real binary search because a valid index can't > ever be (quite) that large -- the largest C-supported object of SIZE_MAX > bytes can have indexes no higher than SIZE_MAX - 1. > > However, it's not a good return for a portable test program like this > because the sum of the returns will vary from machine to machine. > Alternatively, changing the simple sum to something like this: > > size_t idx = binsearch(data, length, key) > sum += idx == (size_t)-1 ? -1 : idx; > > could have made the output more portable. [snip rest] Presumably you meant something different for that last line of code, since in most cases the effect would be the same as sum += idx; |
|
|||
|
Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> [snip] There was a thread here a while back about > whether an object can be bigger than SIZE_MAX. It probably can, so > the compiler would not be able to make the assumption [that an > index into an array can never exceed some threshold]. Since it is the implementation (ie, the compiler) that is making the decision (as to whether it ever creates such objects), it is perfectly free to assume that it made whatever choice it made! |
|
|||
|
jacob navia <jacob@spamsink.net> writes:
> Le 29/12/11 02:12, Phil Carmody a @C3{A9}crit : >> >> Stop right there. Why can they not have their top bits set? >> Are you presuming I can't have an array of over 2^31 ints? > > The original program uses "int" to index, and "int" under windows > linux and mac is at most 32 bits. This is not a problem of assembly > but of the original C program > >> Can you justify that presumption by citing a relevant C standard? > > Yes. In the parafraph "Translation limits" the standard says that > an implemetation is not required to support objects with > more than 65535 bytes. The depth of understanding reflected in this statement is truly awe inspiring. |
|
|||
|
"BartC" <bc@freeuk.com> writes:
> "Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message > news:87mxacyyd8.fsf@bazspaz.fatphil.org... >> James Harris <james.harris.1@googlemail.com> writes: > >>> Understood. When I came to code this in 32-bit assembler I realised >>> that, since high and low can never have their top bits set, >> >> Stop right there. Why can they not have their top bits set? >> Are you presuming I can't have an array of over 2^31 ints? >> Can you justify that presumption by citing a relevant C standard? >> Have you failed to notice that there are x86-based boxes with >> 256GB RAM on the market nowadays, and 8GiB arrays shouldn't >> be considered unexpected any more. > > The address calculation for int-arrays usually involves multiplying by > 2, 4 or 8. Conceptually, but the actual semantics isn't defined in terms of multiplication; it only says that the resulting pointer must point to the appropriate array element. (Incidental note: in fact the Standard even uses the word "conceptually" in a footnote in the section on array indexing.) > With an index having the top bit set, that will overflow > the size_t range. Whatever size_t happens to be. The Standard doesn't contain any statement about the type in which array indexing is carried out; the defining expressions are mathematical, not given in terms of C operators. As long as the array object referred to is big enough, array indexing has to work even if the index is less than PTRDIFF_MIN or greater than PTRDIFF_MAX or SIZE_MAX. |
|
|||
|
Tim Rentsch <txr@alumni.caltech.edu> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes: <snip> >> size_t idx = binsearch(data, length, key) >> sum += idx == (size_t)-1 ? -1 : idx; <snip> > Presumably you meant something different for that last > line of code, since in most cases the effect would be > the same as > > sum += idx; Yes, I intended something like if (idx == (size_t)-1) sum -= 1; else sum += idx; but turned it into an entirely not equivalent conditional expression. -- Ben. |
|
|||
|
Tim Rentsch <txr@alumni.caltech.edu> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes: > >> [snip] There was a thread here a while back about >> whether an object can be bigger than SIZE_MAX. It probably can, so >> the compiler would not be able to make the assumption [that an >> index into an array can never exceed some threshold]. > > Since it is the implementation (ie, the compiler) that is making > the decision (as to whether it ever creates such objects), it is > perfectly free to assume that it made whatever choice it made! I used the word "compiler" deliberately. Where there is a unified implementation, then I agree that one part (the optimiser) can assume facts about other parts (for example the C library), but that's not been my experience. Most compilers have to work without knowing what implementation they will end up part of. -- Ben. |
|
|||
|
Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Tim Rentsch <txr@alumni.caltech.edu> writes: > >> Ben Bacarisse <ben.usenet@bsb.me.uk> writes: >> >>> [snip] There was a thread here a while back about >>> whether an object can be bigger than SIZE_MAX. It probably can, so >>> the compiler would not be able to make the assumption [that an >>> index into an array can never exceed some threshold]. >> >> Since it is the implementation (ie, the compiler) that is making >> the decision (as to whether it ever creates such objects), it is >> perfectly free to assume that it made whatever choice it made! > > I used the word "compiler" deliberately. HA! Well aren't you the crafty old dodger? > Where there is a unified > implementation, then I agree that one part (the optimiser) can assume > facts about other parts (for example the C library), but that's not been > my experience. Most compilers have to work without knowing what > implementation they will end up part of. Right. I assumed you were talking about a constraint imposed by the Standard, but actually you were talking about a constraint imposed by a desire to be included in more than one implementation, which is rather a different kettle of fish. |
|
|||
|
Tim Rentsch <txr@alumni.caltech.edu> writes:
> jacob navia <jacob@spamsink.net> writes: > > > Le 29/12/11 02:12, Phil Carmody a @C3{A9}crit : > >> > >> Stop right there. Why can they not have their top bits set? > >> Are you presuming I can't have an array of over 2^31 ints? > > > > The original program uses "int" to index, and "int" under windows > > linux and mac is at most 32 bits. This is not a problem of assembly > > but of the original C program > > > >> Can you justify that presumption by citing a relevant C standard? > > > > Yes. In the parafraph "Translation limits" the standard says that > > an implemetation is not required to support objects with > > more than 65535 bytes. > > The depth of understanding reflected in this statement > is truly awe inspiring. If we can see depth being reflected, then we must be below that statement. Phil -- Unix is simple. It just takes a genius to understand its simplicity -- Dennis Ritchie (1941-2011), Unix Co-Creator |
|
|||
|
"io_x" <a@b.c.invalid> ha scritto nel messaggio news:4ef7fd4d$0$1389$4fafbaef@reader2.news.tin.it. .. > someone answered to that says > that people that code *all* in asm > has not need to speed programs than there is a better scaling for complexity too i would know it should be so i leave the question why to esperts... |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|