|
|||
|
James Giles <jamesgiles@worldnet.att.net> wrote:
> If the compiler make *inconsistent* use of extended precision, > Kahan's method can actually be less precise than the simple > sum. This is because Kahan's algorithm is keeping track of > the discrepancy between the simple sum and what should be > calculated and adding it back in at each step. x87 implementations are well known for inconsistent use. It is expensive to save/reload data frequently, and there aren't enough x87 registers to keep complicated expressions completely in registers. The original 8087 was supposed to be designed to have an infinite stack, with registers being spilled to memory when needed. It turned out not to be possible to actually write the needed software in the end. I don't know if the problems were fixed in later 80x87 implementations, but it doesn't seem that it has been done yet, as far as I know. -- glen |
|
|
||||
|
||||
|
|
|
|||
|
glen herrmannsfeldt wrote:
> James Giles <jamesgiles@worldnet.att.net> wrote: > >> If the compiler make *inconsistent* use of extended precision, >> Kahan's method can actually be less precise than the simple >> sum. This is because Kahan's algorithm is keeping track of >> the discrepancy between the simple sum and what should be >> calculated and adding it back in at each step. > > > x87 implementations are well known for inconsistent use. That's why I pointed that out. > It is expensive to save/reload data frequently, and there > aren't enough x87 registers to keep complicated expressions > completely in registers. But, the instruction to set the precision mode flag in the x87 is simple and fast. It doesn't restrict the exponent range, but we were talking about consistent precision. -- J. Giles "I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies." -- C. A. R. Hoare |
|
|||
|
"James Giles" <jamesgiles@worldnet.att.net> wrote in
news:xQWWf.655736$qk4.421681@bgtnsc05-news.ops.worldnet.att.net: > glen herrmannsfeldt wrote: >> James Giles <jamesgiles@worldnet.att.net> wrote: >> >>> If the compiler make *inconsistent* use of extended precision, >>> Kahan's method can actually be less precise than the simple >>> sum. This is because Kahan's algorithm is keeping track of >>> the discrepancy between the simple sum and what should be >>> calculated and adding it back in at each step. >> >> >> x87 implementations are well known for inconsistent use. > > That's why I pointed that out. > >> It is expensive to save/reload data frequently, and there >> aren't enough x87 registers to keep complicated expressions >> completely in registers. > > But, the instruction to set the precision mode flag in the x87 > is simple and fast. It doesn't restrict the exponent range, but > we were talking about consistent precision. > This can be a mixed blessing. On the example which started this thread, forcing 64-bit stores yields the wrong answer, while keeping intermediates in the x87 gives the right answer. Of course if you're using an atgorithm where you _know_ consistencey is needed, that's another matter. -- *********** To reply by e-mail, make w single in address ************** |
|
|||
|
Ian Gay wrote:
> "James Giles" <jamesgiles@worldnet.att.net> wrote .... >> But, the instruction to set the precision mode flag in the x87 >> is simple and fast. It doesn't restrict the exponent range, but >> we were talking about consistent precision. >> > This can be a mixed blessing. On the example which started this > thread, forcing 64-bit stores yields the wrong answer, while > keeping intermediates in the x87 gives the right answer. Of course > if you're using an atgorithm where you _know_ consistencey is > needed, that's another matter. In fact, Kahan's algorithm should be done in the extended precision anyway. Unless you're unrolling the loop to do the SIMD parallel instructions, the loop is simple enough that all its data should fit into the floating-point processor's stack easily (there are 16 registers in the stack). s = 0; c = 0; y = 0; t = 0 do i = 1, N ! here, the 'live' values are S and C y = x(i) - c ! here, the 'live' values are S and Y t = s + y ! here, the 'live' values are S, T, and Y c = (t-s)-y ! here, the 'live' values are T and C s = t ! this last statement is just a renaming of T for ! the next pass through the loop - it's a no op. ! here, the 'live' values are S and C just like at ! the start of the loop. end do Only the 'live' values need be on the stack. And, in the first assignment you briefly need room for x(i). Well, the most 'live' values you have at any time is three. That seems it ought to fit in a stack that's 16 deep. You have to access some of the values out of stack order, but all the x87 math instructions allow for that. So, you can keep the values on the stack and use full extended precision for the loop and be consistent. This means you can sum with effectively more precision than the simple summation loop could using quad. Of course the answer (S) will only hold the double precision part of the sum. In cases like this thread, where the actual final answer is exactly representable in double, but the intermediate calculations are not (but are still within twice the extended precision), this gives just what you need. Cost: It's four adds per loop instead of one - that's the main cost. But there should be no more memory traffic than for the simple summation. Whether compilers find this implementation is a different story. At least a *little* data flow analysis is required. That cost is to test the compiler and choose a good one (even if it's not "free"). -- J. Giles "I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies." -- C. A. R. Hoare |
|
|||
|
James Giles <jamesgiles@worldnet.att.net> wrote:
(snip regarding precision of sums, extended precision, and inconsistent precision) > In fact, Kahan's algorithm should be done in the extended > precision anyway. Unless you're unrolling the loop to do > the SIMD parallel instructions, the loop is simple enough > that all its data should fit into the floating-point processor's > stack easily (there are 16 registers in the stack). Last I knew, it was eight. > s = 0; c = 0; y = 0; t = 0 > do i = 1, N > ! here, the 'live' values are S and C > y = x(i) - c > ! here, the 'live' values are S and Y > t = s + y > ! here, the 'live' values are S, T, and Y > c = (t-s)-y > ! here, the 'live' values are T and C > s = t > ! this last statement is just a renaming of T for > ! the next pass through the loop - it's a no op. > ! here, the 'live' values are S and C just like at > ! the start of the loop. > end do I believe one of the causes of stack flush is a subroutine call, where the called routine will expect to have the full stack. While the data could be stored in extended precision temporaries around a call, I don't know any that do that. > Only the 'live' values need be on the stack. And, in the > first assignment you briefly need room for x(i). Well, > the most 'live' values you have at any time is three. That > seems it ought to fit in a stack that's 16 deep. You have to > access some of the values out of stack order, but all the > x87 math instructions allow for that. My guess is you have a good chance, at least with some optimizations turned on. (snip) > Cost: It's four adds per loop instead of one - that's the main cost. > But there should be no more memory traffic than for the simple > summation. Whether compilers find this implementation is a > different story. At least a *little* data flow analysis is required. > That cost is to test the compiler and choose a good one (even if > it's not "free"). -- glen |
|
|||
|
glen herrmannsfeldt wrote:
> James Giles <jamesgiles@worldnet.att.net> wrote: > (snip regarding precision of sums, extended precision, > and inconsistent precision) > >> In fact, Kahan's algorithm should be done in the extended >> precision anyway. Unless you're unrolling the loop to do >> the SIMD parallel instructions, the loop is simple enough >> that all its data should fit into the floating-point processor's >> stack easily (there are 16 registers in the stack). > > Last I knew, it was eight. You're right. I was in a hurry and didn't check. (Shows how often I do any float assembly on an x87!) Still, only three registers are needed. Eight seems enough. > I believe one of the causes of stack flush is a subroutine call, > where the called routine will expect to have the full stack. > While the data could be stored in extended precision temporaries > around a call, I don't know any that do that. Of course, there are no procedure calls in the Kahan summation algorithm, an any interrupt handler that changes the content of *any* registers is just broken. -- J. Giles "I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies." -- C. A. R. Hoare |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|