|
|||
|
The g95 status page mentions that support for quad precision
arithmetic in g95 is `coming soon...'.

I can't find anything related to quadruple precision arithmetic
on the gfortran site.

Can somebody comment on the status of the quadruple precision
arithmetic support in both g95 and gfortran? I assume it isn't
available yet? Anybody who can tell me how far away we are from it?

Regards,
Bart

-- 
"Share what you know. Learn what you don't."
|
|||
|
Bart Vandewoestyne wrote:
> The g95 status page mentions that support for quad precision
> arithmetic in g95 is `coming soon...'.
>
> I can't find anything related to quadruple precision arithmetic
> on the gfortran site.
>
> Can somebody comment on the status of the quadruple precision
> arithmetic support in both g95 and gfortran? I assume it isn't
> available yet? Anybody who can tell me how far away we are from it?
>
You may have to search the fortran@gcc.gnu.org list archives (to learn
about gfortran progress) yourself, as it's not clear what you want.
Surely, the question of quad arithmetic will always be target dependent.
gfortran presumably already supports it on a few targets where it's
readily available, but those aren't the most popular ones.
ia64 quad appears to depend mainly on someone doing the work on ia64.md,
where there is support for 80-bit but not 128-bit floating point. In
case you don't see the point, this is a language independent part of
gcc. I wouldn't bet on that changing, even with all the publicity about
financial commitments to gcc development.
|
|||
|
On 2006-03-27, Tim Prince <tprince@nospamcomputer.org> wrote:
>
> You may have to search the fortran@gcc.gnu.org list archives (to learn
> about gfortran progress) yourself, as it's not clear what you want.
> Surely, the question of quad arithmetic will always be target dependent.
> gfortran presumably already supports it on a few targets where it's
> readily available, but those aren't the most popular ones.

Sorry, you are right... I should specify more, I guess... If I say
'quadruple precision on i386 architectures', does that make my question
clearer?

Actually, what I basically would like to know is when g95 will be able
to compute with the same maximum precision as, for example, ifort 9.0
can on my Linux i386 box.

Suppose I use the following numeric kinds:

  integer, parameter, public :: sp = kind(1.0)
  integer, parameter, public :: dp = selected_real_kind(2*precision(1.0_sp))
  integer, parameter, public :: qp_preferred = &
       selected_real_kind(2*precision(1.0_dp))
  integer, parameter, public :: qp = (1+sign(1,qp_preferred))/2*qp_preferred+ &
       (1-sign(1,qp_preferred))/2*dp

then qp is available and has a higher precision than dp if I compile
with ifort 9.0 on my Debian GNU/Linux box on i386. If I compile this
with g95, then qp is the same numeric kind as dp.

Regards,
Bart

-- 
"Share what you know. Learn what you don't."
|
|||
|
Bart,

right now, g95 doesn't support real*16, but does support real*10 on the
appropriate hardware. So I think you could use

  qp_preferred = selected_real_kind(1+precision(1.0_dp))

to get real*10, but nothing more than that. An important issue is of
course that if the hardware doesn't support real*16 it is not only very
slow to compute with real*16 quantities, but also a lot of work to
implement all operations required for real*16 calculations.

Joost
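The fallback idea in the two posts above (ask for quad, settle for extended, settle for double) can be written as a chain of kind selections. What follows is only a sketch, not code from anyone in the thread: the names ep_try, qp_try, ep and wp are made up for illustration, and using merge in a constant expression assumes a Fortran 2003 or later compiler (Bart's sign() arithmetic does the same job under Fortran 95 rules).

```fortran
program pick_widest_kind
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! selected_real_kind returns a negative value when no kind qualifies.
  integer, parameter :: ep_try = selected_real_kind(precision(1.0_dp) + 1)
  integer, parameter :: qp_try = selected_real_kind(2*precision(1.0_dp))
  ! Fall back step by step: quad, then extended, then double.
  ! (merge in a constant expression assumes Fortran 2003 or later.)
  integer, parameter :: ep = merge(ep_try, dp, ep_try > 0)
  integer, parameter :: wp = merge(qp_try, ep, qp_try > 0)
  real(wp) :: one_third

  one_third = 1.0_wp / 3.0_wp
  print *, 'double kind  =', dp, ' digits =', precision(1.0_dp)
  print *, 'working kind =', wp, ' digits =', precision(1.0_wp)
  print *, '1/3 =', one_third
end program pick_widest_kind
```

Which kind wp ends up being depends entirely on the compiler and target, so the printed digits are the thing to check rather than the kind number itself.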
|
|||
|
Computing with quad precision is really slow because it is implemented
in software, at least with the Intel Fortran Compiler. It is implemented
in software because there is no system (is there?) that supports 128-bit
floating point arithmetic. In the case of extended precision (80-bit fp
arithmetic), several systems have hardware support for this; that's true
for IA32, EM64T, AMD64, etc. So you can expect a smaller performance hit
from extended precision than from quad precision, when compared to
double precision.

Bernhard.
|
|||
|
Bernhard Enders <bgeneto@gmail.com> wrote:
> Computing with quad precision is really slow by the fact that it is
> software implemented, at least with Intel Fortran Compiler. It is
> software implemented because the is no system (is there?) that supports
> 128 bits floating point arithmetics.

IBM S/370 supports 128-bit floating point, except for divide, which is
done in software. In ESA/390 and later, divide is done in hardware.
The 360/85 was the original machine supporting this format.

VAX has H-float, but it is done through software emulation on most
models. I was told that the VAX 11/730 supports it (in microcode),
the slowest of the non-micro VAXs.

-- glen
|
|||
|
Bernhard Enders wrote:
> Computing with quad precision is really slow by the fact that it is
> software implemented, at least with Intel Fortran Compiler. It is
> software implemented because the is no system (is there?) that supports
> 128 bits floating point arithmetics. In the case of extended precision
> (80 bits fp arithmetics), several systems has hardware support for
> this, that's true for IA32, EM64T, AMD64, etc. So you can expect less
> performance hit from extended precision than from quad precision, when
> compared to double precision.
>
Several architectures include hardware support for 128-bit floating
point by combinations of instructions. Among those still in production
are IA64 and IBM Power. The former has both 80-bit and 128-bit
IEEE-style, generally at most one of which is implemented with a single
set of options. The latter has a non-IEEE compliant version of somewhat
less precision.
The 80-bit arithmetic might be implemented with 128-bit storage (48 bits
unused), due to the alignment requirements for efficiency.
|
|||
|
Thanks for your information concerning 64-bit architectures. This is a
bit OT. I would like to know where I can read more technical
information about the "alignment requirements for efficiency", i.e.,
the fact that 80-bit calculations are performed using 128-bit
registers (I have heard about this but can't remember where). And why
in the world don't we have 128-bit arithmetic on 'popular'
architectures such as ia32 or AMD64 if 128-bit registers exist on
these architectures? Is it so difficult (or costly) to implement
128-bit operations, or is there no interest in doing this?

Just for information, here is an excerpt from the AMD64 architecture
manual vol. 1 showing that it has 16 x 128-bit registers with media
instructions (why no fp support at all??):

"The AMD64 architecture provides three floating-point instruction
subsets, using three distinct register sets:
- 128-Bit Media Instructions support 32-bit single-precision and 64-bit
double-precision floating-point operations, in addition to integer
operations."

Best regards,
Bernhard.
|
|||
|
Bernhard Enders wrote:
> Thanks for your information concerning 64 bits architecture. This is a
> bit OT. I would like to know where can I read more technical
> information about the "alignment requirements for efficiency", i.e.,
> the fact that 80 bits calculations are performed using 128 bits
> registers (I have heard about this but can't remember where)? And why
> in the world we don't have 128 bits arithmetics on 'popular'
> architectures such as ia32 or on AMD64 if there exist 128 bits
> registers on these architectures? Is it so difficult (or costly) to
> implement 128 bits operations or there are no interest in doing this?
> Just for information, it follows an exerpt from AMD64 architecture
> manual vol. 1 showing that it has 16x128bits registers with media
> instructions (why no fp support at all??):
>
> "The AMD64 architecture provides three floating-point instruction
> subsets, using three distinct register sets:
> - 128-Bit Media Instructions support 32-bit single-precision and 64-bit
> double-precision floating-point operations, in addition to integer
> operations."
>
The 80-bit fp calculations do use the 80-bit x87 registers. You may
find something about the recommendation for 128-bit data alignment in
the CPU manufacturers' software guides. A packed array of 80-bit data
would involve frequent multiple accesses to cache and memory, including
data which straddle cache line boundaries. Few Fortran compilers support
this format, in spite of it being supported in nearly all C compilers
for linux (but few for Windows).
Most Fortran compilers now do support the 128-bit parallel mode with
auto-vectorization, but that is a big diversion from the original topic
of quad precision. No one calls the 4 simultaneous single (or paired
double) precision operations "quad precision". The details of
implementation vary with hardware type; full width parallel operation on
Intel desktop CPUs, paired 64-bit width floating point units on AMD
desktops, splitting into a pair of closely pipelined 64-bit operations
on pentium-m, all with the same binary software code.
|
|||
|
You are right about the cost of QP.

Skip Knoble

Program Qtime
! Sample Program to illustrate DP versus QP compute times:
! Intel Fortran V9.0-5748 on AMD Opteron 852 with O2.

  integer, parameter :: QDP = selected_real_kind(30)
  real(kind=QDP) :: x, sum
  real :: T1,T2, Seconds
  integer :: i, pulses, PPS

  x=1.5_QDP
  sum=0.0_QDP
  CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS)
  T1 = REAL(Pulses,QDP)/PPS

  Do I=1,100000000
    sum=sum+I*x
  end do

  CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS)
  T2 = REAL(Pulses,QDP)/PPS
  Seconds=T2-T1
  print *, " Time in seconds: ",Seconds
  print *, "QDP=",QDP
  print *, "Sum=",Sum

end Program Qtime


Output for: integer, parameter :: QDP = selected_real_kind(15)

 Time in seconds: 0.5800781
 QDP= 8
 Sum= 7.500000080627340E+015

Output for: integer, parameter :: QDP = selected_real_kind(30)

 Time in seconds: 5.750000
 QDP= 16
 Sum= 7500000075000000.00000000000000000

On Tue, 28 Mar 2006 14:20:04 GMT, Tim Prince
<tprince@nospamcomputer.org> wrote:

-|Bernhard Enders wrote:
-|> Thanks for your information concerning 64 bits architecture. This is a
-|> bit OT. I would like to know where can I read more technical
-|> information about the "alignment requirements for efficiency", i.e.,
-|> the fact that 80 bits calculations are performed using 128 bits
-|> registers (I have heard about this but can't remember where)? And why
-|> in the world we don't have 128 bits arithmetics on 'popular'
-|> architectures such as ia32 or on AMD64 if there exist 128 bits
-|> registers on these architectures? Is it so difficult (or costly) to
-|> implement 128 bits operations or there are no interest in doing this?
-|> Just for information, it follows an exerpt from AMD64 architecture
-|> manual vol. 1 showing that it has 16x128bits registers with media
-|> instructions (why no fp support at all??):
-|>
-|> "The AMD64 architecture provides three floating-point instruction
-|> subsets, using three distinct register sets:
-|> - 128-Bit Media Instructions support 32-bit single-precision and 64-bit
-|> double-precision floating-point operations, in addition to integer
-|> operations."
-|>
-|The 80-bit fp calculations do use the 80-bit x87 registers. You may
-|find something about the recommendation for 128-bit data alignment in
-|the CPU manufacturers' software guides. A packed array of 80-bit data
-|would involve frequent multiple accesses to cache and memory, including
-|data which straddle cache line boundaries. Few Fortran compilers support
-|this format, in spite of it being supported in nearly all C compilers
-|for linux (but few for Windows).
-|Most Fortran compilers now do support the 128-bit parallel mode with
-|auto-vectorization, but that is a big diversion from the original topic
-|of quad precision. No one calls the 4 simultaneous single (or paired
-|double) precision operations "quad precision". The details of
-|implementation vary with hardware type; full width parallel operation on
-|Intel desktop CPUs, paired 64-bit width floating point units on AMD
-|desktops, splitting into a pair of closely pipelined 64-bit operations
-|on pentium-m, all with the same binary software code.
|
|||
|
Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
news:4dni22t33u3i6u6ctruig5iltu9he5d6im@4ax.com:

> You are right about the cost of QP. Skip Knoble
>
> Program Qtime
> ! Sample Program to illustrate DP versus QP compute times:
> ! Intel Fortran V9.0-5748 on AMD Opteron 852 with O2.
>
>   integer, parameter :: QDP = selected_real_kind(30)
>   real(kind=QDP) :: x, sum
>   real :: T1,T2, Seconds
>   integer :: i, pulses, PPS
>
>   x=1.5_QDP
>   sum=0.0_QDP
>   CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS)
>   T1 = REAL(Pulses,QDP)/PPS
>
>   Do I=1,100000000
>     sum=sum+I*x
>   end do
>
>   CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS)
>   T2 = REAL(Pulses,QDP)/PPS
>   Seconds=T2-T1
>   print *, " Time in seconds: ",Seconds
>   print *, "QDP=",QDP
>   print *, "Sum=",Sum
>
> end Program Qtime
>
>
> Output for: integer, parameter :: QDP = selected_real_kind(15)
>
>  Time in seconds: 0.5800781
>  QDP= 8
>  Sum= 7.500000080627340E+015
>

Why doesn't double precision produce a better result here?

> Output for: integer, parameter :: QDP = selected_real_kind(30)
>  Time in seconds: 5.750000
>  QDP= 16
>  Sum= 7500000075000000.00000000000000000
>
> On Tue, 28 Mar 2006 14:20:04 GMT, Tim Prince
> <tprince@nospamcomputer.org> wrote:
>
> -|Bernhard Enders wrote:
>
<snip>

-- 
*********** To reply by e-mail, make w single in address **************
|
|||
|
Ian Gay <gay@sfuu.ca> wrote:

> Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
> news:4dni22t33u3i6u6ctruig5iltu9he5d6im@4ax.com:
...
> >   Do I=1,100000000
> >     sum=sum+I*x
> >   end do
...
> >  Sum= 7.500000080627340E+015
>
> Why doesn't double precision produce a better result here?

Why should it produce a better result? If my quick check is correct (and
I might have missed because it isn't very far off and my check was
pretty hasty) this goes past the limits for which IEEE double gives
perfect results. Therefore, you'll have round-off errors in the
addition. And since there are quite a lot of additions (100 million of
them), those round-off errors can work their way up by quite a few bits
from the low order one.

Are you perhaps assuming that just because double has about 15 digits of
precision, that you can count on roundoff always staying down in the
bottom few bits? If so, I suggest reading up on the subject of numerical
instability.

Sounds to me like just a typical case of life in the real world of
floating point arithmetic.

-- 
Richard Maine                     | Good judgment comes from experience;
email: my first.last at org.domain| experience comes from bad judgment.
org: nasa, domain: gov            |       -- Mark Twain
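To put numbers on the "limits" mentioned above: the exact sum is 1.5*n*(n+1)/2 with n = 100000000, i.e. 7500000075000000, which is what the quad run printed. IEEE double has a 53-bit significand, so from 2**52 (about 4.5e15) upward neighbouring doubles are 1 apart, while many partial sums of multiples of 1.5 end in .5. The sketch below (program and variable names are made up for illustration) just asks the compiler for the spacing on either side of that boundary. The last line is only indicative, since a compiler that keeps intermediates in wider registers may print something different.

```fortran
program spacing_demo
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  real(dp) :: below, above, s

  below = 2.0_dp**51      ! about 2.25e15
  above = 2.0_dp**52      ! about 4.5e15, still under the final sum of about 7.5e15
  ! spacing(x) is the gap between x and the next representable number.
  print *, 'spacing at 2**51 =', spacing(below)
  print *, 'spacing at 2**52 =', spacing(above)
  ! Adding 1.5 to a value this large cannot keep the .5 if the gap is 1.
  s = above + 1.5_dp
  print *, '(2**52 + 1.5) - 2**52 =', s - above
end program spacing_demo
```

Once the running sum passes 2**52, roughly every other addition in the loop has to round, which is consistent with an error well above the last bit after 100 million additions.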
|
|||
|
On Tue, 28 Mar 2006 09:54:09 -0800, nospam@see.signature (Richard E Maine) wrote:

-|Ian Gay <gay@sfuu.ca> wrote:
-|
-|> Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
-|> news:4dni22t33u3i6u6ctruig5iltu9he5d6im@4ax.com:
-|...
-|> >   Do I=1,100000000
-|> >     sum=sum+I*x
-|> >   end do
-|...
-|> >  Sum= 7.500000080627340E+015
-|>
-|> Why doesn't double precision produce a better result here?
-|
-|Why should it produce a better result? If my quick check is correct (and
-|I might have misssed because it isn't very far off and my check was
-|pretty hasty) this goes past the limits for which IEEE double gives
-|perfect results. Therefore, you'll have round-off errors in the
-|addition. And since there are quite a lot of additions (100 million of
-|them), those round-off errors can work their way up by quite a few bits
-|from the low order one.

Richard: As always, thank you.

Ian, here are some additional notes and code that you may wish to check
out that may help you with this and other more complex cases.

First, I completely agree with Richard's analysis, namely that such sums
are not the best numerical computations. But, a loop like

  sum=0
  delta=(a fraction not representable in binary, like .1 for example)
  do i=1,n
    sum=sum+delta   ! method 1
  end do

will accumulate the representation error (of delta). From long
experience we know that representation error (of decimal fractions)
can also be magnified by subtracting two nearly equal quantities; the
most significant digits cancel during the subtraction, and the least
significant digits (where various numerical errors can be) become the
most significant digits. The example (program fuzztest)
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/eps.f90 and
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/example1.txt
illustrate this somewhat dramatically.

The loop

  do i=1,n
    sum=sum+I*delta   ! method 2
  end do

may have round off error but at least will not maximize the effect of
(accumulate) representation error as the above method 1 loop does.

There's also a better way to sum, as Kahan (and Giles) point out at:
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/KahnSum.f90

-|
-|Are you perhaps assuming that just because double has about 15 digits of
-|precision, that you can count on roundoff always staying down in the
-|bottom few bits? If so, I suggest reading up on the subject of numerical
-|instability.
-|
-|Sounds to me like just a typical case of life in the real world of
-|floatting point arithmetic.

I agree with Richard here also. I included displaying the sum realizing
that roundoff would likely happen. The real purpose for the posting was
to illustrate the order of magnitude difference in computation time
between Double and Quadruple precision, using the Intel compiler, which
supports Real*16 (and Complex*32).

I'd guess that a compiler can do a quad software implementation faster
than using Quad.f90:
http://users.bigpond.net.au/amiller/quad.html
which uses a Fortran derived Type (quad).

All the best.
Skip Knoble
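For readers who cannot reach the linked file, here is a minimal sketch of compensated (Kahan) summation next to the plain method 1 loop. It is written from the textbook description of the algorithm, not copied from KahnSum.f90, and the names (delta, naive, s, c, y, t) and the choice of n are only illustrative.

```fortran
program kahan_demo
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: n = 10000000
  real(dp) :: delta, naive, s, c, y, t
  integer :: i

  delta = 0.1_dp          ! not exactly representable in binary
  naive = 0.0_dp
  s = 0.0_dp              ! compensated running sum
  c = 0.0_dp              ! running estimate of the lost low-order part

  do i = 1, n
     naive = naive + delta        ! method 1 from the post
     y = delta - c                ! apply the correction from the last step
     t = s + y                    ! low-order bits of y may be lost here
     c = (t - s) - y              ! recover what was lost (algebraically zero)
     s = t
  end do

  print *, 'n*delta     =', n*delta
  print *, 'naive sum   =', naive
  print *, 'compensated =', s
end program kahan_demo
```

One caveat: the correction term is algebraically zero, so options that let the compiler reassociate floating point expressions (value-unsafe or "fast math" style flags) can optimize it away. It is worth checking the result with such options turned off.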
|
|||
|
nospam@see.signature (Richard E Maine) wrote in
news:1hcwlu7.17lxrqj1vfm668N%nospam@see.signature:

> Ian Gay <gay@sfuu.ca> wrote:
>
>> Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
>> news:4dni22t33u3i6u6ctruig5iltu9he5d6im@4ax.com:
> ...
>> >   Do I=1,100000000
>> >     sum=sum+I*x
>> >   end do
> ...
>> >  Sum= 7.500000080627340E+015
>>
>> Why doesn't double precision produce a better result here?
>
> Why should it produce a better result? If my quick check is
> correct (and I might have misssed because it isn't very far off
> and my check was pretty hasty) this goes past the limits for which
> IEEE double gives perfect results. Therefore, you'll have
> round-off errors in the addition. And since there are quite a lot
> of additions (100 million of them), those round-off errors can
> work their way up by quite a few bits from the low order one.
>

Recall that x had the value 1.5.
I was thinking that since 1.5 is exactly representable in binary
floating point (and assuming a good compiler would represent it
exactly :-)), the errors would be much smaller than this. On
further thought, I see that this is a (carefully constructed?)
pathological example of errors from denormalization of the smaller
argument to a floating add.

For amusement:
(qt is the op's program set for double precision)
(Windows xp on athlon)

C:\source\test>g95 -o qt qt.f95

C:\source\test>qt
 Time in seconds: 0.5781
 QDP= 8
 Sum= 7.5000000806273400D+15

C:\source\test>g95 -O1 -o qt qt.f95

C:\source\test>qt
 Time in seconds: 0.2187
 QDP= 8
 Sum= 7.5000000750000000D+15

If you don't specify optimization, g95 loads and stores SUM each
cycle. If you optimize, it's kept in the 80-bit floating point
stack, so you get the advantage of the longer accumulator, as well
as the speedup. (Unless you're foolish enough to force sse2
evaluation).

> Are you perhaps assuming that just because double has about 15
> digits of precision, that you can count on roundoff always staying
> down in the bottom few bits? If so, I suggest reading up on the
> subject of numerical instability.
>
> Sounds to me like just a typical case of life in the real world of
> floatting point arithmetic.
>

-- 
*********** To reply by e-mail, make w single in address **************
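The effect of the longer accumulator described above can be requested explicitly instead of being left to the optimizer. The following is only a sketch under the assumption that the compiler offers an extended kind through selected_real_kind(18); where it does not, the code falls back to double and both sums will agree. The names ep_try, ep, sum_dp and sum_ep are made up, and merge in a constant expression assumes Fortran 2003 or later.

```fortran
program wide_accumulator
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: ep_try = selected_real_kind(18)
  ! Falls back to double when no wider kind exists (merge here assumes F2003+).
  integer, parameter :: ep = merge(ep_try, dp, ep_try > 0)
  real(dp) :: x, sum_dp
  real(ep) :: sum_ep
  integer :: i

  x = 1.5_dp
  sum_dp = 0.0_dp
  sum_ep = 0.0_ep
  do i = 1, 100000000
     sum_dp = sum_dp + i*x              ! accumulator declared double
     sum_ep = sum_ep + real(i*x, ep)    ! accumulator declared wider
  end do
  print *, 'double accumulator:', sum_dp
  print *, 'wide accumulator  :', sum_ep
end program wide_accumulator
```

Declaring the accumulator kind makes the result depend less on the optimization level, although as the g95 runs above show, the double accumulator may still be held in a wider register when optimizing.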
|
|||
|
Ian Gay <gay@sfuu.ca> wrote:

> nospam@see.signature (Richard E Maine) wrote in
> news:1hcwlu7.17lxrqj1vfm668N%nospam@see.signature:

> > Why should it produce a better result? If my quick check is
> > correct (and I might have misssed because it isn't very far off
> > and my check was pretty hasty) this goes past the limits for which
> > IEEE double gives perfect results...

> Recall that x had the value 1.5.
> I was thinking that since 1.5 is exactly representable in binary...

Just because 1.5 is exactly representable in IEEE double does not mean
that all the numbers involved in the calculation are. Specifically,
these numbers get big enough that sum is not exactly representable. But
it sounds like you now see that.

> On
> further thought, I see that this is a (carefully constructed?)
> pathological example of errors from denormazlization of the smaller
> argument to a floating add.

While I think you have the facts right, I'm not so sure about your
evaluation of it as being carefully constructed or pathological. I'd say
it was much more like typical of floating point roundoff issues.

-- 
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain
|