Go Back   Rhinocerus > Newsgroup > Newsgroup comp.lang.* 1 > Newsgroup comp.lang.ada

Reply
 
Thread Tools Display Modes
  #1 (permalink)  
Old 06-29-2012, 09:17 AM
Keean Schupke
Guest
 
Posts: n/a
Default GNAT (GCC) Profile Guided Compilation

I am developing a Monte-Carlo simulation engine, and I have implementationsin C++ and Ada. All performance figures come from the same reference hardware. Both are built using GCC, and similar optimisation is applied to each (same optimisation level, cpu arch, inlining etc).

My C++ implementation achieves 40,000 simulations per second.

My Ada port initially achieves 32,000 iterations per second.

I refactored the Ada port to use System.Address to access the main data structure (as in the 'C++' version benchmarking showed a clear performance benefit to *ptr_x compared to array[x])

The Ada port using System.Address achieves 40,000 simulations per second - so this is looking good.

However, when I use profile guided compilation, the C++ code performance leaps to 56,000 simulations per second, whereas the Ada only achieves 44,000 simulations per second.

Anyone have any ideas why profile guided compilation performs so poorly with Ada compared to C++? The initial with normal compilation seemed very promising.


Cheers,
Keean.
Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

  #2 (permalink)  
Old 06-29-2012, 09:34 AM
Dmitry A. Kazakov
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:

> Anyone have any ideas why profile guided compilation performs so poorly
> with Ada compared to C++?


Probably because you programmed it using low level C-ish stuff like
addresses instead of accessing data directly. In general, if you write a C
program in Ada (or any other language), you cannot expect it do better than
in C. To beat C, the precondition is to design it in Ada way.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
Reply With Quote
  #3 (permalink)  
Old 06-29-2012, 10:01 AM
Keean Schupke
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Friday, 29 June 2012 10:34:19 UTC+1, Dmitry A. Kazakov wrote:
> On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:
>
> > Anyone have any ideas why profile guided compilation performs so poorly
> > with Ada compared to C++?

>
> Probably because you programmed it using low level C-ish stuff like
> addresses instead of accessing data directly. In general, if you write a C
> program in Ada (or any other language), you cannot expect it do better than
> in C. To beat C, the precondition is to design it in Ada way.
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de


I don't think you are right here, but I would be happy to be proved wrong.

In general I find the choice of algorithm dominates performance. In this case most of the performance is dominated by a disjoint-set (union-find) algorithm operating on an array of data. The data nominally represents a 2D space (although I am using a 1D array to avoid the cost of multiplication on every access).

Most operations access the neighbouring cells, hence require a +/- one cell, or +/- one row operations.

Maintaining the address of the current cell is faster than using indexed addressing (mainly because of the extra parameter that needs to be passed around increasing the pressure on register allocation).

I would be interested to understand how an algorithm can be implemented in a more Ada way?


Cheers,
Keean.
Reply With Quote
  #4 (permalink)  
Old 06-29-2012, 10:24 AM
Keean Schupke
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Friday, 29 June 2012 11:01:30 UTC+1, Keean Schupke wrote:
> On Friday, 29 June 2012 10:34:19 UTC+1, Dmitry A. Kazakov wrote:
> > On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:
> >
> > > Anyone have any ideas why profile guided compilation performs so poorly
> > > with Ada compared to C++?

> >
> > Probably because you programmed it using low level C-ish stuff like
> > addresses instead of accessing data directly. In general, if you write a C
> > program in Ada (or any other language), you cannot expect it do better than
> > in C. To beat C, the precondition is to design it in Ada way.
> >
> > --
> > Regards,
> > Dmitry A. Kazakov
> > http://www.dmitry-kazakov.de

>
> I don't think you are right here, but I would be happy to be proved wrong..
>
> In general I find the choice of algorithm dominates performance. In this case most of the performance is dominated by a disjoint-set (union-find) algorithm operating on an array of data. The data nominally represents a 2D space (although I am using a 1D array to avoid the cost of multiplication onevery access).
>
> Most operations access the neighbouring cells, hence require a +/- one cell, or +/- one row operations.
>
> Maintaining the address of the current cell is faster than using indexed addressing (mainly because of the extra parameter that needs to be passed around increasing the pressure on register allocation).
>
> I would be interested to understand how an algorithm can be implemented in a more Ada way?
>
>
> Cheers,
> Keean.


To facilitate the discussion I will post the core packages from the implementation. First the slower version using array indexing (this is the 32k simulations per second version). As my application is lookup dominated, this algorithm (the eager disjoint set algorithm) is faster than the WQUPC (Weighted Quick Union with Path Compression) algorithm for my application.



-- (C)2011 Keean Schupke, all rights reserved.
generic
type Element_Type is limited private;
type Element_Access is not null access all Element_Type;
type Set_Index is range <>;
package Eager_Disjointsets is
type Set_Type is limited private;
type Set_Access is not null access all Set_Type;
function Find(
Set : in Set_Access;
Index : in Set_Index
) return Set_Index;
pragma Inline_Always(Find);

procedure Makeset(
Set : in Set_Access;
Index : in Set_Index
);
pragma Inline_Always(Makeset);

procedure Link(
Set : in Set_Access;
Left, Right : in out Set_Index
);
pragma Inline_Always(Link);

function Get(
Set : in Set_Access;
Index : in Set_Index
) return Element_Access;
pragma Inline_Always(Get);

function Next(
Set : in Set_Access;
Index : in Set_Index
) return Set_Index;
pragma Inline_Always(Next);

procedure Set_Next(
Set : in Set_Access;
Index : in Set_Index;
Next : in Set_Index
);
pragma Inline_Always(Set_Next);

private
type Node_Type is limited record
Canonical : Set_Index;
Successor : Set_Index;
Size : Natural;
Value : aliased Element_Type;
end record;
type Set_Type is array (Set_Index) of Node_Type;
end Eager_Disjointsets;

package body Eager_Disjointsets is
procedure Makeset(
Set : in Set_Access;
Index : in Set_Index
) is begin
Set(Index).Canonical := Index;
Set(Index).Size := 1;
end Makeset;

function Find(
Set : in Set_Access;
Index : in Set_Index
) return Set_Index is begin
return Set(Index).Canonical;
end Find;

procedure Swap_Indexes is new Swap(Set_Index);

procedure Link(
Set : in Set_Access;
Left, Right : in out Set_Index) is
begin
if Set(Left).Size <= Set(Right).Size then
Swap_Indexes(Left, Right);
end if;

declare
Index : Set_Index := Right;
begin
Link_Loop : loop
Set(Index).Canonical := Left;
Index := Set(Index).Successor;
exit Link_Loop when Index = Right;
end loop Link_Loop;
end;

Add(Set(Left).Size, Set(Right).Size);

Swap_Indexes(Set(Left).Successor, Set(Right).Successor);
end Link;

function Get(
Set : in Set_Access;
Index : in Set_Index
) return Element_Access is begin
return Set(Index).Value'Access;
end Get;

function Next(
Set : in Set_Access;
Index : in Set_Index
) return Set_Index is begin
return Set(Index).Successor;
end Next;

procedure Set_Next(
Set : in Set_Access;
Index : in Set_Index;
Next : in Set_Index
) is begin
Set(Index).Successor := Next;
end Set_Next;
end Eager_Disjointsets;



And this is the faster 40k simulations per second version:



-- (C)2011 Keean Schupke, all rights reserved.
generic
Size : Int;
type Element_Type is limited private;
type Element_Access is not null access all Element_Type;
package Eager_Disjointsets is
type Set_Type is limited private;
type Set_Access is not null access all Set_Type;
type Node_Type is limited private;
type Cursor is access all Node_Type;

function Find(Node : in Cursor) return Cursor;
pragma Inline_Always(Find);
procedure Makeset(Node : in Cursor);
pragma Inline_Always(Makeset);
procedure Link(Left, Right : in out Cursor);
pragma Inline_Always(Link);

function Succ(Node : in Cursor) return Cursor;
pragma Inline_Always(Succ);
procedure Set_Succ(Node, Succ : in Cursor);
pragma Inline_Always(Set_Succ);

function First(Set : in Set_Access) return Cursor;
pragma Inline_Always(First);
function Element(Node : in Cursor) return Element_Access;
pragma Inline_Always(Element);
function Index(Set : in Set_Access; Node : in Cursor) return Int;
pragma Inline_Always(Index);
procedure Next(Node : in out Cursor);
pragma Inline_Always(Next);

function Step return Int;
pragma Inline_Always(Step);
function "+" (Left : in Cursor; Right : in Int) return Cursor;
pragma Inline_Always("+");
function "-" (Left : in Cursor; Right : in Int) return Cursor;
pragma Inline_Always("-");
private
type Node_Type is limited record
Canonical : Cursor;
Successor : Cursor;
Size : Int;
Value : aliased Element_Type;
end record;
type Set_Type is array (0 .. Size - 1) of aliased Node_Type;

package Node_Address is new System.Address_To_Access_Conversions(Node_Type);

Node_Size : Storage_Offset := Node_Type'Object_Size
/ System.Storage_Elements.Storage_Element'Size;
end Eager_Disjointsets;

package body Eager_Disjointsets is
function To_Cursor(A : System.Address) return Cursor is
begin
return Cursor(Node_Address.To_Pointer(A));
end To_Cursor;
pragma Inline_Always(To_Cursor);

function To_Address(C : Cursor) return System.Address is
begin
return Node_Address.To_Address(Node_Address.Object_Pointe r(C));
end To_Address;
pragma Inline_Always(To_Address);

procedure Makeset(
Node : in Cursor
) is begin
Node.Canonical := Node;
Node.Size := 1;
end Makeset;

function Find(
Node : in Cursor
) return Cursor is begin
return Cursor(Node.Canonical);
end Find;

procedure Swap_Cursors is new Swap(Cursor);

procedure Link(
Left, Right : in out Cursor) is
begin
if Left.Size <= Right.Size then
Swap_Cursors(Cursor(Left), Cursor(Right));
end if;

declare
Node : Cursor := Right;
begin
Link_Loop : loop
Node.Canonical := Left;
Node := Cursor(Node.Successor);
exit Link_Loop when Node = Right;
end loop Link_Loop;
end;

Add(Left.Size, Right.Size);
Swap_Cursors(Cursor(Left.Successor), Cursor(Right.Successor));
end Link;

function Succ(
Node : in Cursor
) return Cursor is begin
return Cursor(Node.Successor);
end Succ;
procedure Set_Succ(
Node, Succ : in Cursor
) is begin
Node.Successor := Succ;
end Set_Succ;

function First(
Set : in Set_Access
) return Cursor is begin
return Set(Set_Type'First)'Access;
end First;

function Element(
Node : in Cursor
) return Element_Access is begin
return Node.Value'Access;
end Element;

function Index(
Set : in Set_Access;
Node : in Cursor
) return Int is begin
return Int((To_Address(Node) - To_Address(First(Set))) / Node_Size);
end Index;

procedure Next(
Node : in out Cursor
) is begin
Node := To_Cursor(To_Address(Node) + Node_Size);
end Next;

function Step return Int is
begin
return Int(Node_Size);
end Step;

function "+" (Left : in Cursor; Right : in Int) return Cursor is
begin
return To_Cursor(To_Address(Left) + Storage_Offset(Right));
end "+";

function "-" (Left : in Cursor; Right : in Int) return Cursor is
begin
return To_Cursor(To_Address(Left) - Storage_Offset(Right));
end "-";
end Eager_Disjointsets;



Would appreciate any advice on how this could be implemented in a more Ada friendly way, or any general performance tips.



Cheers,
Keean.
Reply With Quote
  #5 (permalink)  
Old 06-29-2012, 10:48 AM
Simon Wright
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

Keean Schupke <keean.schupke@googlemail.com> writes:

> I am developing a Monte-Carlo simulation engine, and I have
> implementations in C++ and Ada. All performance figures come from the
> same reference hardware. Both are built using GCC, and similar
> optimisation is applied to each (same optimisation level, cpu arch,
> inlining etc).
>
> My C++ implementation achieves 40,000 simulations per second.
>
> My Ada port initially achieves 32,000 iterations per second.
>
> I refactored the Ada port to use System.Address to access the main
> data structure (as in the 'C++' version benchmarking showed a clear
> performance benefit to *ptr_x compared to array[x])
>
> The Ada port using System.Address achieves 40,000 simulations per
> second - so this is looking good.


Did you try -gnatp (suppress all checks)? (only to be used when you're
sure the code is correct!)

You could also use pragma Suppress at the hot spots.
Reply With Quote
  #6 (permalink)  
Old 06-29-2012, 12:05 PM
Dmitry A. Kazakov
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Fri, 29 Jun 2012 03:01:30 -0700 (PDT), Keean Schupke wrote:

> On Friday, 29 June 2012 10:34:19 UTC+1, Dmitry A. Kazakov wrote:
>> On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:
>>
>>> Anyone have any ideas why profile guided compilation performs so poorly
>>> with Ada compared to C++?

>>
>> Probably because you programmed it using low level C-ish stuff like
>> addresses instead of accessing data directly. In general, if you write a C
>> program in Ada (or any other language), you cannot expect it do better than
>> in C. To beat C, the precondition is to design it in Ada way.


> In general I find the choice of algorithm dominates performance. In this
> case most of the performance is dominated by a disjoint-set (union-find)
> algorithm operating on an array of data. The data nominally represents a
> 2D space (although I am using a 1D array to avoid the cost of
> multiplication on every access).


That still does not justify machine addresses or pointers to limited
targets.

I would not be so sure that access to 2D array is slower when done by the
compiler. That depends on too many factors.

I also do not know what effect has profiling on inlining and optimizations
of generic instances in presence of numerous very short subprograms. I
guess that should severely distort the picture, maybe beyond recognition. I
must admit it, I never used gcc profiling any seriously, instead, I always
did direct time measures.

If performance is OK without profiling, why do you care?

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
Reply With Quote
  #7 (permalink)  
Old 06-29-2012, 12:26 PM
stefan-lucks@see-the.signature
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Fri, 29 Jun 2012, Keean Schupke wrote:

> On Friday, 29 June 2012 11:01:30 UTC+1, Keean Schupke wrote:


> > Maintaining the address of the current cell is faster than using
> > indexed addressing (mainly because of the extra parameter that needs
> > to be passed around increasing the pressure on register allocation).


I guess you did some benchmarking, regarding that issue?

> > I would be interested to understand how an algorithm can be
> > implemented in a more Ada way?


Faster or slower, you are extensively using access types (Element_Access
and Set_Access). This is something Ada programmers usually try to avoid,
except when there is an urgent need for some dynamic memory management
(e.g., implementing one's own Container library).

> generic
> type Element_Type is limited private;
> type Element_Access is not null access all Element_Type;
> type Set_Index is range <>;
> package Eager_Disjointsets is
> type Set_Type is limited private;
> type Set_Access is not null access all Set_Type;


What, exactly, do you use Element_Access for? And why, exactly, do you
bother to define Set_Access?

> function Find(
> Set : in Set_Access;
> Index : in Set_Index
> ) return Set_Index;
> pragma Inline_Always(Find);


Why don't you just write the following? (Same for your other operations.)

function find
(Set : in Set_Type;
Index : in Set_Index) return Set_Index;

[...]

> private
> type Node_Type is limited record
> Canonical : Set_Index;
> Successor : Set_Index;
> Size : Natural;
> Value : aliased Element_Type;
> end record;
> type Set_Type is array (Set_Index) of Node_Type;
> end Eager_Disjointsets;


Why, exactly, do you make Value aliased? Is this related to your "faster
than indexed addressing" point above?

This may be hard for the compiler to properly optimize: An aliased
component Element_Type in a record ... and then there is an array of such
records, and all is inside a generic package.

If you really need Value to be aliased (do you?), then you can try out
rewriting Set_Type:

type Set_Index_Array is array(Set_Index) of Set_Index;
type Size_Array is array(Set_Index) of Natural;
type Value_Array is array(Set_Index) of aliased Element_Type;

type Set_Type is record
Canonical, Successor: Set_Index_Array;
Size: Size_Array;
Value: Value_Array;
end record;

I have no idea if the above modification helps to speed up the
computation, or not. It may backfire, I don't know.

> package body Eager_Disjointsets is
> function Find(
> Set : in Set_Access;
> Index : in Set_Index
> ) return Set_Index is begin
> return Set(Index).Canonical;
> end Find;


Now, here you would have to write "return Set.Canonical(Index);"

> -- (C)2011 Keean Schupke, all rights reserved.
> generic
> Size : Int;
> type Element_Type is limited private;
> type Element_Access is not null access all Element_Type;
> package Eager_Disjointsets is
> type Set_Type is limited private;
> type Set_Access is not null access all Set_Type;
> type Node_Type is limited private;
> type Cursor is access all Node_Type;


Is Int another name for Integer?

I understand that for performance reason you introduced Cursor as access
all Node_Type and later Set_Array as aliased array of Node_Type.

But again, what do you need Element_Access and Set_Access for?

> package Node_Address is new
> System.Address_To_Access_Conversions(Node_Type);


Probably, the compiler finds that out by itself -- but you may want to
make it explicit that Node_Size doesn't change:

> Node_Size : constant Storage_Offset := Node_Type'Object_Size
> / System.Storage_Elements.Storage_Element'Size;


Also, if Node_Size isn't a power of 2, it may be difficult for the
compiler to decide on the alignment of objects of type Node_Type -- with
possible effects on caching. On one hand, it may be good to pack the
objects closely together, to make as many fit into a single cache line as
possible. On the other hand, it is bad if a part of a certain node is in
one cache line, another part in another cache line.

> function Find(
> Node : in Cursor
> ) return Cursor is begin
> return Cursor(Node.Canonical);
> end Find;


Not related to performance, but why don't you just write
"return Node.Canonical;" as Node.Canonical is of type Cursor, anyway?


--
---- Stefan.Lucks (at) uni-weimar.de, University of Weimar, Germany ----
<http://www.uni-weimar.de/cms/medien/mediensicherheit/home.html>
------ I love the taste of Cryptanalysis in the morning! ------

Reply With Quote
  #8 (permalink)  
Old 06-29-2012, 12:39 PM
gautier_niouzes@hotmail.com
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

Hello,

Did you try with integer array indices instead of Access or System.Address ?
It is possible that you make some counter-productive manual optimization.

If you give some reference about the algorithm I am tempted to give an help, it seems a very exciting area to learn...

Cheers
_________________________
Gautier's Ada programming
http://gautiersblog.blogspot.com/search/label/Ada
NB: follow the above link for a valid e-mail address
Reply With Quote
  #9 (permalink)  
Old 06-29-2012, 12:52 PM
Keean Schupke
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Friday, 29 June 2012 13:39:47 UTC+1, (unknown) wrote:
> Hello,
>
> Did you try with integer array indices instead of Access or System.Address ?
> It is possible that you make some counter-productive manual optimization.
>
> If you give some reference about the algorithm I am tempted to give an help, it seems a very exciting area to learn...
>
> Cheers
> _________________________
> Gautier's Ada programming
> http://gautiersblog.blogspot.com/search/label/Ada
> NB: follow the above link for a valid e-mail address


First version I posted uses Integer indexes.

Cheers,
Keean.
Reply With Quote
  #10 (permalink)  
Old 06-29-2012, 02:14 PM
gautier_niouzes@hotmail.com
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

> First version I posted uses Integer indexes.

OK... I meant something more radical (was actually discussed in another thread): getting rid of all pointers. I.e. (I borrow from the thread in question ;-) ):
function find
(Set : in Set_Type;
Index : in Set_Index) return Set_Index;
GNAT is smart enough to select the pass-by-reference method.
In a similar experiment (a BZip2 decompressor) the pointer-free version was the fastest. Of course, no guarantee it will happen in your case... But it is worth a try.
G.
Reply With Quote
  #11 (permalink)  
Old 06-29-2012, 03:05 PM
gautier_niouzes@hotmail.com
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

More specifically, this:

--8<------8<------8<----
-- (C)2011 Keean Schupke, all rights reserved.
generic
type Element_Type is private;
-- Element_Type is a small object; could be also eventually an index
-- for an array of larger objects, or an access type to a larger object
type Set_Index is range <>;
package Eager_Disjointsets is
type Set_Type is limited private;
function Find(
Set : in Set_Type;
Index : in Set_Index
) return Set_Index;
pragma Inline_Always(Find);

procedure Makeset(
Set : in out Set_Type;
Index : in Set_Index
);
pragma Inline_Always(Makeset);

procedure Link(
Set : in out Set_Type;
Left, Right : in out Set_Index
);
pragma Inline_Always(Link);

function Get(
Set : in Set_Type;
Index : in Set_Index
) return Element_Type;
pragma Inline_Always(Get);

function Next(
Set : in Set_Type;
Index : in Set_Index
) return Set_Index;
pragma Inline_Always(Next);

procedure Set_Next(
Set : in out Set_Type;
Index : in Set_Index;
Next : in Set_Index
);
pragma Inline_Always(Set_Next);

private
type Node_Type is limited record
Canonical : Set_Index;
Successor : Set_Index;
Size : Natural;
Value : aliased Element_Type;
end record;
type Set_Type is array (Set_Index) of Node_Type;
end Eager_Disjointsets;

package body Eager_Disjointsets is

-- Ad-hoc added by GdM for compiling the package
generic
type Element is private;
procedure Swap(a, b: in out Element);
pragma Inline(Swap);
procedure Swap(a, b: in out Element) is
c: Element;
begin
c:= a;
a:= b;
b:= c;
end Swap;

-- Ad-hoc added by GdM for compiling the package
procedure Add(to: in out Natural; what: Natural) is
pragma Inline(Add);
begin
to:= to + what;
end Add;

procedure Makeset(
Set : in out Set_Type;
Index : in Set_Index
) is
begin
Set(Index).Canonical := Index;
Set(Index).Size := 1;
end Makeset;

function Find(
Set : in Set_Type;
Index : in Set_Index
) return Set_Index is begin
return Set(Index).Canonical;
end Find;

procedure Swap_Indexes is new Swap(Set_Index);

procedure Link(
Set : in out Set_Type;
Left, Right : in out Set_Index) is
begin
if Set(Left).Size <= Set(Right).Size then
Swap_Indexes(Left, Right);
end if;

declare
Index : Set_Index := Right;
begin
Link_Loop : loop
Set(Index).Canonical := Left;
Index := Set(Index).Successor;
exit Link_Loop when Index = Right;
end loop Link_Loop;
end;

Add(Set(Left).Size, Set(Right).Size);

Swap_Indexes(Set(Left).Successor, Set(Right).Successor);
end Link;

function Get(
Set : in Set_Type;
Index : in Set_Index
) return Element_Type is begin
return Set(Index).Value;
end Get;

function Next(
Set : in Set_Type;
Index : in Set_Index
) return Set_Index is begin
return Set(Index).Successor;
end Next;

procedure Set_Next(
Set : in out Set_Type;
Index : in Set_Index;
Next : in Set_Index
) is begin
Set(Index).Successor := Next;
end Set_Next;
end Eager_Disjointsets;
Reply With Quote
  #12 (permalink)  
Old 06-29-2012, 05:03 PM
Keean Schupke
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Friday, 29 June 2012 16:05:07 UTC+1, (unknown) wrote:
> More specifically, this:
>
> --8<------8<------8<----
> -- (C)2011 Keean Schupke, all rights reserved.
> generic
> type Element_Type is private;
> -- Element_Type is a small object; could be also eventually an index
> -- for an array of larger objects, or an access type to a larger object
> type Set_Index is range <>;
> package Eager_Disjointsets is
> type Set_Type is limited private;
> function Find(
> Set : in Set_Type;
> Index : in Set_Index
> ) return Set_Index;
> pragma Inline_Always(Find);
>
> procedure Makeset(
> Set : in out Set_Type;
> Index : in Set_Index
> );
> pragma Inline_Always(Makeset);
>
> procedure Link(
> Set : in out Set_Type;
> Left, Right : in out Set_Index
> );
> pragma Inline_Always(Link);
>
> function Get(
> Set : in Set_Type;
> Index : in Set_Index
> ) return Element_Type;
> pragma Inline_Always(Get);
>
> function Next(
> Set : in Set_Type;
> Index : in Set_Index
> ) return Set_Index;
> pragma Inline_Always(Next);
>
> procedure Set_Next(
> Set : in out Set_Type;
> Index : in Set_Index;
> Next : in Set_Index
> );
> pragma Inline_Always(Set_Next);
>
> private
> type Node_Type is limited record
> Canonical : Set_Index;
> Successor : Set_Index;
> Size : Natural;
> Value : aliased Element_Type;
> end record;
> type Set_Type is array (Set_Index) of Node_Type;
> end Eager_Disjointsets;
>
> package body Eager_Disjointsets is
>
> -- Ad-hoc added by GdM for compiling the package
> generic
> type Element is private;
> procedure Swap(a, b: in out Element);
> pragma Inline(Swap);
> procedure Swap(a, b: in out Element) is
> c: Element;
> begin
> c:= a;
> a:= b;
> b:= c;
> end Swap;
>
> -- Ad-hoc added by GdM for compiling the package
> procedure Add(to: in out Natural; what: Natural) is
> pragma Inline(Add);
> begin
> to:= to + what;
> end Add;
>
> procedure Makeset(
> Set : in out Set_Type;
> Index : in Set_Index
> ) is
> begin
> Set(Index).Canonical := Index;
> Set(Index).Size := 1;
> end Makeset;
>
> function Find(
> Set : in Set_Type;
> Index : in Set_Index
> ) return Set_Index is begin
> return Set(Index).Canonical;
> end Find;
>
> procedure Swap_Indexes is new Swap(Set_Index);
>
> procedure Link(
> Set : in out Set_Type;
> Left, Right : in out Set_Index) is
> begin
> if Set(Left).Size <= Set(Right).Size then
> Swap_Indexes(Left, Right);
> end if;
>
> declare
> Index : Set_Index := Right;
> begin
> Link_Loop : loop
> Set(Index).Canonical := Left;
> Index := Set(Index).Successor;
> exit Link_Loop when Index = Right;
> end loop Link_Loop;
> end;
>
> Add(Set(Left).Size, Set(Right).Size);
>
> Swap_Indexes(Set(Left).Successor, Set(Right).Successor);
> end Link;
>
> function Get(
> Set : in Set_Type;
> Index : in Set_Index
> ) return Element_Type is begin
> return Set(Index).Value;
> end Get;
>
> function Next(
> Set : in Set_Type;
> Index : in Set_Index
> ) return Set_Index is begin
> return Set(Index).Successor;
> end Next;
>
> procedure Set_Next(
> Set : in out Set_Type;
> Index : in Set_Index;
> Next : in Set_Index
> ) is begin
> Set(Index).Successor := Next;
> end Set_Next;
> end Eager_Disjointsets;


I will give it a go, but if you notice my implementation puts the Element_Type into the Node structure itself rather than having a pointer (Access) toa separate object. In C++ this node is 20 bytes (8 each for the canonical and successor pointers, and 4 for the subset-size). The 'Element' I am using is 12 bytes (a record of two integers and and array of 4 bytes). In my tests in C++ a single array of 32 bytes packing the payload into the node using templates (or generics in Ada) outperformed two lists, as there was one less pointer dereference. Here I think the performance is dominated by the cache behaviour, so I don't expect this change to be faster, but I will give it a go.

Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?


Cheers,
Keean.
Reply With Quote
  #13 (permalink)  
Old 07-01-2012, 09:29 AM
Georg Bauhaus
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On 29.06.12 19:03, Keean Schupke wrote:
> Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?



In the few cases I have tried -fprofile-generate and -profile-use,
there were more or less moderate improvements.

For example, a program that writes a mandelbrot set executing on
4 cores runs in (user/real) 1.3s/10.0s and in 1.2s/9.3s, resp.

Other things of interest included (perhaps nothing new here)
preventing inline for some subprograms where inlining it would
have had adverse effects. Maybe it was steering the optimizers in
the wrong direction, preventing some better uses of registers.

Reply With Quote
  #14 (permalink)  
Old 07-01-2012, 05:45 PM
Georg Bauhaus
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On 29.06.12 19:03, Keean Schupke wrote:
> Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?


Some more observations, collected with the help of the simple
two example programs appended below.

For Ada and GNAT at least, any advantages to be had from profile-guided
compilation seem to vary with optimization options and sizes of data.
The same is true about a 1D approach vs a 2D approach.

Among the various arrangements, one of the best I have got
from both GNAT GPL 2012, and from an FSF GNAT as of June 2012
uses
*) -O2 -funroll-loops -gnatp
*) the 2D approach

(If it means a anything, adding -ftree-vectorize to the -O2 set
produces only the second best Ada result, the same as the -O3 times
listed below. I had chosen -ftree-vectorize as this switch is in the
O3 set.) Trying profile guided compilation at higher optimization
levels consistently slowed the Ada program down (other Ada programs
were faster after profile guided compilation, as mentioned elsewhere).

The result at -O2 seems nice, though, in that the 2D approach is
natural, the compiler switches are the ones that the manual recommends.
and this combination produces the fastest Ada program.

In short, from worst to best 1D / 2D, Runs = 100:

Ada profile guided -> .88 / .80 (this program only!)
Ada at -O3 -> .68 / .66
C++ at -O3 -> .66
Ada at -O2 -funr.. -> .68 / .47
C++ profile guided -> .31


Some differences seem to vary with hardware, too.
On one computer I have seen a minor speedup, on another
a very small slowdown, and a different ratio of 1D / 2D.

For the C++ case I have tested a 1D approach only, not
knowing how to write 2D arrays in C++ using C style arrays.
I'm not a C++ programmer, apologies for my C++.


Ada at -O2 -funroll-loops -gnatp -mtune=native
1D: 0.680750000 27303
2D: 0.469094000 27303 <-- Best for Ada!

Ada at -O3 -gnatNp -fomit-frame-pointer -mtune=native
1D: 0.676616000 27303
2D: 0.664194000 27303

previous with profile-guided compilation
1D: 0.888206000 27303
2D: 0.806196000 27303

C++ at -O3 -mtune=native
1D: 0.681721 0

previous with profile-guided compilation
1D: 0.31165 0

Clang++
1D: 0.611522 0

The GNU C++ compiler is from the GNAT GPL 2012 distribution for Mac OS X,
Clang++ is Apple's v 3.1.


=== 8< === 8< === 8< === 8< ===

package Fringes is

pragma Pure (Fringes);

type Full_Index is mod 2**32;
subtype Num is Full_Index Range 0 .. 100_000;
subtype Index is Full_Index range 0 .. 1000;
Len : constant Full_Index := Index'Last - Index'First + 1;

type Matrix_1D is
array (Index'First .. Index'First + Len * Len - 1) of Num;
type Matrix_2 is array (Index, Index) of Num;

procedure Compute_1D (A : in out Matrix_1D);
procedure Compute_2 (A : in out Matrix_2);

end Fringes;

package body Fringes is

--
-- Each Compute procedure has A pointless loop that assigns each
-- inner field the sum of non-diagonal neighbors modulo Num'Last
--

procedure NOT_USED_Compute_1D_Slow (A : in out Matrix_1D) is
J : Full_Index;
begin
J := A'First + Len;
loop
exit when J > A'Last - Len;
for K in J + 1 .. J + Len - 1 - 1 loop
A (K) := (A(K + 1)
+ A(K - Len)
+ A(K - 1)
+ A(K + Len)) mod Num'Last;
end loop;
J := J + Len;
end loop;
end NOT_USED_Compute_1D_Slow;

procedure Compute_1D (A : in out Matrix_1D) is
begin
for K in A'First + Len + 1 .. A'Last - Len - 1 loop
case K mod Len is
when 0 | Len - 1 => null;
when others =>
A (K) := (A(K + 1)
+ A(K - Len)
+ A(K - 1)
+ A(K + Len)) mod Num'Last;
end case;
end loop;
end Compute_1D;

procedure Compute_2 (A : in out Matrix_2) is
begin
for J in A'First(1) + 1 .. A'Last(1) - 1 loop
for K in A'First(2) + 1 .. A'Last(2) - 1 loop
A (J, K) := (A(J, K + 1)
+ A(j - 1, K)
+ A(J, K - 1)
+ A(j + 1, K)) mod Num'Last;
end loop;
end loop;
end Compute_2;

end Fringes;


with Fringes;
with Ada.Command_Line, Ada.Real_Time;
procedure Test_Fringes is

use Ada.Real_Time;

Runs : constant Natural :=
Natural'Value (Ada.Command_Line.Argument(1));

Start, Stop: Time;

package Os_Or_Gnat_Stacksize_Nuisance is
use Fringes;
type P1 is access Matrix_1D;
type P2 is access Matrix_2;
M1P : constant P1 := new Matrix_1D'(others => 123);
M2p : constant P2 := new Matrix_2'(Index => (Index => 123));
M1D : Matrix_1D renames M1P.all;
M2 : Matrix_2 renames M2P.all;
end Os_Or_Gnat_Stacksize_Nuisance;
use Os_Or_Gnat_Stacksize_Nuisance;

procedure Print_Timing (Part : String; N : Fringes.Num) is separate;
use type Fringes.Full_Index;
begin
Start := Clock;
for Run in 1 .. Runs loop
Fringes.Compute_1D (M1D);
end loop;
Stop := Clock;
Print_Timing ("1D", M1D ((M1D'First + 1 * Fringes.Len) + (2)));

Start := Clock;
for Run in 1 .. Runs loop
Fringes.Compute_2 (M2);
end loop;
Stop := Clock;
Print_Timing ("2D", M2 (1, 2));
end Test_Fringes;

with Ada.Text_IO;
separate (Test_Fringes)
procedure Print_Timing (Part : String; N : Fringes.Num) is
use Ada.Text_IO;
begin
Put (Part);
Put (": ");
Put (Duration'Image (To_Duration(Stop - Start)));
Put (Fringes.Num'Image (N));
New_Line;
end print_timing;


=== 8< === 8< === 8< === 8< ===

#include <stdint.h>

namespace Fringes {

typedef uint32_t Full_Index;
#define Num_Last 100000
typedef Full_Index Num;
#define Index_Last 1000
typedef Full_Index Index;
const Full_Index Len = Index_Last + 1;

#define A_Last ((Len * Len) - 1)
typedef Num Matrix_1D[A_Last + 1];

void Compute_1D (Matrix_1D& A);
// Each Compute procedure has a pointless loop that assigns each
// inner field the sum of non-diagonal neighbors modulo Num_Last
void Compute_1D (Matrix_1D& A) {

for (Full_Index K = Len + 1; K < A_Last - Len - 1; ++K) {
switch (K % Len) {
case 0: case Len - 1: break;
default:
A [K] = (A[K + 1]
+ A[K - Len]
+ A[K - 1]
+ A[K + Len]) % Num_Last;
}
}
}
}

#include <sys/time.h>
#include <string>

class Reporter {
struct timeval Start, Stop;
public:
void logStart();
void logStop();
void Print_Timing (const std::string Part, const Fringes::Num N);
};


#include <sstream>

int main(int argc, char* argv[]) {
using namespace Fringes;

int Runs;
Matrix_1D* M1D = new Matrix_1D[Len];
Reporter history;

if (argc == 0) {
throw "usage";
} else {
std::istringstream converter(argv[1]);

if (! ((converter >> Runs) && (Runs >= 0))) {
throw "argument error?";
}
}

history.logStart();
for (int Run = 1; Run <= Runs; ++Run) {
Compute_1D (*M1D);
}
history.logStop();
history.Print_Timing ("1D", (*M1D) [(1 * Len) + (2)]);
}


#include <iostream>
#include <sys/time.h>

void Reporter::Print_Timing (const std::string Part, const Fringes::Num N) {
double difference = (Stop.tv_sec - Start.tv_sec) * 1000000
+ (Stop.tv_usec - Start.tv_usec);
std::cout << Part << ": ";
std::cout << (difference / 1000000.0) << " " << N;
std::cout << '\n';
}

void Reporter::logStart() {
gettimeofday(&this->Start, 0);
}

void Reporter::logStop() {
gettimeofday(&this->Stop, 0);
}


Reply With Quote
  #15 (permalink)  
Old 07-01-2012, 10:57 PM
Keean Schupke
Guest
 
Posts: n/a
Default Re: GNAT (GCC) Profile Guided Compilation

On Sunday, 1 July 2012 18:45:22 UTC+1, Georg Bauhaus wrote:
> On 29.06.12 19:03, Keean Schupke wrote:
> > Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?

>
> Some more observations, collected with the help of the simple
> two example programs appended below.
>
> For Ada and GNAT at least, any advantages to be had from profile-guided
> compilation seem to vary with optimization options and sizes of data.
> The same is true about a 1D approach vs a 2D approach.
>
> Among the various arrangements, one of the best I have got
> from both GNAT GPL 2012, and from an FSF GNAT as of June 2012
> uses
> *) -O2 -funroll-loops -gnatp
> *) the 2D approach
>
> (If it means a anything, adding -ftree-vectorize to the -O2 set
> produces only the second best Ada result, the same as the -O3 times
> listed below. I had chosen -ftree-vectorize as this switch is in the
> O3 set.) Trying profile guided compilation at higher optimization
> levels consistently slowed the Ada program down (other Ada programs
> were faster after profile guided compilation, as mentioned elsewhere).
>
> The result at -O2 seems nice, though, in that the 2D approach is
> natural, the compiler switches are the ones that the manual recommends.
> and this combination produces the fastest Ada program.
>
> In short, from worst to best 1D / 2D, Runs = 100:
>
> Ada profile guided -> .88 / .80 (this program only!)
> Ada at -O3 -> .68 / .66
> C++ at -O3 -> .66
> Ada at -O2 -funr.. -> .68 / .47
> C++ profile guided -> .31
>
>
> Some differences seem to vary with hardware, too.
> On one computer I have seen a minor speedup, on another
> a very small slowdown, and a different ratio of 1D / 2D.
>
> For the C++ case I have tested a 1D approach only, not
> knowing how to write 2D arrays in C++ using C style arrays.
> I'm not a C++ programmer, apologies for my C++.
>
>
> Ada at -O2 -funroll-loops -gnatp -mtune=native
> 1D: 0.680750000 27303
> 2D: 0.469094000 27303 <-- Best for Ada!
>
> Ada at -O3 -gnatNp -fomit-frame-pointer -mtune=native
> 1D: 0.676616000 27303
> 2D: 0.664194000 27303
>
> previous with profile-guided compilation
> 1D: 0.888206000 27303
> 2D: 0.806196000 27303
>
> C++ at -O3 -mtune=native
> 1D: 0.681721 0
>
> previous with profile-guided compilation
> 1D: 0.31165 0
>
> Clang++
> 1D: 0.611522 0
>
> The GNU C++ compiler is from the GNAT GPL 2012 distribution for Mac OS X,
> Clang++ is Apple's v 3.1.
>
>
> === 8< === 8< === 8< === 8< ===
>
> package Fringes is
>
> pragma Pure (Fringes);
>
> type Full_Index is mod 2**32;
> subtype Num is Full_Index Range 0 .. 100_000;
> subtype Index is Full_Index range 0 .. 1000;
> Len : constant Full_Index := Index'Last - Index'First + 1;
>
> type Matrix_1D is
> array (Index'First .. Index'First + Len * Len - 1) of Num;
> type Matrix_2 is array (Index, Index) of Num;
>
> procedure Compute_1D (A : in out Matrix_1D);
> procedure Compute_2 (A : in out Matrix_2);
>
> end Fringes;
>
> package body Fringes is
>
> --
> -- Each Compute procedure has A pointless loop that assigns each
> -- inner field the sum of non-diagonal neighbors modulo Num'Last
> --
>
> procedure NOT_USED_Compute_1D_Slow (A : in out Matrix_1D) is
> J : Full_Index;
> begin
> J := A'First + Len;
> loop
> exit when J > A'Last - Len;
> for K in J + 1 .. J + Len - 1 - 1 loop
> A (K) := (A(K + 1)
> + A(K - Len)
> + A(K - 1)
> + A(K + Len)) mod Num'Last;
> end loop;
> J := J + Len;
> end loop;
> end NOT_USED_Compute_1D_Slow;
>
> procedure Compute_1D (A : in out Matrix_1D) is
> begin
> for K in A'First + Len + 1 .. A'Last - Len - 1 loop
> case K mod Len is
> when 0 | Len - 1 => null;
> when others =>
> A (K) := (A(K + 1)
> + A(K - Len)
> + A(K - 1)
> + A(K + Len)) mod Num'Last;
> end case;
> end loop;
> end Compute_1D;
>
> procedure Compute_2 (A : in out Matrix_2) is
> begin
> for J in A'First(1) + 1 .. A'Last(1) - 1 loop
> for K in A'First(2) + 1 .. A'Last(2) - 1 loop
> A (J, K) := (A(J, K + 1)
> + A(j - 1, K)
> + A(J, K - 1)
> + A(j + 1, K)) mod Num'Last;
> end loop;
> end loop;
> end Compute_2;
>
> end Fringes;
>
>
> with Fringes;
> with Ada.Command_Line, Ada.Real_Time;
> procedure Test_Fringes is
>
> use Ada.Real_Time;
>
> Runs : constant Natural :=
> Natural'Value (Ada.Command_Line.Argument(1));
>
> Start, Stop: Time;
>
> package Os_Or_Gnat_Stacksize_Nuisance is
> use Fringes;
> type P1 is access Matrix_1D;
> type P2 is access Matrix_2;
> M1P : constant P1 := new Matrix_1D'(others => 123);
> M2p : constant P2 := new Matrix_2'(Index => (Index => 123));
> M1D : Matrix_1D renames M1P.all;
> M2 : Matrix_2 renames M2P.all;
> end Os_Or_Gnat_Stacksize_Nuisance;
> use Os_Or_Gnat_Stacksize_Nuisance;
>
> procedure Print_Timing (Part : String; N : Fringes.Num) is separate;
> use type Fringes.Full_Index;
> begin
> Start := Clock;
> for Run in 1 .. Runs loop
> Fringes.Compute_1D (M1D);
> end loop;
> Stop := Clock;
> Print_Timing ("1D", M1D ((M1D'First + 1 * Fringes.Len) + (2)));
>
> Start := Clock;
> for Run in 1 .. Runs loop
> Fringes.Compute_2 (M2);
> end loop;
> Stop := Clock;
> Print_Timing ("2D", M2 (1, 2));
> end Test_Fringes;
>
> with Ada.Text_IO;
> separate (Test_Fringes)
> procedure Print_Timing (Part : String; N : Fringes.Num) is
> use Ada.Text_IO;
> begin
> Put (Part);
> Put (": ");
> Put (Duration'Image (To_Duration(Stop - Start)));
> Put (Fringes.Num'Image (N));
> New_Line;
> end print_timing;
>
>
> === 8< === 8< === 8< === 8< ===
>
> #include <stdint.h>
>
> namespace Fringes {
>
> typedef uint32_t Full_Index;
> #define Num_Last 100000
> typedef Full_Index Num;
> #define Index_Last 1000
> typedef Full_Index Index;
> const Full_Index Len = Index_Last + 1;
>
> #define A_Last ((Len * Len) - 1)
> typedef Num Matrix_1D[A_Last + 1];
>
> void Compute_1D (Matrix_1D& A);
> // Each Compute procedure has a pointless loop that assigns each
> // inner field the sum of non-diagonal neighbors modulo Num_Last
> void Compute_1D (Matrix_1D& A) {
>
> for (Full_Index K = Len + 1; K < A_Last - Len - 1; ++K) {
> switch (K % Len) {
> case 0: case Len - 1: break;
> default:
> A [K] = (A[K + 1]
> + A[K - Len]
> + A[K - 1]
> + A[K + Len]) % Num_Last;
> }
> }
> }
> }
>
> #include <sys/time.h>
> #include <string>
>
> class Reporter {
> struct timeval Start, Stop;
> public:
> void logStart();
> void logStop();
> void Print_Timing (const std::string Part, const Fringes::Num N);
> };
>
>
> #include <sstream>
>
> int main(int argc, char* argv[]) {
> using namespace Fringes;
>
> int Runs;
> Matrix_1D* M1D = new Matrix_1D[Len];
> Reporter history;
>
> if (argc == 0) {
> throw "usage";
> } else {
> std::istringstream converter(argv[1]);
>
> if (! ((converter >> Runs) && (Runs >= 0))) {
> throw "argument error?";
> }
> }
>
> history.logStart();
> for (int Run = 1; Run <= Runs; ++Run) {
> Compute_1D (*M1D);
> }
> history.logStop();
> history.Print_Timing ("1D", (*M1D) [(1 * Len) + (2)]);
> }
>
>
> #include <iostream>
> #include <sys/time.h>
>
> void Reporter::Print_Timing (const std::string Part, const Fringes::Num N) {
> double difference = (Stop.tv_sec - Start.tv_sec) * 1000000
> + (Stop.tv_usec - Start.tv_usec);
> std::cout << Part << ": ";
> std::cout << (difference / 1000000.0) << " " << N;
> std::cout << '\n';
> }
>
> void Reporter::logStart() {
> gettimeofday(&this->Start, 0);
> }
>
> void Reporter::logStop() {
> gettimeofday(&this->Stop, 0);
> }


On Sunday, 1 July 2012 18:45:22 UTC+1, Georg Bauhaus wrote:
> On 29.06.12 19:03, Keean Schupke wrote:
> > Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?

>
> Some more observations, collected with the help of the simple
> two example programs appended below.
>
> For Ada and GNAT at least, any advantages to be had from profile-guided
> compilation seem to vary with optimization options and sizes of data.
> The same is true about a 1D approach vs a 2D approach.
>
> Among the various arrangements, one of the best I have got
> from both GNAT GPL 2012, and from an FSF GNAT as of June 2012
> uses
> *) -O2 -funroll-loops -gnatp
> *) the 2D approach
>
> (If it means a anything, adding -ftree-vectorize to the -O2 set
> produces only the second best Ada result, the same as the -O3 times
> listed below. I had chosen -ftree-vectorize as this switch is in the
> O3 set.) Trying profile guided compilation at higher optimization
> levels consistently slowed the Ada program down (other Ada programs
> were faster after profile guided compilation, as mentioned elsewhere).
>
> The result at -O2 seems nice, though, in that the 2D approach is
> natural, the compiler switches are the ones that the manual recommends.
> and this combination produces the fastest Ada program.
>
> In short, from worst to best 1D / 2D, Runs = 100:
>
> Ada profile guided -> .88 / .80 (this program only!)
> Ada at -O3 -> .68 / .66
> C++ at -O3 -> .66
> Ada at -O2 -funr.. -> .68 / .47
> C++ profile guided -> .31
>
>
> Some differences seem to vary with hardware, too.
> On one computer I have seen a minor speedup, on another
> a very small slowdown, and a different ratio of 1D / 2D.
>
> For the C++ case I have tested a 1D approach only, not
> knowing how to write 2D arrays in C++ using C style arrays.
> I'm not a C++ programmer, apologies for my C++.
>
>
> Ada at -O2 -funroll-loops -gnatp -mtune=native
> 1D: 0.680750000 27303
> 2D: 0.469094000 27303 <-- Best for Ada!
>
> Ada at -O3 -gnatNp -fomit-frame-pointer -mtune=native
> 1D: 0.676616000 27303
> 2D: 0.664194000 27303
>
> previous with profile-guided compilation
> 1D: 0.888206000 27303
> 2D: 0.806196000 27303
>
> C++ at -O3 -mtune=native
> 1D: 0.681721 0
>
> previous with profile-guided compilation
> 1D: 0.31165 0
>
> Clang++
> 1D: 0.611522 0
>
> The GNU C++ compiler is from the GNAT GPL 2012 distribution for Mac OS X,
> Clang++ is Apple's v 3.1.
>
>
> === 8< === 8< === 8< === 8< ===
>
> package Fringes is
>
> pragma Pure (Fringes);
>
> type Full_Index is mod 2**32;
> subtype Num is Full_Index Range 0 .. 100_000;
> subtype Index is Full_Index range 0 .. 1000;
> Len : constant Full_Index := Index'Last - Index'First + 1;
>
> type Matrix_1D is
> array (Index'First .. Index'First + Len * Len - 1) of Num;
> type Matrix_2 is array (Index, Index) of Num;
>
> procedure Compute_1D (A : in out Matrix_1D);
> procedure Compute_2 (A : in out Matrix_2);
>
> end Fringes;
>
> package body Fringes is
>
> --
> -- Each Compute procedure has A pointless loop that assigns each
> -- inner field the sum of non-diagonal neighbors modulo Num'Last
> --
>
> procedure NOT_USED_Compute_1D_Slow (A : in out Matrix_1D) is
> J : Full_Index;
> begin
> J := A'First + Len;
> loop
> exit when J > A'Last - Len;
> for K in J + 1 .. J + Len - 1 - 1 loop
> A (K) := (A(K + 1)
> + A(K - Len)
> + A(K - 1)
> + A(K + Len)) mod Num'Last;
> end loop;
> J := J + Len;
> end loop;
> end NOT_USED_Compute_1D_Slow;
>
> procedure Compute_1D (A : in out Matrix_1D) is
> begin
> for K in A'First + Len + 1 .. A'Last - Len - 1 loop
> case K mod Len is
> when 0 | Len - 1 => null;
> when others =>
> A (K) := (A(K + 1)
> + A(K - Len)
> + A(K - 1)
> + A(K + Len)) mod Num'Last;
> end case;
> end loop;
> end Compute_1D;
>
> procedure Compute_2 (A : in out Matrix_2) is
> begin
> for J in A'First(1) + 1 .. A'Last(1) - 1 loop
> for K in A'First(2) + 1 .. A'Last(2) - 1 loop
> A (J, K) := (A(J, K + 1)
> + A(j - 1, K)
> + A(J, K - 1)
> + A(j + 1, K)) mod Num'Last;
> end loop;
> end loop;
> end Compute_2;
>
> end Fringes;
>
>
> with Fringes;
> with Ada.Command_Line, Ada.Real_Time;
> procedure Test_Fringes is
>
> use Ada.Real_Time;
>
> Runs : constant Natural :=
> Natural'Value (Ada.Command_Line.Argument(1));
>
> Start, Stop: Time;
>
> package Os_Or_Gnat_Stacksize_Nuisance is
> use Fringes;
> type P1 is access Matrix_1D;
> type P2 is access Matrix_2;
> M1P : constant P1 := new Matrix_1D'(others => 123);
> M2p : constant P2 := new Matrix_2'(Index => (Index => 123));
> M1D : Matrix_1D renames M1P.all;
> M2 : Matrix_2 renames M2P.all;
> end Os_Or_Gnat_Stacksize_Nuisance;
> use Os_Or_Gnat_Stacksize_Nuisance;
>
> procedure Print_Timing (Part : String; N : Fringes.Num) is separate;
> use type Fringes.Full_Index;
> begin
> Start := Clock;
> for Run in 1 .. Runs loop
> Fringes.Compute_1D (M1D);
> end loop;
> Stop := Clock;
> Print_Timing ("1D", M1D ((M1D'First + 1 * Fringes.Len) + (2)));
>
> Start := Clock;
> for Run in 1 .. Runs loop
> Fringes.Compute_2 (M2);
> end loop;
> Stop := Clock;
> Print_Timing ("2D", M2 (1, 2));
> end Test_Fringes;
>
> with Ada.Text_IO;
> separate (Test_Fringes)
> procedure Print_Timing (Part : String; N : Fringes.Num) is
> use Ada.Text_IO;
> begin
> Put (Part);
> Put (": ");
> Put (Duration'Image (To_Duration(Stop - Start)));
> Put (Fringes.Num'Image (N));
> New_Line;
> end print_timing;
>
>
> === 8< === 8< === 8< === 8< ===
>
> #include <stdint.h>
>
> namespace Fringes {
>
> typedef uint32_t Full_Index;
> #define Num_Last 100000
> typedef Full_Index Num;
> #define Index_Last 1000
> typedef Full_Index Index;
> const Full_Index Len = Index_Last + 1;
>
> #define A_Last ((Len * Len) - 1)
> typedef Num Matrix_1D[A_Last + 1];
>
> void Compute_1D (Matrix_1D& A);
> // Each Compute procedure has a pointless loop that assigns each
> // inner field the sum of non-diagonal neighbors modulo Num_Last
> void Compute_1D (Matrix_1D& A) {
>
> for (Full_Index K = Len + 1; K < A_Last - Len - 1; ++K) {
> switch (K % Len) {
> case 0: case Len - 1: break;
> default:
> A [K] = (A[K + 1]
> + A[K - Len]
> + A[K - 1]
> + A[K + Len]) % Num_Last;
> }
> }
> }
> }
>
> #include <sys/time.h>
> #include <string>
>
> class Reporter {
> struct timeval Start, Stop;
> public:
> void logStart();
> void logStop();
> void Print_Timing (const std::string Part, const Fringes::Num N);
> };
>
>
> #include <sstream>
>
> int main(int argc, char* argv[]) {
> using namespace Fringes;
>
> int Runs;
> Matrix_1D* M1D = new Matrix_1D[Len];
> Reporter history;
>
> if (argc == 0) {
> throw "usage";
> } else {
> std::istringstream converter(argv[1]);
>
> if (! ((converter >> Runs) && (Runs >= 0))) {
> throw "argument error?";
> }
> }
>
> history.logStart();
> for (int Run = 1; Run <= Runs; ++Run) {
> Compute_1D (*M1D);
> }
> history.logStop();
> history.Print_Timing ("1D", (*M1D) [(1 * Len) + (2)]);
> }
>
>
> #include <iostream>
> #include <sys/time.h>
>
> void Reporter::Print_Timing (const std::string Part, const Fringes::Num N) {
> double difference = (Stop.tv_sec - Start.tv_sec) * 1000000
> + (Stop.tv_usec - Start.tv_usec);
> std::cout << Part << ": ";
> std::cout << (difference / 1000000.0) << " " << N;
> std::cout << '\n';
> }
>
> void Reporter::logStart() {
> gettimeofday(&this->Start, 0);
> }
>
> void Reporter::logStop() {
> gettimeofday(&this->Stop, 0);
> }



Interesting stuff, I need a bit more time to read and digest, but I did notice one thing that I think is worth pointing out.

The real benefit (and performance gains) from profile guided compilation come from correcting branch prediction. As such the gains will be most apparent when there is an 'if' statement in the inner loop of the code. Try something where you are taking the sign of an int in the formula and have three cases <0 =0 >0.

The other point is that for any single branch there is at least a 50% chance it is already optimal, for example:

if x > 0 then
y := 1 + y;
else
y := 1 - y;
end if;

will get implemented in assembly as something like:

tst rx
ble xyz
ld #1, r0
sub r0, ry
bra zyx
..xyz
add #1, ry
..zyx

OR

tst rx
bgt xyz
add #1, ry
bra zyx
..xyz
ld #1, r0
sub r0, ry
..zyx

whether this is optimal depends on the number of times x <= 0 and on whatthe CPU's branch predictor guesses for 'ble xyz'. Most branch predictors start with the assumption that a backwards branch will be taken (looping) and a forward branch will not be taken, but then they also build statistical tables based on the address of the branch instruction to improve that guesson future iterations.

The speed difference can be large because a branch predict failure can havea high penalty - a CPU pipe stall. Modern CPUs use speculative execution (they keep the pipe filled by using branch-predict to assume whether the branch will or will not be taken) so if the branch predict guesses correctly there is no pipe stall and the CPU continues through the code at full speed.A pipe stall can cost 10s or 100s of clock cycles each time it happens.

I initially expected the dynamic branch predictor would be doing a good joband changing the order of the 'if' statements would have no effect, but benchmarking on 64bit x86 hardware showed a strong effect for changing the order of branches.

Anyway the point is, even if you have an 'if' statement in the inner loop, you may be lucky and already have it the 'faster' way round.

So you need to benchmark:

if x > 0 then
y := 1 + y;
else
y := 1 - y;
end if;

AND

if x <= 0 then
y := 1 - y;
else
y := 1 + y;
end if;

Then you can confirm that profile guided compilation gives you the faster of the two benchmark times, no matter which way round the if statement is written in the code.

Obviously for one if statement in the inner loop you can practically benchmark both alternatives and chose the fastest.

With my Monte-Carlo simulation, the inner loop logic is a complex state-machine, and it would be impractical to try all possibly combinations (due to the explosion of combinations at the speed of 2^N, just 8 branches would require 256 code change and rebuild-cycles, 16 = 65536 etc...

This is where profile-guided compilation is a valuable tool. Both the Ada code and C++ code share the same branching in the state machine, and both have the same initial performance now (approx 40k simulations per second). However the C++ compiler gains 25% from the profile guided compilation (by changing branch orders and re-ordering functions). This gain should be available to the Ada compiler too, as the same branches are getting generated in the output assembly - but its just not working as well.

My latest results are better, I now have both C++ and Ada running from the same GCC-4.7.1 compiler and my latest performance figures are:

Just so its clear, you first build with '-fprofile-generate' then you run the code to build the .gcda profile files, then you build again with '-fprofile-use' to build the 'profile-guided' version. My code runs 1,000,000 iterations between the 'generate' and the 'use' steps.


C++ first compile: 39k
C++ profile guided: 55k

Ada first compile: 41k
Ada profile guided: 46k


So gcc-4.7.1 seems to improve the results slightly for Ada, but it seems tobe still falling short of the potential improvement. I am becoming more convinced that there is nothing in the Ada code itself that is causing this, but more likely something about the Ada GCC front-end itself not using the profiling information as effectively as C++.


Cheers,
Keean.
Reply With Quote
 
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off




All times are GMT. The time now is 02:15 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.