#1
02-22-2012, 03:59 PM
Anand Hariharan
Concatenate to element of an associative array

Should the element of an associative array be numeric, one can use
operators like += or ++ to avoid duplicate lookups.

If it were a string instead, how does one concatenate another
string to it?

foo[$1, $2] = foo[$1, $2] $3

appears wasteful.

thank you for listening,
- Anand

#2
02-22-2012, 04:27 PM
Ed Morton
Re: Concatenate to element of an associative array

Anand Hariharan <mailto.anand.hariharan@gmail.com> wrote:

> Should the element of an associative array be numeric, one can use
> operators like += or ++ to avoid duplicate lookups.
>
> If it were a string instead, how does one do concatenate another
> string to it?
>
> foo[$1, $2] = foo[$1, $2] $3
>
> appears wasteful.


That is how you do it.

Internally, the += for adding an int might be doing 2 lookups instead of 1,
or the above may get optimised by the interpreter to 1 lookup, so it's hard
to say whether that syntax produces code that's wasteful in terms of
performance compared to the version that adds an int.

Also, I suspect the memory manipulation that's going on internally to
allow the string produced by the array access on the left side of the =
sign to become longer after the concatenation probably dwarfs any run-time
impact of accessing the array on the right side of the = sign.
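
For what it's worth, the textual duplication (though not necessarily the second lookup) can be avoided by building the subscript once in a variable. A minimal sketch, wrapped in a shell function so it can be run as-is; the name append_by_key and the output format are made up for illustration:

```sh
# append_by_key [FILE...]: collect field 3 per (field 1, field 2) pair.
# Hypothetical helper, not something from the thread.
append_by_key() {
    awk '
    {
        key = $1 SUBSEP $2        # build the composite subscript once
        foo[key] = foo[key] $3    # same assignment as before, subscript spelled once
    }
    END {
        for (key in foo) {        # traversal order is unspecified
            split(key, part, SUBSEP)
            print part[1], part[2], foo[key]
        }
    }' "$@"
}
```

For example, `printf 'a b x\na b y\n' | append_by_key` prints `a b xy`. Whether this runs any faster than writing foo[$1, $2] twice depends on the awk implementation.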

Ed.
>
> thank you for listening,
> - Anand
>



Posted using www.webuse.net
#3
02-23-2012, 11:42 AM
Janis Papanagnou
Re: Concatenate to element of an associative array

On 22.02.2012 18:27, Ed Morton wrote:
> Anand Hariharan<mailto.anand.hariharan@gmail.com> wrote:
>
>> Should the element of an associative array be numeric, one can use
>> operators like += or ++ to avoid duplicate lookups.
>>
>> If it were a string instead, how does one do concatenate another
>> string to it?
>>
>> foo[$1, $2] = foo[$1, $2] $3
>>
>> appears wasteful.

>
> That is how you do it.
>
> Internally the += for adding an int might be doing 2 lookups instead of 1
> or the above may get optimised by the interpreter to 1 lookup so it's hard
> to say if that syntax produces code that's wasteful in terms of
> performance or not compared to the version that adds an int.


Hmm.. - have you any evidence for that? While such implementations
could not be ruled out per se they are also not very likely, I'd say,
since the days when 2-address machines were introduced. (Though
a scalar 2-address operation will differ compared to the respective
string operation anyway.)

Also consider that the OP's term of being "wasteful" can also be seen
at the programming level (as opposed to the machine or interpreter
level), and in fact I've understood the posting that way. It's not only
wasteful in typing, also wasteful to confirm correct coding, and prone
to errors from duplicating code. Semantically there's a difference in
"concatenate A to B and put the result in C" (resembling COBOL[*] ;-)
and "append B to A". The latter intention is also easier to understand
than a version with duplicated common subexpressions.

In the current case we should also assume that an optimization will
not happen at the lexical level; rather, you have to create parsing
structures resembling (in the given case) the expressions

    evaluate constant 1
    access field for constant 1
    evaluate constant 2
    access field for constant 2
    convert field1 to string
    convert field2 to string
    concatenate field1 and field2 to a temporary
    address array element with temporary

Those parsing subtrees will be necessary for the L-value and for the
R-value part of the right-hand subexpression. Matching their equivalence
is wasteful for a compiler to analyse at compile time and it's even
more wasteful for an interpreter at runtime.
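
The subscript step in that list can be made visible from within awk: a comma-separated subscript is the concatenation of its parts joined with SUBSEP, so both spellings in the sketch address the same element. The function name show_subsep is hypothetical:

```sh
# show_subsep [FILE...]: show that foo[$1, $2] and foo[$1 SUBSEP $2]
# are one and the same array element. Illustration only.
show_subsep() {
    awk '
    {
        foo[$1, $2] = foo[$1, $2] $3    # subscript written twice, as in the thread
        key = $1 SUBSEP $2              # the same subscript built by hand
        verdict = (key in foo) ? "same element" : "different element"
        print verdict ": " foo[key]
    }' "$@"
}
```

So the "temporary" used to address the element is an ordinary string, and it is built once for each of the two occurrences of the subscript unless the implementation does something smarter.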

Supporting something like a += b for strings would be nice.

A problem I see is that this gives the impression that the primitive
concatenation operator would be '+', but that isn't the case in awk.

>
> Also, I suspect the memory manipulation that's going on internally to
> allow the string produced by the array access on the left side of the =
> sign to become longer after the concatenation probably dwarfs any run-time
> impact of accessing the array on the right side of the = sign.


The point is that in case of a 3-address operation you typically always
waste time and space for copying the objects, while in the "append"
case you lose that gain only if you have implemented an overly primitive
memory allocation method and the prepared capacity available for the
object gets constantly reallocated in size. (But even a reallocation
can be efficient, depending on the underlying library function.)

Janis

>
> Ed.
>>
>> thank you for listening,
>> - Anand
>>

>
>
> Posted using www.webuse.net

[*] AFAIR, in COBOL, there were statements like: MULTIPLY A BY B GIVING C
#4
02-23-2012, 12:33 PM
Ed Morton
Re: Concatenate to element of an associative array

On 2/23/2012 6:42 AM, Janis Papanagnou wrote:
> On 22.02.2012 18:27, Ed Morton wrote:
>> Anand Hariharan<mailto.anand.hariharan@gmail.com> wrote:
>>
>>> Should the element of an associative array be numeric, one can use
>>> operators like += or ++ to avoid duplicate lookups.
>>>
>>> If it were a string instead, how does one do concatenate another
>>> string to it?
>>>
>>> foo[$1, $2] = foo[$1, $2] $3
>>>
>>> appears wasteful.

>>
>> That is how you do it.
>>
>> Internally the += for adding an int might be doing 2 lookups instead of 1
>> or the above may get optimised by the interpreter to 1 lookup so it's hard
>> to say if that syntax produces code that's wasteful in terms of
>> performance or not compared to the version that adds an int.

>
> Hmm.. - have you any evidence for that?


The point is just that you can't tell for sure by the syntax of a language
what's happening under the hood.

> While such implementations
> could not be ruled out per se they are also not very likely, I'd say,
> since the days when 2-address machines have been introduced. (Though
> a scalar 2-address operation will differ compared to the respective
> string operation anyway.)
>
> Also consider that the OP's term of being "wasteful" can also be seen
> at the programming level (as opposed to the machine or interpreter
> level), and in fact I've understood the posting that way. It's not only
> wasteful in typing, also wasteful to confirm correct coding, and prone
> to errors from duplicating code. Semantically there's a difference in
> "concatenate A to B and put the result in C" (resembling COBOL[*] ;-)
> and "append B to A". The latter intention is also easier to understand
> than a version with duplicated common subexpressions.
>
> In the current case we should also assume that an optimization will
> not happen on lexicalic level, rather that you have to create parsing
> structures resembling (in the given case) the expressions
> evaluate constant 1
> access field for constant 1
> evaluate constant 2
> access field for constant 2
> convert field1 to string
> convert field2 to string
> concatenate field1 and field2 to a temporary
> address array element with temporary
> Those parsing subtrees will be necessary for the L-value and for the
> R-value part of the righthand subexpression. Matching equivalence is
> wasteful for a compiler to analyse during compile time and it's even
> more wasteful for an interpreter at runtime.
>
> Supporting something like a += b for strings would be nice.
>
> A problem I see is that this gives the impression that the primitive
> concatenation operator would be '+', but that isn't the case in awk.


Exactly, you'd need to introduce some new concatenation operator since awk (and
those of us reading awk!) uses clues like "+" to tell if it's operating on ints
or strings. I agree a "+=" concatenation syntax for strings would be nice, but
not worthwhile at all.

>>
>> Also, I suspect the memory manipulation that's going on internally to
>> allow the string produced by the array access on the left side of the =
>> sign to become longer after the concatenation probably dwarfs any run-time
>> impact of accessing the array on the right side of the = sign.

>
> The point is that in case of a 3-address operation you typically always
> waste time and space for copying the objects, while in the "append"
> case you lose that gain only if you have implemented some too primitive
> memory allocation method and your available prepared capacity for the
> object gets constantly reallocated in size. (But even a reallocation
> can be efficient, depending on the underlying library function.)


We've seen some tests here, e.g.:

$ time awk '{ str = str $0 } END{ print str }' file100K > /dev/null

real 0m0.268s
user 0m0.140s
sys 0m0.124s

$ time awk '{ print $0 }' file100K > /dev/null

real 0m0.092s
user 0m0.093s
sys 0m0.015s

that show that string concatenation is slower than I/O operations (which usually
get the reputation for being slow) hence the usual advice that printing strings
as you go is faster than concatenating them and printing the one resulting
string at the END.
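
To repeat this kind of comparison locally, something along these lines should do. The content of file100K is not given in the post, so the generated input and the helper names make_input and run_both are assumptions, and the timings will differ by machine and awk version:

```sh
# make_input N FILE: write N lines of sample text to FILE.
# Hypothetical stand-in for file100K, whose real content is unknown.
make_input() {
    awk -v n="$1" 'BEGIN { for (i = 1; i <= n; i++) print "line " i " of sample text" }' > "$2"
}

# run_both FILE: time the two variants from the post on the same input.
# Relies on the shell (or system) providing "time".
run_both() {
    time awk '{ str = str $0 } END { print str }' "$1" > /dev/null
    time awk '{ print $0 }' "$1" > /dev/null
}
```

For example, `make_input 100000 /tmp/file100K && run_both /tmp/file100K`. Note that the two programs do not produce the same output (the first drops the newlines), so this compares cost, not results.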

Ed.

>
> Janis
>
>>
>> Ed.
>>>
>>> thank you for listening,
>>> - Anand
>>>

>>
>>
>> Posted using www.webuse.net

>
>[*] AFAIR, in COBOL, there were statements like: multiply A by B given C


#5
02-23-2012, 12:57 PM
Ed Morton
Re: Concatenate to element of an associative array

On 2/23/2012 7:33 AM, Ed Morton wrote:
> On 2/23/2012 6:42 AM, Janis Papanagnou wrote:
>> On 22.02.2012 18:27, Ed Morton wrote:
>>> Anand Hariharan<mailto.anand.hariharan@gmail.com> wrote:
>>>
>>>> Should the element of an associative array be numeric, one can use
>>>> operators like += or ++ to avoid duplicate lookups.
>>>>
>>>> If it were a string instead, how does one do concatenate another
>>>> string to it?
>>>>
>>>> foo[$1, $2] = foo[$1, $2] $3
>>>>
>>>> appears wasteful.
>>>
>>> That is how you do it.
>>>
>>> Internally the += for adding an int might be doing 2 lookups instead of 1
>>> or the above may get optimised by the interpreter to 1 lookup so it's hard
>>> to say if that syntax produces code that's wasteful in terms of
>>> performance or not compared to the version that adds an int.

>>
>> Hmm.. - have you any evidence for that?

>
> The point is just that you can't tell for sure by the syntax of a language
> what's happening under the hood.
>
>> While such implementations
>> could not be ruled out per se they are also not very likely, I'd say,
>> since the days when 2-address machines have been introduced. (Though
>> a scalar 2-address operation will differ compared to the respective
>> string operation anyway.)
>>
>> Also consider that the OP's term of being "wasteful" can also be seen
>> at the programming level (as opposed to the machine or interpreter
>> level), and in fact I've understood the posting that way. It's not only
>> wasteful in typing, also wasteful to confirm correct coding, and prone
>> to errors from duplicating code. Semantically there's a difference in
>> "concatenate A to B and put the result in C" (resembling COBOL[*] ;-)
>> and "append B to A". The latter intention is also easier to understand
>> than a version with duplicated common subexpressions.
>>
>> In the current case we should also assume that an optimization will
>> not happen on lexicalic level, rather that you have to create parsing
>> structures resembling (in the given case) the expressions
>> evaluate constant 1
>> access field for constant 1
>> evaluate constant 2
>> access field for constant 2
>> convert field1 to string
>> convert field2 to string
>> concatenate field1 and field2 to a temporary
>> address array element with temporary
>> Those parsing subtrees will be necessary for the L-value and for the
>> R-value part of the righthand subexpression. Matching equivalence is
>> wasteful for a compiler to analyse during compile time and it's even
>> more wasteful for an interpreter at runtime.
>>
>> Supporting something like a += b for strings would be nice.
>>
>> A problem I see is that this gives the impression that the primitive
>> concatenation operator would be '+', but that isn't the case in awk.

>
> Exactly, you'd need to introduce some new concatenation operator since awk (and
> those of us reading awk!) uses clues like "+" to tell if it's operating on ints
> or strings. I agree a "+=" concatenation syntax for strings would be nice, but
> not worthwhile at all.
>
>>>
>>> Also, I suspect the memory manipulation that's going on internally to
>>> allow the string produced by the array access on the left side of the =
>>> sign to become longer after the concatenation probably dwarfs any run-time
>>> impact of accessing the array on the right side of the = sign.

>>
>> The point is that in case of a 3-address operation you typically always
>> waste time and space for copying the objects, while in the "append"
>> case you lose that gain only if you have implemented some too primitive
>> memory allocation method and your available prepared capacity for the
>> object gets constantly reallocated in size. (But even a reallocation
>> can be efficient, depending on the underlying library function.)

>
> We've seen some tests here, e.g.:
>
> $ time awk '{ str = str $0 } END{ print str }' file100K > /dev/null
>
> real 0m0.268s
> user 0m0.140s
> sys 0m0.124s
>
> $ time awk '{ print $0 }' file100K > /dev/null
>
> real 0m0.092s
> user 0m0.093s
> sys 0m0.015s


I wondered how much the final "print" in the string concatenation example above
was contributing to the time:

$ time awk '{ str = str $0 }' file100K > /dev/null

real 0m0.242s
user 0m0.186s
sys 0m0.093s

Turned out to be negligible.

Ed.
>
> that show that string concatenation is slower than I/O operations (which usually
> get the reputation for being slow) hence the usual advice that printing strings
> as you go is faster than concatenating them and printing the one resulting
> string at the END.
>
> Ed.
>
>>
>> Janis
>>
>>>
>>> Ed.
>>>>
>>>> thank you for listening,
>>>> - Anand
>>>>
>>>
>>>
>>> Posted using www.webuse.net

>>
>>[*] AFAIR, in COBOL, there were statements like: multiply A by B given C

>


#6
02-23-2012, 09:09 PM
Aharon Robbins
Re: Concatenate to element of an associative array

[ Previous text deleted ... ]

To sum up, the awk language does not have the equivalent of += for
string concatenation. This would have been nice, and would have made
sense if awk had initially been given a visible concatenation operator
such as & or @.

But it's way too late now.

Gawk does optimize

        a = a b

assignments, using realloc() to increase the length of `a' and copy in the
value of `b'. This only works for scalars though, not array elements.

This optimization, at least on Linux, makes large string concatenations
blazingly fast compared to the simple

        compute sum of lengths of a and b
        allocate a new buffer
        copy in both
        release a's old buffer
        assign new one to a

which is what gawk did originally.
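
A quick way to compare the two cases is to run the same concatenation once on a scalar and once on an array element, and time each on a large input. A sketch only; the function names are hypothetical and no particular timing is implied:

```sh
# concat_scalar [FILE...]: append every line to a scalar variable,
# the form the post says gawk handles with realloc().
concat_scalar() {
    awk '{ a = a $0 } END { print length(a) }' "$@"
}

# concat_element [FILE...]: the same work on an array element,
# which according to the post does not get that optimization.
concat_element() {
    awk '{ arr["k"] = arr["k"] $0 } END { print length(arr["k"]) }' "$@"
}
```

Both print the length of the accumulated string, so the results can be checked against each other before looking at the run times.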
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
#7
02-23-2012, 11:33 PM
Janis Papanagnou
Re: Concatenate to element of an associative array

On 23.02.2012 14:33, Ed Morton wrote:
> On 2/23/2012 6:42 AM, Janis Papanagnou wrote:

[...]
>> Supporting something like a += b for strings would be nice.
>>
>> A problem I see is that this gives the impression that the primitive
>> concatenation operator would be '+', but that isn't the case in awk.

>
> Exactly, you'd need to introduce some new concatenation operator since awk
> (and those of us reading awk!) uses clues like "+" to tell if it's operating
> on ints or strings. I agree a "+=" concatenation syntax for strings would be
> nice, but not worthwhile at all.


Well, specifically for the non-trivial case that had been presented in
this thread I disagree. In short A = A @ B where A is a non-trivial
expression as shown with the arrays. In such cases it is very useful,
and not only in the awk language. For numbers the A += B is trivial
and for simple string expressions there's - at least in gawk, as Aharon
noted in a followup - an optimization in expressions like A = A B .

Janis

[...]
#8
02-27-2012, 06:26 PM
Anand Hariharan
Re: Concatenate to element of an associative array

On Feb 23, 4:09 pm, arn...@skeeve.com (Aharon Robbins) wrote:
> [ Previous text deleted ... ]
>
> To sum up, the awk language does not have the equivalent of += for
> string concatenation. This would have been nice, and would have made
> sense if awk had initially been given a visible concatenation operator
> such as & or @.
>
> But it's way too late now.
>
> Gawk does optimize
>
>         a = a b
>
> assignments, using realloc() to increase the length of `a' and copy in the
> value of `b'. This only works for scalars though, not array elements.
>

(...)

I agree there is no need (nor would it be worthwhile) to introduce a new
concatenation operator.

However, I do not see a problem in providing a new function (I propose
to unimaginatively call it 'strcat') as an extension that avoids the
issue of the duplicate look-up (both code-maintenance-wise as well as
performance-wise). The function could receive the first argument, a
scalar, by reference (which is why it has to be built-in), and could
have both the side-effect of modifying the first argument as well as
returning the concatenated string.
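
Short of a built-in, a user-defined function can at least remove the duplicated subscript from the call site, since awk passes arrays by reference. This is only a user-level stand-in that takes the array and subscript separately, not the by-reference scalar built-in described above, and it does not by itself save a lookup:

```sh
# strcat_demo [FILE...]: append field 3 to foo[$1, $2] through a helper.
# The awk function strcat here is a hypothetical user-defined function.
strcat_demo() {
    awk '
    # Arrays are passed by reference, so the element is updated in place.
    function strcat(arr, key, s) {
        arr[key] = arr[key] s
        return arr[key]
    }
    { strcat(foo, $1 SUBSEP $2, $3) }
    END {
        for (k in foo) {
            split(k, part, SUBSEP)
            print part[1], part[2], foo[k]
        }
    }' "$@"
}
```

With this, `printf 'a b x\na b y\n' | strcat_demo` prints `a b xy`, and the subscript expression appears only once per call.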

Thanks to Ed and Arnold and special thanks to Janis for all your
posts.

sincerely,
- Anand


#9
02-27-2012, 08:14 PM
Aharon Robbins
Re: Concatenate to element of an associative array

In article <1ad17906-3bee-443e-add3-22ab1ff28bf0@j8g2000yqm.googlegroups.com>,
Anand Hariharan <mailto.anand.hariharan@gmail.com> wrote:
>On Feb 23, 4:09 pm, arn...@skeeve.com (Aharon Robbins) wrote:
>> [ Previous text deleted ... ]
>>
>> To sum up, the awk language does not have the equivalent of += for
>> string concatenation. This would have been nice, and would have made
>> sense if awk had initially been given a visible concatenation operator
>> such as & or @.
>>
>> But it's way too late now.
>>
>> Gawk does optimize
>>
>>         a = a b
>>
>> assignments, using realloc() to increase the length of `a' and copy in the
>> value of `b'. This only works for scalars though, not array elements.
>>

>(...)
>
>I agree there isn't a need (or worthwhile) to introduce a new
>concatenation operator.
>
>However, I do not see a problem in providing a new function (I propose
>to unimaginatively call it 'strcat')


Gawk already has too many extensions. That is a problem.

>as an extension that avoids the
>issue of the duplicate look-up (both code-maintenance-wise as well as
>performance-wise). The function could receive the first argument, a
>scalar, by reference (which is why it has to be built-in), and could
>have both the side-effect of modifying the first argument as well as
>returning the concatenated string.


If you really need this, it's best if you write a loadable built-in
to do it. Another option is to simply do all your work on a non-array
variable and then assign the final value to the array element when you're
done. Since gawk reference counts its strings, this remains efficient.
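
A rough sketch of the second option, assuming the input is (mostly) grouped by key so that runs of equal keys can be collected in a scalar first; append_grouped and store are names invented here:

```sh
# append_grouped [FILE...]: collect field 3 per (field 1, field 2) pair,
# appending to a scalar and touching the array once per run of equal keys.
append_grouped() {
    awk '
    function store() {
        if (have_key) foo[key] = foo[key] acc   # one array store per run
        acc = ""
    }
    {
        k = $1 SUBSEP $2
        if (!have_key || k != key) { store(); key = k; have_key = 1 }
        acc = acc $3                            # scalar append
    }
    END {
        store()
        for (k in foo) {
            split(k, part, SUBSEP)
            print part[1], part[2], foo[k]
        }
    }' "$@"
}
```

If a key shows up again later in the input the result is still correct, because store() appends to whatever the element already holds; the saving simply gets smaller the less grouped the input is.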

>Thanks to Ed and Arnold and special thanks to Janis for all your
>posts.


You're welcome.

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
All times are GMT.

