Should the element of an associative array be numeric, one can use
operators like += or ++ to avoid duplicate lookups.

If it were a string instead, how does one concatenate another string to it?

    foo[$1, $2] = foo[$1, $2] $3

appears wasteful.

Thank you for listening,
- Anand
Anand Hariharan <mailto.anand.hariharan@gmail.com> wrote:
> Should the element of an associative array be numeric, one can use
> operators like += or ++ to avoid duplicate lookups.
>
> If it were a string instead, how does one concatenate another
> string to it?
>
> foo[$1, $2] = foo[$1, $2] $3
>
> appears wasteful.

That is how you do it.

Internally, the += for adding an int might be doing 2 lookups instead of 1, or the above may get optimised by the interpreter to 1 lookup, so it's hard to say whether that syntax produces code that's wasteful in terms of performance compared to the version that adds an int.

Also, I suspect the memory manipulation that's going on internally to allow the string produced by the array access on the left side of the = sign to become longer after the concatenation probably dwarfs any run-time impact of accessing the array on the right side of the = sign.

    Ed.

> thank you for listening,
> - Anand

Posted using www.webuse.net
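For reference, a minimal sketch of the idiom discussed above. The three-column input (two key fields and a text fragment) and the variable name `result` are made up for illustration; the element is simply re-read on the right-hand side and the new fragment appended.

```sh
# Hypothetical input: two key fields followed by a text fragment.
result=$(printf '%s\n' 'a x foo' 'a x bar' 'b y baz' 'a x qux' |
awk '
{
    # Append $3 to whatever is already stored under the ($1, $2) key.
    # An element that was never assigned reads as the empty string.
    foo[$1, $2] = foo[$1, $2] $3
}
END {
    # Look the two known keys up explicitly so the output order is fixed.
    print foo["a", "x"]
    print foo["b", "y"]
}')
printf '%s\n' "$result"
```

The subscript expression appears twice in the source, which is exactly the duplication the original question is about.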
On 22.02.2012 18:27, Ed Morton wrote:
> Anand Hariharan <mailto.anand.hariharan@gmail.com> wrote:
>
>> [...]
>> foo[$1, $2] = foo[$1, $2] $3
>>
>> appears wasteful.
>
> That is how you do it.
>
> Internally the += for adding an int might be doing 2 lookups instead of 1
> or the above may get optimised by the interpreter to 1 lookup so it's hard
> to say if that syntax produces code that's wasteful in terms of
> performance or not compared to the version that adds an int.

Hmm.. - have you any evidence for that? While such implementations cannot be ruled out per se, they are also not very likely, I'd say, since the days when 2-address machines were introduced. (Though a scalar 2-address operation will differ from the respective string operation anyway.)

Also consider that the OP's term "wasteful" can also be seen at the programming level (as opposed to the machine or interpreter level), and in fact that is how I understood the posting. It's not only wasteful in typing, but also wasteful to check for correctness, and prone to errors from duplicating code. Semantically there's a difference between "concatenate A to B and put the result in C" (resembling COBOL[*] ;-) and "append B to A". The latter intention is also easier to understand than a version with duplicated common subexpressions.

In the current case we should also assume that an optimization will not happen at the lexical level; rather, you have to create parsing structures resembling (in the given case) the expressions

    evaluate constant 1
    access field for constant 1
    evaluate constant 2
    access field for constant 2
    convert field1 to string
    convert field2 to string
    concatenate field1 and field2 to a temporary
    address array element with temporary

Those parsing subtrees will be necessary for the L-value and for the R-value part of the right-hand subexpression. Matching equivalence is wasteful for a compiler to analyse at compile time, and it's even more wasteful for an interpreter at runtime.

Supporting something like a += b for strings would be nice.

A problem I see is that this gives the impression that the primitive concatenation operator would be '+', but that isn't the case in awk.

> Also, I suspect the memory manipulation that's going on internally to
> allow the string produced by the array access on the left side of the =
> sign to become longer after the concatenation probably dwarfs any run-time
> impact of accessing the array on the right side of the = sign.

The point is that in the case of a 3-address operation you typically always waste time and space copying the objects, while in the "append" case you lose that gain only if you have implemented some too-primitive memory allocation method and the capacity prepared for the object gets constantly reallocated. (But even a reallocation can be efficient, depending on the underlying library function.)

Janis

[*] AFAIR, in COBOL, there were statements like:  multiply A by B giving C
On 2/23/2012 6:42 AM, Janis Papanagnou wrote:
> On 22.02.2012 18:27, Ed Morton wrote:
>> Internally the += for adding an int might be doing 2 lookups instead of 1
>> or the above may get optimised by the interpreter to 1 lookup so it's hard
>> to say if that syntax produces code that's wasteful in terms of
>> performance or not compared to the version that adds an int.
>
> Hmm.. - have you any evidence for that?

The point is just that you can't tell for sure from the syntax of a language what's happening under the hood.

> [...]
> Supporting something like a += b for strings would be nice.
>
> A problem I see is that this gives the impression that the primitive
> concatenation operator would be '+', but that isn't the case in awk.

Exactly, you'd need to introduce some new concatenation operator, since awk (and those of us reading awk!) uses clues like "+" to tell if it's operating on ints or strings. I agree a "+=" concatenation syntax for strings would be nice, but not worthwhile at all.

> The point is that in case of a 3-address operation you typically always
> waste time and space for copying the objects, while in the "append"
> case you lose that gain only if you have implemented some too primitive
> memory allocation method and your available prepared capacity for the
> object gets constantly reallocated in size. (But even a reallocation
> can be efficient, depending on the underlying library function.)

We've seen some tests here, e.g.:

    $ time awk '{ str = str $0 } END{ print str }' file100K > /dev/null

    real    0m0.268s
    user    0m0.140s
    sys     0m0.124s

    $ time awk '{ print $0 }' file100K > /dev/null

    real    0m0.092s
    user    0m0.093s
    sys     0m0.015s

which show that string concatenation is slower than I/O operations (which usually have the reputation of being slow), hence the usual advice that printing strings as you go is faster than concatenating them and printing the one resulting string at the END.

    Ed.
On 2/23/2012 7:33 AM, Ed Morton wrote:
> [...]
> We've seen some tests here, e.g.:
>
> $ time awk '{ str = str $0 } END{ print str }' file100K > /dev/null
>
> real    0m0.268s
> user    0m0.140s
> sys     0m0.124s
>
> $ time awk '{ print $0 }' file100K > /dev/null
>
> real    0m0.092s
> user    0m0.093s
> sys     0m0.015s

I wondered how much the final "print" in the string concatenation example above was contributing to the time:

    $ time awk '{ str = str $0 }' file100K > /dev/null

    real    0m0.242s
    user    0m0.186s
    sys     0m0.093s

Turned out to be negligible.

    Ed.
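For anyone who wants to repeat the comparison, here is a rough sketch under stated assumptions: the contents of file100K are not shown in the thread, so the script below generates its own test data (20000 short lines, a size chosen arbitrarily), and it assumes a bash-like shell where `time` is available. Absolute timings will vary by awk implementation and machine, so no numbers are claimed here.

```sh
# Hypothetical test data: 20000 short lines in a temporary file.
datafile=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "line " i }' > "$datafile"

# Variant 1: build one long string, print it once at the end.
time awk '{ str = str $0 } END { print str }' "$datafile" > /dev/null

# Variant 2: print each record as it is read.
time awk '{ print $0 }' "$datafile" > /dev/null

# Sanity check: both approaches handle the same number of characters.
concat_len=$(awk '{ str = str $0 } END { print length(str) }' "$datafile")
stream_len=$(awk '{ n += length($0) } END { print n }' "$datafile")
echo "concatenated: $concat_len  streamed: $stream_len"

rm -f "$datafile"
```

Increasing the line count makes the difference between the two variants easier to see, particularly with an awk that does not optimize `str = str $0`.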
[ Previous text deleted ... ]
To sum up, the awk language does not have the equivalent of += for string concatenation. This would have been nice, and would have made sense if awk had initially been given a visible concatenation operator such as & or @.

But it's way too late now.

Gawk does optimize

        a = a b

assignments, using realloc() to increase the length of `a' and copy in the value of `b'. This only works for scalars though, not array elements.

This optimization, at least on Linux, makes large string concatenations blazingly fast compared to the simple

    compute sum of lengths of a and b
    allocate a new buffer
    copy in both
    release a's old buffer
    assign new one to a

which is what gawk did originally.
-- 
Aharon (Arnold) Robbins    arnold AT skeeve DOT com
P.O. Box 354               Home Phone: +972 8 979-0381
Nof Ayalon                 Cell Phone: +972 50 729-7545
D.N. Shimshon 99785        ISRAEL
On 23.02.2012 14:33, Ed Morton wrote:
> On 2/23/2012 6:42 AM, Janis Papanagnou wrote:
[...]
>> Supporting something like a += b for strings would be nice.
>>
>> A problem I see is that this gives the impression that the primitive
>> concatenation operator would be '+', but that isn't the case in awk.
>
> Exactly, you'd need to introduce some new concatenation operator since awk
> (and those of us reading awk!) uses clues like "+" to tell if it's operating
> on ints or strings. I agree a "+=" concatenation syntax for strings would be
> nice, but not worthwhile at all.

Well, specifically for the non-trivial case that was presented in this thread, I disagree. In short

    A = A @ B

where A is a non-trivial expression, as shown with the arrays. In such cases it is very useful, and not only in the awk language. For numbers, A += B is trivial, and for simple string expressions there is (at least in gawk, as Aharon noted in a followup) an optimization for expressions like A = A B.

Janis

[...]
On Feb 23, 4:09 pm, arn...@skeeve.com (Aharon Robbins) wrote:
> [ Previous text deleted ... ]
>
> To sum up, the awk language does not have the equivalent of += for
> string concatenation. This would have been nice, and would have made
> sense if awk had initially been given a visible concatenation operator
> such as & or @.
>
> But it's way too late now.
>
> Gawk does optimize
>
>         a = a b
>
> assignments, using realloc() to increase the length of `a' and copy in the
> value of `b'.  This only works for scalars though, not array elements.
>
(...)

I agree that there is no need to introduce a new concatenation operator, and that it would not be worthwhile.

However, I do not see a problem in providing a new function (I propose to unimaginatively call it 'strcat') as an extension that avoids the issue of the duplicate look-up (both code-maintenance-wise and performance-wise). The function could receive the first argument, a scalar, by reference (which is why it has to be built-in), and could both modify the first argument as a side effect and return the concatenated string.

Thanks to Ed and Arnold, and special thanks to Janis, for all your posts.

sincerely,
- Anand
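As a partial stand-in in plain awk, and explicitly not the built-in proposed above, a user-defined function can at least remove the duplication from the source. Awk passes arrays (but not scalars) to user-defined functions by reference, so the sketch below takes the array and the subscript separately. The function name `append`, the sample input, and the variable `result` are all hypothetical.

```sh
# Hypothetical input: two key fields followed by a text fragment.
result=$(printf '%s\n' 'a x foo' 'a x bar' 'b y baz' |
awk '
# Arrays are passed to user-defined functions by reference, so the
# assignment below updates the array named at the call site. The
# element is still looked up twice inside the function, so this helps
# with code maintenance only, not with performance.
function append(arr, key, s) {
    return arr[key] = arr[key] s
}
{
    # foo[$1, $2] is shorthand for foo[$1 SUBSEP $2].
    append(foo, $1 SUBSEP $2, $3)
}
END {
    print foo["a", "x"]
    print foo["b", "y"]
}')
printf '%s\n' "$result"
```

The subscript expression is now written once per call, which addresses the maintenance half of the concern. The double lookup remains.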
In article <1ad17906-3bee-443e-add3-22ab1ff28bf0@j8g2000yqm.googlegroups.com>,
Anand Hariharan <mailto.anand.hariharan@gmail.com> wrote:
>On Feb 23, 4:09 pm, arn...@skeeve.com (Aharon Robbins) wrote:
>> [...]
>> Gawk does optimize
>>
>>         a = a b
>>
>> assignments, using realloc() to increase the length of `a' and copy in the
>> value of `b'.  This only works for scalars though, not array elements.
>>
>(...)
>
>I agree there isn't a need (or worthwhile) to introduce a new
>concatenation operator.
>
>However, I do not see a problem in providing a new function (I propose
>to unimaginatively call it 'strcat')

Gawk already has too many extensions. That is a problem.

>as an extension that avoids the
>issue of the duplicate look-up (both code-maintenance-wise as well as
>performance-wise).  The function could receive the first argument, a
>scalar, by reference (which is why it has to be built-in), and could
>have both the side-effect of modifying the first argument as well as
>returning the concatenated string.

If you really need this, it's best if you write a loadable built-in to do it.

Another option is to simply do all your work on a non-array variable and then assign the final value to the array element when you're done. Since gawk reference counts its strings, this remains efficient.

>Thanks to Ed and Arnold and special thanks to Janis for all your
>posts.

You're welcome.

Arnold
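The second option above might look roughly like the following sketch. It assumes the input arrives grouped by key, so that each group can be built up in a scalar and stored once; that assumption, the helper name `store`, the variable names, and the sample input are not from the thread.

```sh
# Hypothetical input, already grouped by its two key fields.
result=$(printf '%s\n' 'a x foo' 'a x bar' 'b y baz' 'b y qux' |
awk '
# Copy the finished scalar into the array element it belongs to.
function store() {
    if (have_key) foo[key] = acc
}
{
    k = $1 SUBSEP $2
    if (!have_key || k != key) {   # key changed: store the finished group
        store()
        key = k; acc = ""; have_key = 1
    }
    acc = acc $3                   # scalar append, one array store per group
}
END {
    store()                        # do not forget the last group
    print foo["a", "x"]
    print foo["b", "y"]
}')
printf '%s\n' "$result"
```

If the input is not grouped, the same key would be stored more than once and earlier fragments overwritten, so sorting the input first (or falling back to the plain `foo[k] = foo[k] $3` form) would be needed.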
|