|
|||
|
Roland Pibinger wrote:
> On 17 Oct 2005 13:13:22 -0400, "kanze" <kanze@gabi-soft.fr> wrote:
>> And what about the possibility of ++ overflowing;
>> you'd certainly want to detect the case
>
> But who checks for int overflow??

People who write working code. If you cannot prove that it can't overflow, you check.

Most of the time, you can prove that it can't overflow. Either because you've done some plausibility checks on input, or because you've started from some limited set of values to begin with. Physical constraints also come into play: if you're counting key strokes of a single user, for example, in a single session, or seconds since some event in the program (supposing, of course, a 32 bit type for counting).

In this case, exceptionally, you might have to consider overflow; we don't know enough about the external context to eliminate the possibility. On my system, if I compile in 64 bit mode, and count on a long, I don't. Or if I link with a library which doesn't support large files, I don't. There can't be more words in a file than there are bytes. But if I am counting on a 32 bit int, and reading from a large file (which can contain 2^64 bytes), then the possibility of overflow cannot be neglected.

> And what about some basic error handling for ifstream first? I
> don't know if the following is sufficient since I (like many
> others) use iostreams only for tracing (i.e. the return value
> doesn't matter).
>
> bool words(map<string, int> &m, ifstream &fin) {
>    for(string s; fin >> s; ++m[s]) { ; }
>    return fin.eof() && (! fin.bad());
> }

The problem is that iostream doesn't give you enough information to be able to do any serious error checking. And there's no necessity of putting what it does allow in this function; the "standard" idiom is to leave checking for the end, and the responsibility for the checking to the code which created the stream to begin with. Unlike overflow, the error won't be lost.
-- James Kanze GABI Software Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34 [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
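To make the check discussed in this post concrete, here is a minimal sketch of the quoted `words` function with a guarded increment. The name `checked_words`, the choice of `std::overflow_error`, and the use of `std::istream` instead of `ifstream` are assumptions made for the example; nothing like it is proposed in the thread. Stream state is deliberately left to the caller, following the idiom described above.

```cpp
#include <cassert>
#include <istream>
#include <limits>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical variant of words(): counts whitespace-separated words and
// throws before any counter would pass INT_MAX (signed overflow is undefined).
void checked_words(std::map<std::string, int>& counts, std::istream& in)
{
    for (std::string word; in >> word; ) {
        int& n = counts[word];  // value-initialized to 0 on first sight
        if (n == std::numeric_limits<int>::max()) {
            throw std::overflow_error("count overflow for word: " + word);
        }
        ++n;
    }
    // No stream checks here: the code that opened the stream inspects it.
}

// Usage on an in-memory stream; the caller checks the stream afterwards.
inline void checked_words_example()
{
    std::istringstream in("to be or not to be");
    std::map<std::string, int> counts;
    checked_words(counts, in);
    assert(counts["to"] == 2 && counts["or"] == 1);
    assert(in.eof() && !in.bad());
}
```

The test costs one comparison per word, which is small next to the map lookup and the stream extraction it sits beside.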
|
|
||||
|
||||
|
|
|
|||
|
kanze wrote:
> Roland Pibinger wrote:
>
> > On 17 Oct 2005 13:13:22 -0400, "kanze" <kanze@gabi-soft.fr> wrote:
> >> And what about the possibility of ++ overflowing;
> >> you'd certainly want to detect the case
> >
> > But who checks for int overflow??
>
> People who write working code. If you cannot prove that it can't overflow, you check.
>
> Most of the time, you can prove that it can't overflow. Either because you've done some plausibility checks on input, or because you've started from some limited set of values to begin with. Physical constraints also come into play: if you're counting key strokes of a single user, for example, in a single session, or seconds since some event in the program (supposing, of course, a 32 bit type for counting).
>
> In this case, exceptionally, you might have to consider overflow; we don't know enough about the external context to eliminate the possibility. On my system, if I compile in 64 bit mode, and count on a long, I don't. Or if I link with a library which doesn't support large files, I don't. There can't be more words in a file than there are bytes. But if I am counting on a 32 bit int, and reading from a large file (which can contain 2^64 bytes), then the possibility of overflow cannot be neglected.

You don't have to prove that overflow cannot occur; you only have to estimate the probability of overflow being lower than the probability of a hardware failure or a meteor strike.

In this case, given a 32-bit int, you'll need a very unrealistic input file (X X X X X ...) that's at least 4Gb long, and parsing that with an istream would likely be slow enough as to make the user kill the application at the 10% mark. :-)
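One way to arrive at the 4Gb figure: in an "X X X ..." file every word occupies two bytes (the letter plus one separator), and a 32-bit signed counter overflows on increment number 2^31. The sketch below spells out that arithmetic; `min_overflow_bytes` and `int32_max` are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>
#include <limits>

// Smallest input, in bytes, that can push a single word's counter past
// max_count when every occurrence occupies bytes_per_word bytes.
constexpr std::uint64_t min_overflow_bytes(std::uint64_t max_count,
                                           std::uint64_t bytes_per_word)
{
    return (max_count + 1) * bytes_per_word;
}

constexpr std::uint64_t int32_max = std::numeric_limits<std::int32_t>::max();

// (2^31 - 1 + 1) words * 2 bytes = 2^32 bytes = 4 GiB.
static_assert(min_overflow_bytes(int32_max, 2) == (std::uint64_t{1} << 32),
              "2^31 two-byte words fill exactly 4 GiB");
```

Longer words raise the threshold proportionally, which is why the degenerate one-letter file is the worst case for a given file size.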
|
|||
|
On 2005-10-20 18:46, Peter Dimov wrote:
> kanze wrote:
:
>> In this case, exceptionally, you might have to consider overflow; we don't know enough about the external context to eliminate the possibility. On my system, if I compile in 64 bit mode, and count on a long, I don't. Or if I link with a library which doesn't support large files, I don't. There can't be more words in a file than there are bytes. But if I am counting on a 32 bit int, and reading from a large file (which can contain 2^64 bytes), then the possibility of overflow cannot be neglected.
>
> You don't have to prove that overflow cannot occur; you only have to estimate the probability of overflow being lower than the probability of a hardware failure or a meteor strike.
>
> In this case, given a 32-bit int, you'll need a very unrealistic input file (X X X X X ...) that's at least 4Gb long, and parsing that with an istream would likely be slow enough as to make the user kill the application at the 10% mark. :-)

What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).

FWIW, I'm currently working on a program which parses and processes (potentially) multi-gigabyte files with textual data, and although the processing is significantly more complex than just parsing and counting words, it takes less than two minutes per gigabyte on my machine. I certainly make sure to check for overflow on all relevant integer variables.

-- Niklas Matthies
|
|||
|
Niklas Matthies wrote:
> What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).

That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.
|
|||
|
On 2005-10-21 16:23, Peter Dimov wrote:
> Niklas Matthies wrote:
>
>> What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).
>
> That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.

It seems you didn't really read the last sentence I wrote above.

-- Niklas Matthies
|
|||
|
Peter Dimov wrote:
> kanze wrote:
> > Roland Pibinger wrote:
> > > On 17 Oct 2005 13:13:22 -0400, "kanze" <kanze@gabi-soft.fr> wrote:
> > >> And what about the possibility of ++ overflowing; you'd certainly want to detect the case
> > > But who checks for int overflow??
> > People who write working code. If you cannot prove that it can't overflow, you check.
> > Most of the time, you can prove that it can't overflow. Either because you've done some plausibility checks on input, or because you've started from some limited set of values to begin with. Physical constraints also come into play: if you're counting key strokes of a single user, for example, in a single session, or seconds since some event in the program (supposing, of course, a 32 bit type for counting).
> > In this case, exceptionally, you might have to consider overflow; we don't know enough about the external context to eliminate the possibility. On my system, if I compile in 64 bit mode, and count on a long, I don't. Or if I link with a library which doesn't support large files, I don't. There can't be more words in a file than there are bytes. But if I am counting on a 32 bit int, and reading from a large file (which can contain 2^64 bytes), then the possibility of overflow cannot be neglected.
> You don't have to prove that overflow cannot occur; you only have to estimate the probability of overflow being lower than the probability of a hardware failure or a meteor strike.

Yes and no. What you need to do is to ensure that the program can't run long enough to encounter the error. Beyond a certain length of time, the probability of hardware failure stopping execution approaches certainty, so you're OK. Similarly, if you are counting external events, there will be a lot of limits which can be taken into account -- I won't bother checking for overflow when counting keystrokes on a 32 bit int, for example.

> In this case, given a 32-bit int, you'll need a very unrealistic input file (X X X X X ...) that's at least 4Gb long, and parsing that with an istream would likely be slow enough as to make the user kill the application at the 10% mark. :-)

Unless he runs the program overnight? You'd be surprised how fast some modern machines can read 4 GB (or more).

Of course, verifying that the input file size is less than 8GB before starting would eliminate the need for overflow checking.

-- James Kanze GABI Software Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
|
|||
|
Niklas Matthies wrote:
> On 2005-10-21 16:23, Peter Dimov wrote:
> > Niklas Matthies wrote:
> >
> >> What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).
> >
> > That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.
>
> It seems you didn't really read the last sentence I wrote above.

Yes I did, I just didn't want to dispute it. You, too, didn't respond to my sentence above.

But since you insist: I ran the function on a 2Gb "X X X X ..." file and it took 13m 18s, and under normal circumstances I would have killed it at the 10% mark, so my totally uneducated guess was so good that I even surprised myself. I fully expected to be wrong on this one.
|
|||
|
Peter Dimov wrote:
> Niklas Matthies wrote:
>>What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).
> That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.

No, but it will detect the error, rather than return wrong results.

-- James Kanze mailto: james.kanze@free.fr Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 pl. Pierre Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
|
|||
|
kanze wrote:
> Peter Dimov wrote:
> > In this case, given a 32-bit int, you'll need a very unrealistic input file (X X X X X ...) that's at least 4Gb long, and parsing that with an istream would likely be slow enough as to make the user kill the application at the 10% mark. :-)
>
> Unless he runs the program overnight? You'd be surprised how fast some modern machines can read 4 GB (or more).

Even if someone is prepared to run the program overnight (actually wait half an hour), he'll still like to see a progress indicator, and our function provides none. This is a strong indication that it's not meant to operate on 4+ Gb files. The int in the interface is another giveaway. :-)

Also consider what happens when the 4Gb file contains unique words (or is binary) on a typical machine with, say, 512Mb physical memory. You'd probably need to add a check for the approximate memory used by the map in addition to checking for overflow, and perhaps refrain from using ifstream and string extraction altogether (it's possible that the 4Gb file contains no whitespace).
|
|||
|
On 2005-10-22 15:19, Peter Dimov wrote:
> Niklas Matthies wrote:
>> On 2005-10-21 16:23, Peter Dimov wrote:
>> > Niklas Matthies wrote:
>> >
>> >> What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).
>> >
>> > That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.
>>
>> It seems you didn't really read the last sentence I wrote above.
>
> Yes I did, I just didn't want to dispute it. You, too, didn't respond to my sentence above.

I didn't because I can't see how it makes sense in the context of what I wrote. I explicitly acknowledged that the program doesn't support inputs that contain too many occurrences of a particular word. But how do you expect the user to be able to tell whether his input exceeds the program's limits? The user would have to count the words in his input--which is exactly the program's job.

> But since you insist: I ran the function on a 2Gb "X X X X ..." file and it took 13m 18s, and under normal circumstances I would have killed it at the 10% mark, so my totally uneducated guess was so good that I even surprised myself. I fully expected to be wrong on this one.

But not every user will kill the application. Users who are aware that processing large files can take a bit longer will be willing to wait, in particular if the result is important to them. Now if those users use the version with no overflow checks, then (assuming wrap-around on overflow) the application will finally be done with the processing, and if overflow occurred will silently output an erroneous result, with _no_ indication to the user that the reported result is not correct (unless the result happens to be in the negative range).

Do you find that acceptable? I certainly wouldn't.

-- Niklas Matthies
|
|||
|
On 2005-10-22 13:57, kanze wrote:
:
> Of course, verifying that the input file size is less than 8GB before starting would eliminate the need for overflow checking.

Only if the input file size can be checked at all (think of named pipes and device files) and the file was opened in an exclusive mode which guarantees that it won't grow while reading it. These aren't assumptions I would make on the level of the 'words' function.

-- Niklas Matthies
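The size pre-check and its limits can be sketched as follows, assuming C++17 `std::filesystem` (which postdates this 2005 thread). `OverflowRisk` and `overflow_risk` are hypothetical names, and the bound used is the conservative "no more words than bytes" argument made earlier in the thread, not the 8GB figure. The result is only advisory: as noted above, a file that is not opened exclusively can still grow while it is being read.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <limits>
#include <system_error>

enum class OverflowRisk { none, possible };

// Hypothetical helper: reports whether the size of `p` rules out overflow
// of an int word counter. Anything that cannot be sized is treated as risky.
OverflowRisk overflow_risk(const std::filesystem::path& p)
{
    std::error_code ec;
    // Named pipes, device files, directories and missing paths land here.
    if (!std::filesystem::is_regular_file(p, ec) || ec) {
        return OverflowRisk::possible;
    }
    const std::uintmax_t size = std::filesystem::file_size(p, ec);
    if (ec) {
        return OverflowRisk::possible;
    }
    // A file of `size` bytes cannot contain more than `size` words.
    const auto limit =
        static_cast<std::uintmax_t>(std::numeric_limits<int>::max());
    return size <= limit ? OverflowRisk::none : OverflowRisk::possible;
}

inline void overflow_risk_example()
{
    // A directory is not a regular file, so no size guarantee is available.
    assert(overflow_risk(".") == OverflowRisk::possible);
}
```

A caller that gets `possible` still needs the per-increment check, so the pre-check can at best be an optimization layered on top of it.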
|
|||
|
Peter Dimov wrote:
> Niklas Matthies wrote:
>>On 2005-10-21 16:23, Peter Dimov wrote:
>>>Niklas Matthies wrote:
>>>>What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).
>>>That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.
>>It seems you didn't really read the last sentence I wrote above.
> Yes I did, I just didn't want to dispute it. You, too, didn't respond to my sentence above.
> But since you insist: I ran the function on a 2Gb "X X X X ..." file and it took 13m 18s, and under normal circumstances I would have killed it at the 10% mark, so my totally uneducated guess was so good that I even surprised myself. I fully expected to be wrong on this one.

We must have different definitions for "under normal circumstances". I regularly run jobs that last for an hour or so. With at, overnight, if I can, but from time to time directly from the command line. If I invoke a command on a couple of Gigabytes of data, and it finishes in less than an hour, I'm surprised.

-- James Kanze mailto: james.kanze@free.fr Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 pl. Pierre Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
|
|||
|
Peter Dimov wrote:
> kanze wrote:
>>Peter Dimov wrote:
>>>In this case, given a 32-bit int, you'll need a very unrealistic input file (X X X X X ...) that's at least 4Gb long, and parsing that with an istream would likely be slow enough as to make the user kill the application at the 10% mark. :-)
>>Unless he runs the program overnight? You'd be surprised how fast some modern machines can read 4 GB (or more).
> Even if someone is prepared to run the program overnight (actually wait half an hour), he'll still like to see a progress indicator,

Where? The program is running in background. It's not connected to a terminal, or a window, or anything else which would support a progress indicator.

I rather regularly run different applications overnight. Using the command at to do so -- normally, I won't even be logged in on the machine where the program is running.

Note that a certain number of programs that I write are also run as cron jobs -- they're automatically started at a specific time, say once a week, 3am Sunday morning. I could very well imagine that in some contexts, counting words could be part of such a job.

I hate programs which give Mickey Mouse like progress indicators unless specifically asked.

> and our function provides none. This is a strong indication that it's not meant to operate on 4+ Gb files. The int in the interface is another giveaway. :-)

The problem isn't whether it was meant to operate on 4+ Gb files. The problem is what it does if given such a file. Just giving a wrong answer is NOT an acceptable solution; it's worse than crashing. (In fact, I can very much imagine that an assertion failure would be an acceptable solution in many of the contexts where I work.)

> Also consider what happens when the 4Gb file contains unique words (or is binary) on a typical machine with, say, 512Mb physical memory. You'd probably need to add a check for the approximate memory used by the map in addition to checking for overflow, and perhaps refrain from using ifstream and string extraction altogether (it's possible that the 4Gb file contains no whitespace).

Obviously, I've either replaced the new_handler to abort with an error message (insufficient resources, or the like), or I'm prepared to handle bad_alloc at a higher level. Again, aborting, or simply saying that the job cannot be done, is almost always preferable to an incorrect result.

-- James Kanze mailto: james.kanze@free.fr Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 pl. Pierre Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
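The two memory-exhaustion strategies named in this post might look roughly like this. `abort_on_exhaustion` and `run_job` are hypothetical names chosen for the sketch, and the messages are placeholders; only `std::set_new_handler` and `std::bad_alloc` are standard.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <new>

// Option 1: install with std::set_new_handler(abort_on_exhaustion) at
// startup, so a failed allocation stops the program with a message.
[[noreturn]] void abort_on_exhaustion()
{
    std::cerr << "insufficient resources\n";
    std::abort();
}

// Option 2: let std::bad_alloc propagate out of the counting code and
// report at a higher level that the job could not be done.
template <typename Job>
bool run_job(Job job)
{
    try {
        job();
        return true;
    } catch (const std::bad_alloc&) {
        std::cerr << "job abandoned: insufficient memory\n";
        return false;
    }
}

inline void run_job_example()
{
    assert(run_job([] {}));                            // normal completion
    assert(!run_job([] { throw std::bad_alloc(); }));  // simulated exhaustion
}
```

Either way the caller learns that no result was produced, which is the property the post argues for.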
|
|||
|
James Kanze wrote:
> Peter Dimov wrote:
> > But since you insist: I ran the function on a 2Gb "X X X X ..." file and it took 13m 18s, and under normal circumstances I would have killed it at the 10% mark, so my totally uneducated guess was so good that I even surprised myself. I fully expected to be wrong on this one.
>
> We must have different definitions for "under normal circumstances". I regularly run jobs that last for an hour or so. With at, overnight, if I can, but from time to time directly from the command line. If I invoke a command on a couple of Gigabytes of data, and it finishes in less than an hour, I'm surprised.

You keep going back to the general, despite my efforts to return to the particular function in question. Adding integer overflow checks does not make it suitable for gigabyte inputs, even if we assume that it has been designed to be run overnight in non-interactive mode. The function simply does not work in this case. It may fail silently, or it may fail loudly, but it still fails. Using this function in an application that can operate on gigabyte inputs is a programming error.
|
|||
|
Niklas Matthies wrote:
> On 2005-10-22 15:19, Peter Dimov wrote:
> > Niklas Matthies wrote:
> >> On 2005-10-21 16:23, Peter Dimov wrote:
> >> > Niklas Matthies wrote:
> >> >
> >> >> What if someone wants to count the words in (say) his multi-gigabyte genetic-code file, and doesn't care that it takes a couple of hours or days? It's okay if the program doesn't support this, but it shouldn't silently produce incorrect results (or crash).
> >> >
> >> > That someone would probably need to use map<string, unsigned long long> to store the results, so our function is of little use and adding overflow checks will not suddenly make it the correct choice.
> >>
> >> It seems you didn't really read the last sentence I wrote above.
> >
> > Yes I did, I just didn't want to dispute it. You, too, didn't respond to my sentence above.
>
> I didn't because I can't see how it makes sense in the context of what I wrote. I explicitly acknowledged that the program doesn't support inputs that contain too many occurrences of a particular word. But how do you expect the user to be able to tell whether his input exceeds the program's limits? The user would have to count the words in his input--which is exactly the program's job.

The user is not important here. The decision to add overflow checks at the lowest level is made by the author of the function. Since the function doesn't provide progress feedback and returns its results in an int, it is reasonable to infer that this function is not meant to block for fifteen minutes or be used on inputs that can overflow an int. If the author of the higher level code expects his program to be used on large data sets, he should simply not use our function for a variety of reasons, only one being integer overflow.

> > But since you insist: I ran the function on a 2Gb "X X X X ..." file and it took 13m 18s, and under normal circumstances I would have killed it at the 10% mark, so my totally uneducated guess was so good that I even surprised myself. I fully expected to be wrong on this one.
>
> But not every user will kill the application. Users which are aware that processing large files can take a bit longer will be willing to wait, in particular if the result is important to them.

Blocking operations that provide no feedback are (arguably) even less acceptable than not checking for integer overflow. So is bringing the whole machine to a screeching halt because of paging.

> Now if those users use the version with no overflow checks, then (assuming wrap-around on overflow) the application will finally be done with the processing, and if overflow occurred will silently output an erroneous result, with _no_ indication to the user that the reported result is not correct (unless the result happens to be in the negative range).
> Do you find that acceptable? I certainly wouldn't.

It might or might not be acceptable depending on the circumstances. The point is that when it isn't, it is probably not the fault of the author of the low-level function.

It is a reasonable policy for low-level functions to only check for (or better yet, avoid) integer overflow in their intermediate results. The responsibility of calling the correct low-level function can be left to the outer layer.
|
|