Sunday, March 6, 2011

Reading files

In case I have any regular readers, I feel obliged to point out that this is somewhere between a rant and (we'll hope) just being aimed at somebody other than you.

In the last week or so, I think I've seen at least a dozen (okay, probably only half that) posts on StackOverflow that included loops something like the following1:


while (!somestream.eof()) {
    read_data();
    process_data();
}


Let me be clear about this: a loop like this just doesn't work. It's possible (though not particularly easy) to add enough other conditions to exit from the loop at the right time, but if you do it will always2 be those other conditions that end the loop, not the loop condition itself. The loop condition itself can't exit the loop at the right time. The problem is that "whatever_stream.eof()" only becomes true after a read fails because you've reached the end of the file. After you've read the last item from the file, "eof()" remains false. You then try to read another item, which fails, but your code executes the rest of the loop, which processes the data as if the read had succeeded. The typical result is that your code appears to read the last item from the file twice.

At least with the way the stream conditions work in C and C++, the simplest way to write a loop that works is something like this instead:



while (read succeeded)
   process data


This can be expressed in any of a number of ways. In C, one common possibility looks something like this:


while (fgets(file, buffer, buffer_len))
    process(buffer);


Another common possibility is for reading one character at a time:


while (EOF != (ch=getchar()))
    process(ch);


A slightly less common version looks something like this:


while (fread(file, buffer, items, item_size))
    process(buffer);


A rough equivalent to the first in C++ looks something like this:


while (std::getline(inputfile, somestring))
    process(somestring);


Another possibility in C++ checks the state of the stream after an extraction operator:


while (infile >> some_data)
    process(some_data);


Perhaps the cleanest method in C++ is to use a standard algorithm along with an istream_iterator to create the loop implicitly. For one example:


std::transform(std::istream_iterator(infile),
               std::istream_iterator(),
               std::back_inserter(some_collection),
               operation);


Any of these can and will work. Personally, I prefer to use an algorithm in most cases, but I'll admit that's a matter of preference, not a requirement. The other loops that test the result of a read can and will do the job perfectly well. Unfortunately, the one using `.eof()` cannot, will not and doesn't work -- pretty much ever. It can't, it won't, and it doesn't.

I've seen a few posts that basically tried to argue that with enough patches it's possible to make it sort of work, or cover up for the fact that it doesn't work. Most of these seem to miss one simple fact though: the easiest way to get from the code that doesn't work to code that does work is to throw out the original loop and start over. Trying to patch together something that even sort of works part of the time is more work than just starting over and writing the code correctly from the beginning.

Whether you can patch them together or not, let me paraphrase C.A.R. Hoare: don't make it so complex that it's not obviously incorrect. Make it so simple that it's obviously correct. If you follow one of the patterns above, correctness will be obvious to almost anybody and everybody who knows what they're looking at. Starting with a loop of the form while (!whatever.eof()) can work -- but even at best, almost nobody will really know it works, at least without a lot of extra work to be sure they've caught every possible corner case.

1 Of course, in reality, read_data() and process_data() were never presented as functions, and in most cases the loop bodies were fairly long and complex -- but let's ignore that for the moment.
2 Okay, I suppose there might be some exception to this, but if so I don't know what it is.