Coder's Central: 2011

Monday, September 19, 2011

Still more about version control

This is (more or less) a counterpoint to my last post. For those who found it forgettable (or unreadable), the primary point of my last post was that version control systems combine too many separate (and mostly unrelated) components into a single package. Instead of being built as a single package, a number of components should be specified, to give users the ability to combine the pieces as necessary to fit their needs.

This post is mostly about why that hasn't happened (not only with version control, but with many similar tools as well), and why, even if it did, the result would probably be pointless and useless. The short summary of the reason is that nobody has designed a usable way to specify/define interfaces between components like this that's independent of both language and underlying operating system.

There are certainly a number of candidates, including CORBA, COM and CLI, Java Beans, Linux D-Bus, etc. Each of these, however, has problems, at least from my perspective.

CORBA is almost a dead issue -- it's been around for decades now, and never achieved enough market share to really matter to most people. It also imposes a fair amount of overhead in many cases, ORBs are often fairly expensive, etc. While it's sometimes used for really big systems, for something like version control it's probably overkill even at best.

COM might be a (halfway) reasonable answer on Windows, but it's really only native and (even reasonably) well supported on Windows. On some other systems there's some support, but it's generally only to support some sort of interoperation with Windows or Windows applications. CLI was (to some extent) intended to be a successor to COM, and to some extent it is. CLI does make it relatively easy to design interfaces between components that (for example) are written in different .NET languages. That is an important qualifier though -- while it's easy to support components in different .NET languages, supporting a component in a language that isn't directly supported on .NET is doubly difficult.

Java Beans are obviously Java-only. At least in my opinion, they're a lousy design anyway.

Linux D-Bus has pretty much the same problems as COM except that its native system has a much smaller market share.

One of the big problems (as I see it) is that nearly every time somebody tries to define something like this, they look at data being marshalled between components, and realize that this data could just about as easily be sent over a network connection as directly from one component to another. Even though the observation is true in itself, I will posit that this is a mistake for at least two reasons.

The first reason is that even though it's true that you can send the data over a network connection just about as easily as between components on one machine, supporting such a distributed system in a meaningful way requires a lot more than just that. Suddenly, you add in requirements for authentication/security, locating a server for a particular service, etc. Solving these, in turn adds still more problems. Just for example, security often involves keys that are only valid for a limited period of time. That, in turn, requires synchronizing all the clocks in the network.

The second reason is that even though from a purely conceptual viewpoint, it doesn't matter whether a function is being called on machine A or machine B, from a practical viewpoint it often does. A network connection imposes much greater overhead than a normal function call -- enough more that you nearly always need to take the difference in overhead into account when designing a system. If you're going to invoke a service on some machine that may be halfway across the country, connected by a dozen hops with an overhead in the range of tens or hundreds of milliseconds, you need to ensure that you make a *few* function calls, each of which accomplishes a great deal. Even via some neutral interface, if the caller and callee are on the same computer, we can expect the overhead of calling a "remote" function to be *much* lower -- nanoseconds to (possibly) low microseconds, not tens to quite possibly hundreds of milliseconds.
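To put rough numbers on that difference, here is a minimal sketch of the arithmetic. The function name and the specific figures (1,000 calls, 50 ms per network round trip, 1 microsecond per local call) are assumptions picked from the ranges mentioned above, not measurements.

```cpp
#include <cstdint>

// Total fixed overhead, in nanoseconds, of making `calls` calls that each
// pay `overhead_ns` before doing any useful work. Hypothetical helper,
// used only to illustrate the arithmetic.
constexpr std::int64_t total_overhead_ns(std::int64_t calls,
                                         std::int64_t overhead_ns) {
    return calls * overhead_ns;
}
```

With those assumed figures, total_overhead_ns(1000, 50000000) comes to 50 seconds of pure overhead across the network, while total_overhead_ns(1000, 1000) comes to one millisecond on the local machine. An interface that is pleasant to use at one of those costs is unlikely to be tolerable at the other.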

These existing systems are poorly suited to the relatively small, simple, types of systems I have in mind. I do think something better could be done (or maybe just popularized -- D-Bus is actually quite good). I'm a lot less certain that anybody's putting much (if any) real effort into doing it though. Bottom line: for most practical purposes, you can't do it now, and probably won't be able to very soon either.

Sunday, September 11, 2011

More about version control

Over the last few weeks, I've been thinking a little more deeply about version control, and why I see it as a problematic situation at the present time.

After some thought, I've decided that the problem is that version control has been (and currently is) primarily oriented toward (what are intended to be) complete products. I think, instead, development in version control should be oriented toward components, with some defined protocols for those components to use when they talk to each other.

Defining these protocols would be non-trivial, but I think doing so would go a long way toward curing what I see as problems with version control at the present time. It takes virtually no looking at all to find multitudes of opinions about the strengths and weaknesses of various version control systems -- Git has a great merge engine, but a very expert-oriented UI and is really only usable on Linux. Mercurial is more friendly and portable, but doesn't have as good of a merge engine. Depending on the kinds of changes involved, one might be substantially faster than the other at propagating changes from one repository to another. Windowing systems for either (to the extent they're available at all) work by generating command lines and spawning the appropriate commands. Virtually every reasonably modern version control system defines at least one protocol to make repositories available via a network. Even though almost everybody agrees that distributed version control is generally a better long-term solution, many still argue that Subversion is enough simpler to justify its use in many smaller shops.

IMO, this leads to a problematic situation: a version control system includes so many parts with such minimal relationships with each other, that virtually every one of them has quite a few shortcomings for nearly every real situation (nearly the only obvious possible exception being Git for Linux kernel development).

Now, as I see it, a version control system is primarily a file system. Where a "normal" file system sees each file as a single item, in a version control system each file may actually be multiple items. In some cases, those items form essentially a single line over time, so there's a single current file at any given point in time. In other cases, "a file" may include multiple "threads", each of which has a current version at any given point in time.

At the very least, this means a version control system needs not only a way of specifying a file, but a point along the timeline for that file. The versions that support multiple branches of a file also need a way of specifying which branch is desired. For convenience, we want some way to do one or both for an entire (often quite large) group of files.
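As a rough illustration of that addressing scheme, the sketch below names a version by path, branch and position along the branch's timeline. Everything here (VersionSpec, ToyRepository, the in-memory storage) is hypothetical and invented for this example; it doesn't reflect how any real version control system stores its data.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical address of one version of one file.
struct VersionSpec {
    std::string path;      // which file
    std::string branch;    // which "thread" of that file
    std::size_t revision;  // position along that branch's timeline
};

// Toy in-memory store: path -> branch -> successive contents.
class ToyRepository {
public:
    // Appends a new version to the given branch and returns its revision number.
    std::size_t commit(const std::string& path, const std::string& branch,
                       const std::string& contents) {
        std::vector<std::string>& timeline = files_[path][branch];
        timeline.push_back(contents);
        return timeline.size() - 1;
    }

    // Returns the requested contents, or nullptr if no such version exists.
    const std::string* lookup(const VersionSpec& spec) const {
        auto file = files_.find(spec.path);
        if (file == files_.end()) return nullptr;
        auto branch = file->second.find(spec.branch);
        if (branch == file->second.end()) return nullptr;
        if (spec.revision >= branch->second.size()) return nullptr;
        return &branch->second[spec.revision];
    }

private:
    std::map<std::string, std::map<std::string, std::vector<std::string>>> files_;
};
```

A file with a single line of history is just the special case with one branch. Applying the same branch and point in time to a whole group of files would be layered on top of something like this.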

To support the file system, you need at least a few components: the most basic and universal is a differencing engine -- something to look at two versions of a file, and figure out some (hopefully minimal) set of changes to transform one file into the other. This lets us store multiple versions of a file as a single file with multiple (usually fairly small) sets of changes instead of lots of copies of the whole file.
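To make the idea concrete, here is about the simplest differencing scheme I can think of: strip the common prefix and suffix, and record whatever is left in the middle as a single replacement. This is only a sketch under that assumption. Real diff engines use considerably smarter algorithms (longest-common-subsequence based, for example), and the Delta, make_delta and apply_delta names are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// One edit: at `start`, drop `removed` characters and put `inserted` there.
struct Delta {
    std::size_t start;
    std::size_t removed;
    std::string inserted;
};

// Computes a single-edit delta that turns `from` into `to`.
Delta make_delta(const std::string& from, const std::string& to) {
    const std::size_t shorter = std::min(from.size(), to.size());

    std::size_t prefix = 0;
    while (prefix < shorter && from[prefix] == to[prefix])
        ++prefix;

    // The suffix may not overlap the prefix already matched.
    std::size_t suffix = 0;
    while (suffix < shorter - prefix &&
           from[from.size() - 1 - suffix] == to[to.size() - 1 - suffix])
        ++suffix;

    return Delta{prefix,
                 from.size() - prefix - suffix,
                 to.substr(prefix, to.size() - prefix - suffix)};
}

// Rebuilds the newer version from the older one plus the delta.
std::string apply_delta(const std::string& from, const Delta& delta) {
    std::string result = from;
    result.replace(delta.start, delta.removed, delta.inserted);
    return result;
}
```

Storing the first version in full plus a Delta for each later version is the space saving described above: going from "hello world" to "hello brave world" costs only the six inserted characters and two numbers.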

Another that's increasingly common and important is more or less the opposite: a merge engine. Given some starting file and some sets of changes, a merge engine combines those changes into a single "current" file: it automatically finds all the changes that don't conflict with each other and includes them all, and makes it easy for a person to select which changes to include when there are conflicts.
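The core decision rule of a three-way merge can be sketched in a few lines. To keep it short, this sketch assumes all three versions have the same number of lines (no insertions or deletions); a real merge engine first aligns the versions with a diff. The MergeResult and merge3 names are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct MergeResult {
    std::vector<std::string> lines;      // merged text
    std::vector<std::size_t> conflicts;  // line indices a person must resolve
};

// Merges two edited copies ("ours" and "theirs") of a common ancestor ("base").
// Simplifying assumption: all three have the same number of lines.
MergeResult merge3(const std::vector<std::string>& base,
                   const std::vector<std::string>& ours,
                   const std::vector<std::string>& theirs) {
    if (base.size() != ours.size() || base.size() != theirs.size())
        throw std::invalid_argument("this sketch needs equal line counts");

    MergeResult result;
    for (std::size_t i = 0; i < base.size(); ++i) {
        if (ours[i] == theirs[i]) {
            result.lines.push_back(ours[i]);    // both agree (changed or not)
        } else if (ours[i] == base[i]) {
            result.lines.push_back(theirs[i]);  // only they changed this line
        } else if (theirs[i] == base[i]) {
            result.lines.push_back(ours[i]);    // only we changed this line
        } else {
            result.lines.push_back(base[i]);    // both changed it differently
            result.conflicts.push_back(i);      // leave the choice to a person
        }
    }
    return result;
}
```

Keeping the base line as a placeholder for a conflict is an arbitrary choice made for this sketch; the point is only that the non-conflicting changes are applied automatically and the rest are reported.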

As mentioned above, for a reasonably modern system we also normally want some way to make a repository available over a network -- possibly only an internal network, but for many purposes, making it directly available to external users via the Internet is useful as well.

In the current situation, all of those (mostly unrelated) parts are embodied in a single product. Somebody who wants to specialize in text merging has to spend a fair amount of time getting to know one code base and how its merge engine talks to the rest of the system before he can make his ideas usable with one system. Having done that, his ideas remain useful only for that one system, so if somebody wants to do the same cool stuff with another system, it's probably going to take a lot of work learning both systems well enough to figure out how that code works, and how it'll need to work differently in the new system, and translate from one to the other. It's enough work that even if somebody does it once, chances are pretty good that it'll lead to forked development, with the two having to be maintained in parallel from that point forward. The same, of course, is true of all the other pieces and parts -- the diff engine, the network protocols, etc.

When looking at actual code, most programmers agree on things like "single responsibility principle", "Don't Repeat Yourself (DRY)", etc. Within a single code base, most of us apply these (albeit with varying degrees of rigor). Strangely, however, applying these same principles at a larger scale seems quite unusual -- and that's exactly what I see here. Practices we wouldn't even consider tolerating within a code base, are completely normal and expected across code bases.

Now, it is certainly true that version control is not unique in this respect. Early in the history of (almost?) any kind of tool, it's common to see more or less exploratory programming, where people with all sorts of different ideas do all sorts of radically different things. Over time, however, the field tends to narrow to the point that it's reasonable to standardize many of those -- and explicitly state what parts are being treated as abstractions with standardized interfaces but implementation open to variation. Looking at the "field", however, I think version control has reached the point that it's ripe for standardization.

Given the degree to which version control is taken as the sine qua non of software engineering in general (e.g., one of the questions on the "Joel Test" is "Do you use source control?") it seems...interesting that little has been done to apply the principles of software engineering not only to the code for a specific version control product, but to version control as a whole.

Friday, July 8, 2011

How to have a fanatical user base.

This is (at least) as much a rant as anything really informative. It's not really about writing code either, but about designing and packaging code so it stands a chance of being liked and used instead of frustrating and angering people who want to use it. Since I happen to be frustrated with it at the moment, I'm going to use TortoiseHg as an example.

The idea of TortoiseHg is great: a GUI front-end for the Mercurial version control system. In reality, however, it has a few design flaws and a lot of problems with packaging that render it an absolute nightmare.

Let's consider my current situation (largely because I think it's probably fairly typical for a lot of people). I've programmed for a long time. I've used various version control systems over the years. After reading about the coolness of distributed version control, I've decided it's time to switch. I've done some research and after reading a lot about the pros and cons of each, finally decided to install Mercurial.

I've also set up an account on Bitbucket, and want to be able to synchronize (at least some of) my repositories to/from there, so 1) I have an offsite backup of that code, 2) I can share it with other people, and 3) I can easily get to it when I'm on the road.

Since I do most of my programming on Windows, and quite a bit of it in Visual Studio, I decide to download the TortoiseHg package. This includes Mercurial itself and a graphical front end (for those who don't know, TortoiseXxx are all version control front-ends -- TortoiseSVN, TortoiseGit, TortoiseHg, etc.)

Anyway, running the installer goes pretty smoothly. After it's done, I can start up the TortoiseHg Workbench without any problems either. On the positive side, I guess I should note that this shows real progress over the situation a few decades ago, when it probably would have taken most of a day to accomplish even that much.

I've read the (seemingly) excellent HgInit web site. It tells me my first step is to create a repository, which seems reasonable enough. It tells me: "And getting a repository is super-easy: you just go into the top-level directory where all your code lives, [...] and you type 'hg init'." Since I'm using TortoiseHg, I get to pick my directory with a normal selection dialog and such, but otherwise that step goes smoothly enough too.

From there, things go downhill pretty quickly though. After reading a bit more, I realize that the next step is going to be to commit all the files in that directory to the version control. Maybe I'm just a wimp, but after re-checking and noting that the directory I've selected contains over 13 gigabytes, 10,000 directories and about 85,000 files, I think about things for a moment and decide that maybe it would be better to start out with something a bit smaller.

Unfortunately, it appears I've made a mistake that's relatively difficult to fix. It's easy to "remove that repository from the registry", but that still leaves a ".hg" directory that I don't want, and its permissions have been locked down to the point that attempting to delete it fails, even though I'm logged in as an administrator. I eventually do manage to delete it, but I have to change the permissions to give myself full control and remove the read-only bit on it before I can.

This leads to my first point: especially when a new user is trying to get started with your software, ignore the possibility that they might make mistakes. Furthermore, ensure that recovering from such mistakes is as difficult as possible. This gets you started toward a fanatical user base by filtering out all but the most dedicated almost immediately.

Anyway, after a bit more thought, I decide to start out with something smaller -- somebody has asked me to share a particular program with them, and I've set up a repository on Bitbucket (which also uses Mercurial) for it. I'll just set up a repository for that particular code for now, and once I've got things figured out a bit better, I'll deal with the rest of the code.

Of course, when I'm trying to do this, I mis-type the name of the directory where my code lives, so it creates a new directory and puts a repository in it, so I get to do the same thing all over again -- five more minutes wasted to recover from a simple one-letter typo (largely because for some insane reason, it decides to always start browsing for the right directory from my Windows system directory, which hardly seems like a likely place for a version control repository).

Okay, after that I finally manage to create a repository for the correct directory. I even manage to add the files without too much incident. My next step is to get this code on to Bitbucket. When I created the repository on Bitbucket, it gave me a path, so this should be easy, right?

Wrong! First, I seem to have a couple of options: I can push the changes from my local repository, or I can clone my local repository. Or, more accurately, both of those *seem* to be available, but neither one actually works. It appears that TortoiseHg will only let me clone my repository to a local path. When I paste in the path Bitbucket gave me and click the clone button, it tells me: "Error Creating destination folder. Please specify a different path."

This leads to my second major point: abominable error messages. About all this tells me is that it didn't work and I should try something else. Instead of providing any help about how to do things right, it's recommending that I use trial and error until I find something that at least seems to work. Again, I'll add a historical note: a few decades ago, it would have been pretty common for this to be overtly abusive (words like "stupid" or "idiot" in error messages used to be pretty common), whereas now the abuse is tacit, covert, and even politely phrased. I believe this is real progress -- users were sometimes amused by over-the-top abusive error messages, where the politely unhelpful one can only be frustrating. Again, this helps a great deal in filtering out any users who might not be absolute fanatics.

Fortunately, the recommended trial and error doesn't seem to accomplish much. At least as far as I can tell so far, the only path that will work is to a local directory. "Distributed" version control that only works on a local machine is definitely a good step toward fanaticism.

Okay, maybe it'll be better to push the changes from this repository to the remote one instead. There's even a nice button that says it will "push outgoing changes to selected URL". That seems reasonable -- but clicking it simply tells me: "No URL Selected. An URL must be selected for this operation." While that seems like a reasonable idea, it tells me nothing about where or how to actually select the URL to use for this operation. Doing a bit of looking at the Settings dialog doesn't turn up anything either.

Doing a bit more clicking around, I find when I right click on the line for the current revision of the code, I get a popup menu with an "Export" sub-menu, but only one of the entries there ("copy patch") is enabled. That doesn't seem terribly promising, but what the heck -- nothing else has worked. Unfortunately, selecting it doesn't seem to do anything useful either (in fact it's not at all apparent what, if anything, it has done at all).

This leads to another obvious major point: when there's a problem, you definitely do not want to provide any information that might help fix it. It's definitely better to leave the user wandering around, helplessly flailing until they run across something that seems like it might work. If you can manage to at least implicitly blame the user, that's definitely a step in the right direction as well.

Well, maybe it's easiest to just ignore the GUI front end for the moment, and use the command line tools instead. Right clicking on the name of the repository pops up a menu that shows (among other things) a terminal entry with the icon for a normal command prompt. I'll just use that, type in the command -- that's not such a big problem, right?

Wrong again! Clicking on that menu entry does open a command prompt (good) and the current directory is even the directory for that code (better). Unfortunately, when I try to give any Mercurial command (they all start with 'hg') I get the standard error message from the command processor saying that:

'hg' is not recognized as an internal or external command,
operable program or batch file.

That's right: I'm at the Mercurial command prompt, but they haven't bothered to put Mercurial itself on my path! To execute a Mercurial command at all, I have to figure out/remember where Mercurial is installed, and add that directory to my path manually (and I apparently have to re-do that every time I'm going to use it, too).

That leads to another major point: stupid roadblocks. Adding the right directory to the path epitomizes what people hate about software. You definitely want to keep the user occupied with some tedious, nitpicky details rather than ever get a chance to actually accomplish anything. Knowing the right directory name to add to the path becomes like the secret decoder ring that separates those "in the know" from the unwashed masses of outsiders.

To reiterate a few of the major points: first, ensure that only the absolute most dedicated potential user is going to work his way past all this nonsense to ever actually become a user of the product at all. If an "average" user survives even half the hurdles you've placed in their way, you've failed.

The few people who work their way past all the obstacles and get the thing to work are going to be proud of it. It's basically similar to a college initiation. Even though they won't directly admit it to outsiders, knowing that most (if not all) your fellow users have survived an ordeal similar to your own builds camaraderie.

Finally, knowing all the tricks necessary to use a package effectively tends to separate people into "us" and "them". Experienced users of the package undoubtedly recognize immediately at least a half dozen "stupid" errors in what I did above, and can probably describe exactly how it's easy to do what I wanted.

Being a bit more serious for a moment, while these lead to a fanatical user base, they also lead to a small one. I will posit that I'm not a particularly stupid person. I don't know how to use this package, but the software seems to provide little help for even an intelligent person to progress from ignorant to informed. Quite the contrary, it seems (at least to the ignorant outsider) almost as if it's designed specifically to be as frustrating and difficult as possible.

I should add one last point: although I've picked on TortoiseHg for this particular post, it's far from unique. Many packages have problems similar to those above -- others have problems that are even more severe and more numerous.

Anyway, having gotten that off my chest, I return you to your normally scheduled flaming.

Tuesday, May 24, 2011

Some months ago, there was a bit of hoopla over a blog post by Eric White showing (or at least purporting to show) how C#'s LINQ produced much shorter code than you'd get from writing procedural code. In this case, the example used was writing a minimal hex dump -- take some input from a file, convert the bytes to hexadecimal, and write it out 16 bytes per line.

He says he searched for existing code to do the job, and found that it took around 30 lines of (procedural) code. He then gives the following piece of code using LINQ:


byte[] ba = File.ReadAllBytes("test.xml");
int bytesPerLine = 16;
string hexDump = ba.Select((c, i) => new { Char = c, Chunk = i / bytesPerLine })
  .GroupBy(c => c.Chunk)
  .Select(g => g.Select(c => String.Format("{0:X2} ", c.Char))
      .Aggregate((s, i) => s + i))
  .Select((s, i) => String.Format("{0:d6}: {1}", i * bytesPerLine, s))
  .Aggregate("", (s, i) => s + i + Environment.NewLine);
Console.WriteLine(hexDump);

Perhaps it's just showing my lack of expertise in LINQ, but this does not strike me as particularly transparent code (though, I suppose it may be an improvement on what he found -- it's essentially impossible to say without seeing that code).

As far as line count goes, I guess I'm a bit uncertain. The original blog post makes rather a point of this being only 7 lines of code (in fact "7-lines-of-code" is even part of its URI). At least as it's formatted in the original post, it looks like 9 lines to me though. Maybe this is due to reformatting, or maybe he's not counting the two lines of variable definitions at the beginning (although they do seem to be needed for the code to work).

Unless I'm badly mistaken, you also need to add a few more lines before a C# compiler will accept this code. In particular, I believe you need to put the code into a method (a member function, if you prefer) which must, itself, be part of a class. Despite the "7 lines" claim, by the time you make it actually work, you're probably up around a dozen lines of code or so. I point this out not to impute any dishonesty on Eric's part, or anything like that, but only to try to set a fair standard for comparison -- if you're going to write a complete program that does this, the standard of comparison isn't 7 lines, but somewhere around 12-14.

Now, it seems like an obvious reaction to this would be to wonder exactly what could be done in procedural code. Using something like Perl (and possibly Python) I'd almost expect this to be a one-liner (a couple of lines at very most). To try to keep things a bit more fair, let's stick to the baseline, "canonical" procedural language: C. Let's also skip any "code golf" tricks, and see what we get just writing the most obvious code we can for the job:


#include <stdio.h>

int main(int argc, char **argv) {
   FILE *f = fopen(argv[1], "rb");
   int ch, offset = 0, bytes_per_line=16;
   while ((ch=getc(f))!= EOF) {
       if (offset % bytes_per_line == 0)
           printf("\n%6.6d: ", offset);
       ++offset;
       printf("%2.2x ", ch);
   }
}

As it stands, this is 12 lines of code (including one blank line) rather than 7. On the other hand, it is a complete, compilable, working program (and even accepts the file name to dump as a command line argument instead of hard-coding it, so it's semi-usable as-is, not purely demo code). Its obvious shortcoming is that it makes no attempt at verifying that there is an argv[1] before trying to use it.
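For what it's worth, here is a sketch of what closing that hole might look like, at the cost of a higher line count. It keeps the same stdio approach but is written to compile as C++; the split into hex_dump and run, and the wording of the usage message, are my own assumptions rather than part of the original program.

```cpp
#include <cstdio>

// Writes a hex dump of `in` to `out`, 16 bytes per line, in the same format
// as the short program above. Returns the number of bytes dumped.
long hex_dump(std::FILE* in, std::FILE* out) {
    const int bytes_per_line = 16;
    long offset = 0;
    int ch;
    while ((ch = std::getc(in)) != EOF) {
        if (offset % bytes_per_line == 0)
            std::fprintf(out, "\n%6.6ld: ", offset);
        ++offset;
        std::fprintf(out, "%2.2x ", ch);
    }
    return offset;
}

// Hypothetical driver: what main() would do with the checks added.
int run(int argc, char** argv) {
    if (argc < 2) {                      // the check the short version skips
        std::fprintf(stderr, "usage: hexdump <file>\n");
        return 1;
    }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (f == nullptr) {                  // the file may not exist or open
        std::perror(argv[1]);
        return 1;
    }
    hex_dump(f, stdout);
    std::fclose(f);
    return 0;
}
```

A real main() would simply return run(argc, argv). The checks roughly double the length, which is part of why line counts for toy programs are such a slippery basis for comparison.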

If you go purely by lines of code, this is about the same length as the LINQ version. If you do like code golfers and count the number of characters in the source code rather than the number of lines, it's about half as long as the LINQ code.

Maybe it's just bias on my part, but to me the C version also seems substantially easier to read. I'd expect that even (for example) a programmer who only knew a language with rather different syntax (something like Ada, for example) would probably be able to figure out exactly how it works with very little difficulty. I'm a lot less certain about that with the LINQ version -- it's not drastically worse, but I'm not at all sure it's as good, and I have difficulty imagining anybody thinking it's any better.

I was a bit curious about the 30-lines Eric quoted for comparison. With a little bit of work, I did come up with a version in C++ that came to around the 30 line range he cited:


#include <iostream>
#include <algorithm>
#include <string>
#include <iomanip>
#include <fstream>
#include <iterator>
#include <ios>

void show_hex(char ch) {
    static int offset = 0;
    static const int bytes_per_line = 16;
    if (offset % bytes_per_line == 0)
        std::cout << "\n" 
                  << std::setw(4) 
                  << std::setprecision(4) 
                  << std::setfill('0') 
                  << std::hex 
                  << offset 
                  << ":";
    ++offset;
    std::cout << " " 
              << std::hex 
              << std::setw(2) 
              << std::setprecision(2) 
              << std::setfill('0') 
              << (int)(unsigned char)ch;
}

int main(int argc, char **argv) {
    std::ifstream in(argv[1], std::ios::binary);
    std::noskipws(in);
    std::for_each(std::istream_iterator<char>(in),
                  std::istream_iterator<char>(),
                  show_hex);
    return 0;
}

This seems to me like it's stretching the point a bit though: to get to the 30-line mark, I've had to format each manipulator on its own line of code, and count the #include lines. I'm the first to admit (or more often, complain) that iostreams formatting is excessively verbose, but dedicating a line to each manipulator is still working pretty hard at inflating the line count.

When I first saw it, the claim of 30 lines of code for a procedural version seemed a bit high, but I didn't give it much more thought than that. Thinking about it since, I've become somewhat intrigued with trying to figure out how you'd write something this simple to take up that many lines of code in a reasonably meaningful way.

One thought that occurred to me was doing my own hex conversion instead of using something from the standard. Given the verbosity of iostreams manipulators, however, it turns out that code to do the conversion yourself ends up a few lines shorter than the code to tell iostreams how you want it done (at least if you put one manipulator per line, as I did above):


#include <iostream>
#include <algorithm>
#include <string>
#include <fstream>
#include <iterator>

std::string to_hex(unsigned input, unsigned width) { 
    std::string ret(width, '0');
    static const char vals[] = "0123456789abcdef";
    while (input != 0 && width != 0) {
        ret[--width] = vals[input % 16];
        input /= 16;
    }
    return ret;
}

void show_hex(char ch) {
    static int offset = 0;
    static const int bytes_per_line = 16;
    if (offset % bytes_per_line == 0)
        std::cout << "\n" << to_hex(offset, 4) << ":";
    ++offset;
    std::cout << " "  << to_hex(ch, 2);
}

int main(int argc, char **argv) {
    std::ifstream in(argv[1], std::ios::binary);
    std::noskipws(in);
    std::for_each(std::istream_iterator<char>(in),
                  std::istream_iterator<char>(),
                  show_hex);
    return 0;
}

Straying from the original point for a moment, I think it is reasonable to call something "verbose" when setting up the parameters to get it to do something takes more code than just doing that work without its help. This, perhaps, is one of the reasons code reuse ends up failing so often: it takes extremely careful design to make code for simple tasks easier to reuse than duplicate.

To summarize: I think this case really favors C over the alternatives. Compiling the C version with a C++ compiler works fine, so using C++ doesn't (necessarily) lose anything, but doesn't seem to gain much either. Getting iostreams involved strikes me as a net loss. While manipulators are reasonably readable individually, in a case like this, the value you're writing gets lost in the mass of manipulators to get it written as you want it.

The LINQ version seems to me to combine the worst of both: conversion specifications that are nearly as cryptic as C's, with overall syntax nearly as verbose as the worst part of the worst C++ version, all (apparently) in an attempt at imitating the parts of SQL that I (at least) find the most irritating. I found the LINQ version unimpressive to start with; with more thought, it seems even more underwhelming (though I hasten to add that this is not intended as a condemnation of LINQ in general).

Sunday, March 6, 2011

Reading files

In case I have any regular readers, I feel obliged to point out that this is somewhere between a rant and (we'll hope) just being aimed at somebody other than you.

In the last week or so, I think I've seen at least a dozen (okay, probably only half that) posts on StackOverflow that included loops something like the following¹:


while (!somestream.eof()) {
    read_data();
    process_data();
}

Let me be clear about this: a loop like this just doesn't work. It's possible (though not particularly easy) to add enough other conditions to exit from the loop at the right time, but if you do it will always² be those other conditions that end the loop, not the loop condition itself. The loop condition itself can't exit the loop at the right time. The problem is that "whatever_stream.eof()" only becomes true after a read fails because you've reached the end of the file. After you've read the last item from the file, "eof()" remains false. You then try to read another item, which fails, but your code executes the rest of the loop, which processes the data as if the read had succeeded. The typical result is that your code appears to read the last item from the file twice.
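Here is a small, self-contained way to watch that happen. It's a sketch that reads integers from a string instead of a file, and it assumes the input ends with a newline (as text files usually do), so the last successful read stops before the end of the stream. The function name is made up for the example.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Deliberately broken: tests eof() *before* reading instead of testing
// whether the read itself succeeded.
std::vector<int> read_with_eof_loop(const std::string& text) {
    std::istringstream in(text);
    std::vector<int> values;
    int value = 0;
    while (!in.eof()) {
        in >> value;              // the final attempt fails, leaving `value` as it was...
        values.push_back(value);  // ...but it gets "processed" anyway
    }
    return values;
}
```

Given "1 2 3\n", the loop body runs four times: three successful reads, then one failed read whose stale value is stored again, so the 3 shows up twice.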

At least with the way the stream conditions work in C and C++, the simplest way to write a loop that works is something like this instead:


while (read succeeded)
   process data

This can be expressed in any of a number of ways. In C, one common possibility looks something like this:


while (fgets(buffer, buffer_len, file))
    process(buffer);

Another common possibility is for reading one character at a time:


while (EOF != (ch=getchar()))
    process(ch);

A slightly less common version looks something like this:


while (fread(buffer, item_size, items, file))
    process(buffer);

A rough equivalent to the first in C++ looks something like this:


while (std::getline(inputfile, somestring))
    process(somestring);

Another possibility in C++ checks the state of the stream after an extraction operator:


while (infile >> some_data)
    process(some_data);
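Fleshed out into something you can compile, that last pattern might look like the sketch below. The read_all name and the choice of int as the data type are assumptions for the sake of the example.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Reads integers until a read fails, whether from end of input or bad data.
std::vector<int> read_all(std::istream& in) {
    std::vector<int> values;
    int value;
    while (in >> value)           // the read *is* the loop condition
        values.push_back(value);  // only reached after a successful read
    return values;
}
```

Because the body runs only after a successful extraction, there's no way to process a stale value, and the loop also stops cleanly if the input contains something that isn't a number.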

Perhaps the cleanest method in C++ is to use a standard algorithm along with an istream_iterator to create the loop implicitly. For one example:


std::transform(std::istream_iterator<T>(infile),
               std::istream_iterator<T>(),
               std::back_inserter(some_collection),
               operation);
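As a complete (if trivial) instance of the algorithm approach, the sketch below reads integers and stores each one doubled. The doubled and read_and_double names, and the use of int, are placeholders I've picked for illustration.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

// Stand-in for whatever per-item operation you actually need.
int doubled(int x) { return 2 * x; }

// Reads every int from `in`, applies doubled() to each, and collects the results.
std::vector<int> read_and_double(std::istream& in) {
    std::vector<int> results;
    std::transform(std::istream_iterator<int>(in),  // begins reading from `in`
                   std::istream_iterator<int>(),    // end-of-stream marker
                   std::back_inserter(results),
                   doubled);
    return results;
}
```

The iterator reaches its end state when a read fails, so the "test the read, then process" ordering is built in rather than something you have to remember to write.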

Any of these can and will work. Personally, I prefer to use an algorithm in most cases, but I'll admit that's a matter of preference, not a requirement. The other loops that test the result of a read can and will do the job perfectly well. Unfortunately, the one using `.eof()` cannot, will not and doesn't work -- pretty much ever. It can't, it won't, and it doesn't.

I've seen a few posts that basically tried to argue that with enough patches it's possible to make it sort of work, or cover up for the fact that it doesn't work. Most of these seem to miss one simple fact though: the easiest way to get from the code that doesn't work to code that does work is to throw out the original loop and start over. Trying to patch together something that even sort of works part of the time is more work than just starting over and writing the code correctly from the beginning.

Whether you can patch them together or not, let me paraphrase C.A.R. Hoare: don't make it so complex that it's not obviously incorrect. Make it so simple that it's obviously correct. If you follow one of the patterns above, correctness will be obvious to almost anybody and everybody who knows what they're looking at. Starting with a loop of the form while (!whatever.eof()) can work -- but even at best, almost nobody will really know it works, at least without a lot of extra work to be sure they've caught every possible corner case.

¹ Of course, in reality, read_data() and process_data() were never presented as functions, and in most cases the loop bodies were fairly long and complex -- but let's ignore that for the moment.
² Okay, I suppose there might be some exception to this, but if so I don't know what it is.
