Sunday, September 11, 2011

More about version control

Over the last few weeks, I've been thinking a little more deeply about version control, and why I find its current state problematic.

After some thought, I've decided that the problem is that version control has been (and currently is) primarily oriented toward (what are intended to be) complete products. I think, instead, development in version control should be oriented toward components, with some defined protocols for those components to use when they talk to each other.

Defining these protocols would be non-trivial, but I think doing so would go a long way toward curing what I see as problems with version control at the present time. It takes virtually no looking at all to find multitudes of opinions about the strengths and weaknesses of various version control systems -- Git has a great merge engine, but a very expert-oriented UI and is really only usable on Linux. Mercurial is more friendly and portable, but doesn't have as good a merge engine. Depending on the kinds of changes involved, one might be substantially faster than the other at propagating changes from one repository to another. Windowing systems for either (to the extent they're available at all) work by generating command lines and spawning the appropriate commands. Virtually every reasonably modern version control system defines at least one protocol to make repositories available via a network. Even though almost everybody agrees that distributed version control is generally a better long-term solution, many still argue that Subversion is enough simpler to justify its use in many smaller shops.

IMO, this leads to a problematic situation: a version control system includes so many parts, with such minimal relationships to each other, that virtually every system has quite a few shortcomings for nearly every real situation (nearly the only obvious possible exception being Git for Linux kernel development).

Now, as I see it, a version control system is primarily a file system. Where a "normal" file system sees each file as a single item, in a version control system each file may actually be multiple items. In some cases, those items form essentially a single line over time, so there's a single current file at any given point in time. In other cases, "a file" may include multiple "threads", each of which has a current version at any given point in time.

At the very least, this means a version control system needs not only a way of specifying a file, but also a way of specifying a point along the timeline for that file. The systems that support multiple branches of a file also need a way of specifying which branch is desired. For convenience, we want some way to do one or both for an entire (often quite large) group of files.
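To make that addressing scheme concrete, here is a minimal sketch in Python. The names (`VersionedFile`, `Repository`, `snapshot`) are my own inventions for illustration and don't correspond to any real system. It stores whole copies of every version, which is the naive approach; the point is only to show a file being addressed by path, branch, and position on the timeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VersionedFile:
    # Hypothetical layout: branch name -> contents of each version, oldest first.
    branches: Dict[str, List[str]] = field(default_factory=dict)

    def commit(self, branch: str, content: str) -> int:
        """Append a new version to `branch` and return its revision number."""
        history = self.branches.setdefault(branch, [])
        history.append(content)
        return len(history) - 1

    def read(self, branch: str = "main", revision: Optional[int] = None) -> str:
        """Return one revision on `branch`, or the newest if none is given."""
        history = self.branches[branch]
        return history[-1] if revision is None else history[revision]


@dataclass
class Repository:
    files: Dict[str, VersionedFile] = field(default_factory=dict)

    def commit(self, path: str, content: str, branch: str = "main") -> int:
        return self.files.setdefault(path, VersionedFile()).commit(branch, content)

    def read(self, path: str, branch: str = "main",
             revision: Optional[int] = None) -> str:
        return self.files[path].read(branch, revision)

    def snapshot(self, branch: str = "main") -> Dict[str, str]:
        """Current content of every file that exists on `branch`."""
        return {path: f.read(branch)
                for path, f in self.files.items() if branch in f.branches}
```

Here `read("a.txt")` names the current file, `read("a.txt", revision=0)` names a point on its timeline, `read("a.txt", branch="experiment")` picks a different thread, and `snapshot` does the same for a whole group of files at once.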

To support the file system, you need at least a few components: the most basic and universal is a differencing engine -- something to look at two versions of a file, and figure out some (hopefully minimal) set of changes to transform one file into the other. This lets us store multiple versions of a file as a single file with multiple (usually fairly small) sets of changes instead of lots of copies of the whole file.
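As a rough illustration of what a differencing engine does, the sketch below uses Python's standard `difflib.SequenceMatcher` to compute a line-based set of changes and then replays them. The function names `make_delta` and `apply_delta` are hypothetical, and `difflib` doesn't promise a minimal delta, so treat this as one possible engine rather than how any particular version control system does it.

```python
import difflib
from typing import List, Tuple

# One edit: replace old_lines[start:end] with the given replacement lines.
Edit = Tuple[int, int, List[str]]


def make_delta(old_lines: List[str], new_lines: List[str]) -> List[Edit]:
    """Return the edits that transform old_lines into new_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old_lines: List[str], delta: List[Edit]) -> List[str]:
    """Rebuild the new version from the old version plus a delta."""
    result: List[str] = []
    pos = 0
    for start, end, replacement in delta:
        result.extend(old_lines[pos:start])  # unchanged lines before the edit
        result.extend(replacement)           # the changed lines themselves
        pos = end
    result.extend(old_lines[pos:])           # unchanged tail
    return result
```

A repository would then keep one full copy of the file plus a delta per later version, and rebuild any version by applying the deltas in order.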

Another that's increasingly common and important is more or less the opposite: a merge engine. Given some starting file and some sets of changes, a merge engine automatically finds all the changes that don't conflict with each other and includes them all in a single "current" file, and makes it easy for a person to select which changes to include when there are conflicts.
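The idea can be sketched as a simple line-based three-way merge. This is a deliberately conservative toy, not the algorithm used by Git, Mercurial, or anything else: it treats changes that touch or sit directly next to each other as conflicts, and the names `merge3` and `Conflict` are made up for the example. Non-conflicting changes from both sides are applied automatically; conflicting regions are returned as `Conflict` records for a person to resolve.

```python
import difflib
from typing import List, NamedTuple, Tuple, Union

Edit = Tuple[int, int, List[str]]  # replace base[start:end] with these lines


class Conflict(NamedTuple):
    base: List[str]
    ours: List[str]
    theirs: List[str]


def edits(base: List[str], other: List[str]) -> List[Edit]:
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2, other[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_edits(base: List[str], group: List[Edit], lo: int, hi: int) -> List[str]:
    """Apply edits that all lie within [lo, hi] to the slice base[lo:hi]."""
    out: List[str] = []
    pos = lo
    for start, end, replacement in group:
        out.extend(base[pos:start])
        out.extend(replacement)
        pos = end
    out.extend(base[pos:hi])
    return out


def merge3(base: List[str], ours: List[str],
           theirs: List[str]) -> List[Union[str, Conflict]]:
    a, b = edits(base, ours), edits(base, theirs)
    merged: List[Union[str, Conflict]] = []
    pos = i = j = 0
    while i < len(a) or j < len(b):
        # Start a region at the earliest pending edit, then absorb every
        # edit from either side that touches the region so far.
        lo = min(e[0] for e in a[i:i + 1] + b[j:j + 1])
        hi = lo
        group_a: List[Edit] = []
        group_b: List[Edit] = []
        grew = True
        while grew:
            grew = False
            if i < len(a) and a[i][0] <= hi:
                group_a.append(a[i]); hi = max(hi, a[i][1]); i += 1; grew = True
            if j < len(b) and b[j][0] <= hi:
                group_b.append(b[j]); hi = max(hi, b[j][1]); j += 1; grew = True
        merged.extend(base[pos:lo])  # lines neither side changed
        ours_part = apply_edits(base, group_a, lo, hi)
        theirs_part = apply_edits(base, group_b, lo, hi)
        if not group_b or ours_part == theirs_part:
            merged.extend(ours_part)      # only we changed it, or both agree
        elif not group_a:
            merged.extend(theirs_part)    # only they changed it
        else:
            merged.append(Conflict(base[lo:hi], ours_part, theirs_part))
        pos = hi
    merged.extend(base[pos:])
    return merged
```

A front end of any kind (command line, GUI, editor plugin) could walk the result, pass plain lines through, and ask the user to choose whenever it meets a `Conflict`.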

As mentioned above, for a reasonably modern system we also normally want some way to make a repository available over a network -- possibly only an internal network, but for many purposes, making it directly available to external users via the Internet is useful as well.

In the current situation, all of those (mostly unrelated) parts are embodied in a single product. Somebody who wants to specialize in text merging has to spend a fair amount of time getting to know one code base and how its merge engine talks to the rest of the system before he can make his ideas usable with one system. Even after all that, his ideas remain useful only for that one system, so if somebody wants to do the same cool stuff with another system, it's probably going to take a lot of work learning both systems well enough to figure out how that code works, and how it'll need to work differently in the new system, and translate from one to the other. It's enough work that even if somebody does it once, chances are pretty good that it'll lead to forked development, with the two having to be maintained in parallel from that point forward. The same, of course, is true of all the other pieces and parts -- the diff engine, the network protocols, etc.
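To give a feel for the alternative, here is a small sketch of what a component boundary might look like, using Python's `typing.Protocol`. Everything in it is hypothetical: the `DiffEngine` interface, the two toy engines, and the `History` class are assumptions made up for illustration, not a proposal for what the real protocols should be. The point is only that the storage code never needs to know which engine it's talking to.

```python
from typing import Any, List, Optional, Protocol


class DiffEngine(Protocol):
    """Hypothetical interface: any object with these two methods qualifies."""
    def diff(self, old: List[str], new: List[str]) -> Any: ...
    def patch(self, old: List[str], delta: Any) -> List[str]: ...


class FullCopyDiff:
    """Trivial engine: the 'delta' is simply the whole new file."""
    def diff(self, old: List[str], new: List[str]) -> Any:
        return list(new)

    def patch(self, old: List[str], delta: Any) -> List[str]:
        return list(delta)


class TrimmedDiff:
    """Slightly smarter engine: store only what lies between the
    common prefix and the common suffix of the two versions."""
    def diff(self, old: List[str], new: List[str]) -> Any:
        limit = min(len(old), len(new))
        p = 0
        while p < limit and old[p] == new[p]:
            p += 1
        s = 0
        while s < limit - p and old[-1 - s] == new[-1 - s]:
            s += 1
        return (p, s, new[p:len(new) - s])

    def patch(self, old: List[str], delta: Any) -> List[str]:
        p, s, middle = delta
        return old[:p] + middle + old[len(old) - s:]


class History:
    """Stores one file's versions as deltas, using whatever engine it is given."""
    def __init__(self, engine: DiffEngine) -> None:
        self.engine = engine
        self.deltas: List[Any] = []

    def commit(self, new_lines: List[str]) -> None:
        self.deltas.append(self.engine.diff(self.checkout(), new_lines))

    def checkout(self, revision: Optional[int] = None) -> List[str]:
        stop = len(self.deltas) if revision is None else revision + 1
        lines: List[str] = []
        for delta in self.deltas[:stop]:
            lines = self.engine.patch(lines, delta)
        return lines
```

`History(FullCopyDiff())` and `History(TrimmedDiff())` behave identically from the outside, so somebody with a better differencing idea would only have to implement two methods to make it usable everywhere that interface is honored.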

When looking at actual code, most programmers agree on things like the "single responsibility principle", "Don't Repeat Yourself (DRY)", etc. Within a single code base, most of us apply these (albeit with varying degrees of rigor). Strangely, however, applying these same principles at a larger scale seems quite unusual -- and that's exactly what I see here. Practices we wouldn't even consider tolerating within a code base are completely normal and expected across code bases.

Now, it is certainly true that version control is not unique in this respect. Early in the history of (almost?) any kind of tool, it's common to see more or less exploratory programming, where people with all sorts of different ideas do all sorts of radically different things. Over time, however, the field tends to narrow to the point that it's reasonable to standardize many of those -- and explicitly state what parts are being treated as abstractions with standardized interfaces but implementation open to variation. Looking at the "field" today, I think version control has reached the point that it's ripe for standardization.

Given the degree to which version control is taken as the sine qua non of software engineering in general (e.g., one of the questions on the "Joel Test" is "Do you use source control?") it seems...interesting that little has been done to apply the principles of software engineering not only to the code for a specific version control product, but to version control as a whole.
