Monday, September 19, 2011

Still more about version control

This is (more or less) a counterpoint to my last post. For those who found that post forgettable (or unreadable), its primary point was that version control systems combine too many separate (and mostly unrelated) components into a single package. Instead of being built that way, the individual components should be specified, giving users the ability to combine the pieces as necessary to fit their needs.

This post is mostly about why that hasn't happened (not only with version control, but with many similar tools as well), and why, even if it did, the result would probably be of little practical use. The short summary of the reason is that nobody has designed a usable way to specify/define interfaces between components like this that's independent of both language and underlying operating system.

There are certainly a number of candidates, including CORBA, COM and CLI, Java Beans, Linux D-Bus, etc. Each of these, however, has problems, at least from my perspective.

CORBA is almost a dead issue -- it's been around for decades now, and never achieved enough market share to really matter to most people. It also imposes a fair amount of overhead in many cases, ORBs are often fairly expensive, and so on. While it's sometimes used for really big systems, for something like version control it's probably overkill even at best.

COM might be a (halfway) reasonable answer on Windows, but Windows is really the only place it's native and (even reasonably) well supported. On some other systems there's some support, but it's generally only there to allow some sort of interoperation with Windows or Windows applications. The CLI (the .NET Common Language Infrastructure) was (to some extent) intended as a successor to COM, and to some extent it is one. The CLI does make it relatively easy to design interfaces between components that (for example) are written in different .NET languages. That's an important qualifier though -- while it's easy to support components in different .NET languages, supporting a component in a language that isn't directly supported on .NET is considerably more difficult.

Java Beans are obviously Java-only. At least in my opinion, they're a lousy design anyway.

Linux D-Bus has pretty much the same problems as COM except that its native system has a much smaller market share.

One of the big problems (as I see it) is that nearly every time somebody tries to define something like this, they look at the data being marshalled between components, and realize that this data could just about as easily be sent over a network connection as passed directly from one component to another. Even though the observation is true in itself, I'd posit that designing around it is a mistake, for at least two reasons.

The first reason is that even though it's true that you can send the data over a network connection just about as easily as between components on one machine, supporting such a distributed system in a meaningful way requires a lot more than just that. Suddenly, you add in requirements for authentication/security, locating a server for a particular service, etc. Solving these, in turn, adds still more problems. Just for example, security often involves keys that are only valid for a limited period of time. That, in turn, requires synchronizing all the clocks in the network.

The second reason is that even though from a purely conceptual viewpoint it doesn't matter whether a function is being called on machine A or machine B, from a practical viewpoint it often does. A network connection imposes much greater overhead than a normal function call -- enough that you nearly always need to take the difference into account when designing a system. If you're going to invoke a service on some machine that may be halfway across the country, connected by a dozen hops with an overhead in the range of tens or hundreds of milliseconds, you need to ensure that you make a *few* function calls, each of which accomplishes a great deal. Even via some neutral interface, if the caller and callee are on the same computer, we can expect the overhead of calling a "remote" function to be *much* lower -- nanoseconds to (possibly) low microseconds, not tens to quite possibly hundreds of milliseconds.
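
To put rough numbers on that, here's a back-of-the-envelope sketch in Python. Both latency figures are assumptions chosen for illustration (roughly plausible orders of magnitude), not measurements:

```python
# Illustrative call-overhead arithmetic; the latencies are assumptions.
IN_PROCESS_CALL = 100e-9  # ~100 ns: local call through a neutral interface
WAN_ROUND_TRIP = 50e-3    # ~50 ms: round trip across a dozen hops

def total_overhead(calls: int, per_call: float) -> float:
    """Pure call overhead for `calls` invocations, ignoring real work."""
    return calls * per_call

# A "chatty" design: fetch 10,000 file revisions one call at a time.
print(total_overhead(10_000, IN_PROCESS_CALL))  # ~0.001 s in-process
print(total_overhead(10_000, WAN_ROUND_TRIP))   # ~500 s across a WAN

# A "chunky" design: one call that returns all 10,000 revisions.
print(total_overhead(1, WAN_ROUND_TRIP))        # ~0.05 s across a WAN
```

The exact numbers don't matter; what matters is that a design that's perfectly sensible in-process becomes unusable across a network, so the two cases can't honestly be treated as interchangeable.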

These existing systems are poorly suited to the relatively small, simple types of systems I have in mind. I do think something better could be done (or maybe just popularized -- D-Bus is actually quite good). I'm a lot less certain that anybody's putting much (if any) real effort into doing it, though. Bottom line: for most practical purposes, you can't do it now, and probably won't be able to very soon either.

Sunday, September 11, 2011

More about version control

Over the last few weeks, I've been thinking a little more deeply about version control, and why I see the current situation as problematic.

After some thought, I've decided that the problem is that version control has been (and currently is) primarily oriented toward (what are intended to be) complete products. I think, instead, version control development should be oriented toward components, with some defined protocols for those components to use when they talk to each other.
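
To make "defined protocols" slightly more concrete, here's a minimal sketch in Python, using typing.Protocol. The names and signatures (DiffEngine, MergeEngine) are hypothetical -- invented for illustration, not taken from any existing system -- and a real specification would need to pin down far more than this:

```python
# Hypothetical component interfaces: any implementation matching these
# signatures could be dropped into any system that consumes them.
from typing import Protocol, Sequence

class DiffEngine(Protocol):
    def diff(self, old: Sequence[str], new: Sequence[str]) -> list[str]:
        """Return a (hopefully minimal) set of changes turning old into new."""
        ...

class MergeEngine(Protocol):
    def merge(self,
              base: Sequence[str],
              ours: Sequence[str],
              theirs: Sequence[str]) -> tuple[list[str], bool]:
        """Combine two derived versions; the bool reports conflicts."""
        ...
```

The payoff would be that somebody who specializes in (say) merging writes against MergeEngine once, and every system that speaks the same protocol can use the result.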

Defining these protocols would be non-trivial, but I think doing so would go a long way toward curing what I see as problems with version control at the present time. It takes virtually no looking at all to find multitudes of opinions about the strengths and weaknesses of various version control systems -- Git has a great merge engine, but a very expert-oriented UI, and is really only usable on Linux. Mercurial is friendlier and more portable, but doesn't have as good a merge engine. Depending on the kinds of changes involved, one might be substantially faster than the other at propagating changes from one repository to another. Windowing front-ends for either (to the extent they're available at all) work by generating command lines and spawning the appropriate commands. Virtually every reasonably modern version control system defines at least one protocol to make repositories available via a network. And even though almost everybody agrees that distributed version control is generally a better long-term solution, many still argue that Subversion is enough simpler to justify its use in many smaller shops.

IMO, this leads to a problematic situation: a version control system includes so many parts with such minimal relationships to each other that virtually every system has quite a few shortcomings for nearly every real situation (nearly the only obvious exception being Git for Linux kernel development).

Now, as I see it, a version control system is primarily a file system. Where a "normal" file system sees each file as a single item, in a version control system each file may actually be multiple items. In some cases, those items form essentially a single line over time, so there's a single current version at any given point in time. In other cases, "a file" may include multiple "threads" (branches), each of which has a current version at any given point in time.

At the very least, this means a version control system needs not only a way of specifying a file, but a point along the timeline for that file. Systems that support multiple branches of a file also need a way of specifying which branch is desired. For convenience, we want some way to specify one or both for an entire (often quite large) group of files at once.
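
As a sketch, the minimum needed to name "a version of a file" might look something like this in Python (the type and field names are hypothetical, chosen purely for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileVersion:
    path: str      # which file
    branch: str    # which line of development ("thread")
    revision: int  # which point along that branch's timeline

# The same coordinates applied to an entire (often large) group of
# files at once, for convenience.
@dataclass(frozen=True)
class TreeVersion:
    branch: str
    revision: int
```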

To support the file system, you need at least a few components: the most basic and universal is a differencing engine -- something to look at two versions of a file, and figure out some (hopefully minimal) set of changes to transform one file into the other. This lets us store multiple versions of a file as a single file with multiple (usually fairly small) sets of changes instead of lots of copies of the whole file.
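
Python's standard difflib is enough to sketch the idea; here it produces a unified diff between two versions of a file, which is exactly the kind of (usually fairly small) change set a repository can store in place of a second full copy (the file and revision names are hypothetical):

```python
import difflib

old = ["def greet():\n", "    print('hello')\n"]
new = ["def greet(name):\n", "    print(f'hello, {name}')\n"]

# A small set of changes that transforms `old` into `new`.
delta = difflib.unified_diff(old, new,
                             fromfile="greet.py@r1",
                             tofile="greet.py@r2")
print("".join(delta))
```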

Another that's increasingly common and important is more or less the opposite: a merge engine. Given some starting file and several sets of changes, a merge engine combines those changes into a single "current" file: it automatically finds all the changes that don't conflict with each other and includes them all, and it makes it easy for a person to select which changes to include when there are conflicts.
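
Here's a deliberately naive three-way merge along those lines, as a Python sketch. It applies non-overlapping changes from both sides automatically and wraps overlapping, differing changes in conflict markers for a person to resolve; a real merge engine handles many cases this one ignores:

```python
import difflib

def _changes(base, derived):
    """Map each changed region of `base` to its replacement in `derived`."""
    ops = difflib.SequenceMatcher(None, base, derived).get_opcodes()
    return [(i1, i2, derived[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def three_way_merge(base, ours, theirs):
    """Return (merged_lines, had_conflicts). A sketch only: it ignores
    some pathological overlaps a real merge engine must handle."""
    a, b = _changes(base, ours), _changes(base, theirs)
    out, pos, had_conflicts = [], 0, False
    while a and b:
        (a1, a2, arep), (b1, b2, brep) = a[0], b[0]
        if a2 <= b1 and (a1, a2) != (b1, b2):    # ours' change comes first
            out += base[pos:a1] + arep; pos = a2; a.pop(0)
        elif b2 <= a1 and (a1, a2) != (b1, b2):  # theirs' change comes first
            out += base[pos:b1] + brep; pos = b2; b.pop(0)
        elif a[0] == b[0]:                       # both made the same change
            out += base[pos:a1] + arep; pos = a2; a.pop(0); b.pop(0)
        else:                                    # overlapping and different
            lo, hi = min(a1, b1), max(a2, b2)
            out += base[pos:lo]
            out += (["<<<<<<< ours\n"] + base[lo:a1] + arep + base[a2:hi]
                    + ["=======\n"] + base[lo:b1] + brep + base[b2:hi]
                    + [">>>>>>> theirs\n"])
            pos = hi; a.pop(0); b.pop(0); had_conflicts = True
    for i1, i2, rep in a or b:                   # one side's leftovers
        out += base[pos:i1] + rep; pos = i2
    return out + base[pos:], had_conflicts

base = ["one\n", "two\n", "three\n"]
ours = ["one\n", "TWO\n", "three\n"]              # changed the second line
theirs = ["one\n", "two\n", "three\n", "four\n"]  # appended a line
merged, conflicted = three_way_merge(base, ours, theirs)
print("".join(merged), conflicted)  # both changes applied; no conflict
```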

As mentioned above, for a reasonably modern system we also normally want some way to make a repository available over a network -- possibly only an internal network, but for many purposes, making it directly available to external users via the Internet is useful as well.
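
As the smallest possible sketch of that piece, here's a read-only HTTP view of a repository in Python. The URL scheme and the in-memory "repository" are both hypothetical, invented purely for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory repository: (branch, revision, path) -> contents.
STORE = {("main", 1, "greet.py"): "def greet():\n    print('hello')\n"}

class RepoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expects URLs shaped like /<branch>/<revision>/<path>.
        try:
            _, branch, rev, path = self.path.split("/", 3)
            body = STORE[(branch, int(rev), path)]
        except (ValueError, KeyError):
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

# To serve it: HTTPServer(("", 8000), RepoHandler).serve_forever()
```

Of course, a real protocol would presumably want much chunkier operations than "fetch one file at a time".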

In the current situation, all of those (mostly unrelated) parts are embodied in a single product. Somebody who wants to specialize in text merging has to spend a fair amount of time getting to know one code base, and how its merge engine talks to the rest of the system, before he can make his ideas usable with even one system. Having done that, his ideas remain useful only for that one system; if somebody wants to do the same cool stuff with another system, it's probably going to take a lot of work to learn both systems well enough to figure out how the code works, how it'll need to work differently in the new system, and how to translate from one to the other. It's enough work that even if somebody does it once, chances are pretty good that it'll lead to forked development, with the two versions maintained in parallel from that point forward. The same, of course, is true of all the other pieces and parts -- the diff engine, the network protocols, etc.

When looking at actual code, most programmers agree on things like the "single responsibility principle", "Don't Repeat Yourself (DRY)", etc. Within a single code base, most of us apply these (albeit with varying degrees of rigor). Strangely, however, applying these same principles at a larger scale seems quite unusual -- and that's exactly what I see here. Practices we wouldn't even consider tolerating within a code base are completely normal and expected across code bases.

Now, it is certainly true that version control is not unique in this respect. Early in the history of (almost?) any kind of tool, it's common to see more or less exploratory programming, where people with all sorts of different ideas do all sorts of radically different things. Over time, however, the field tends to narrow to the point that it's reasonable to standardize many of those ideas -- and to explicitly state which parts are being treated as abstractions, with standardized interfaces but implementations open to variation. Looking at the field, I think version control has reached the point that it's ripe for exactly that kind of standardization.

Given the degree to which version control is taken as the sine qua non of software engineering in general (e.g., one of the questions on the "Joel Test" is "Do you use source control?") it seems...interesting that little has been done to apply the principles of software engineering not only to the code for a specific version control product, but to version control as a whole.