Monday, September 19, 2011

Still more about version control

This is (more or less) a counterpoint to my last post. For those who found it forgettable (or unreadable), the primary point of that post was that version control systems combine too many separate (and mostly unrelated) components into a single package. Instead of being built as a monolith, the individual components should be specified, giving users the ability to combine the pieces as necessary to fit their needs.

This post is mostly about why that hasn't happened (not only with version control, but with many similar tools as well), and why, even if it did, the result would probably be pointless and useless. The short summary: nobody has designed a usable way to specify/define interfaces between components like this that's independent of both the language and the underlying operating system.

There are certainly a number of candidates, including CORBA, COM and CLI, Java Beans, Linux D-Bus, etc. Each of these, however, has problems, at least from my perspective.

CORBA is almost a dead issue -- it's been around for decades now, and never achieved enough market share to really matter to most people. It also imposes a fair amount of overhead in many cases, and ORBs are often fairly expensive. While it's sometimes used for really big systems, for something like version control it's probably overkill even at best.

COM might be a (halfway) reasonable answer on Windows, but it's really only native to, and (even reasonably) well supported on, Windows. On other systems there's some support, but it generally exists only to enable some sort of interoperation with Windows or Windows applications. CLI was (to some extent) intended to be a successor to COM, and to some extent it is. CLI does make it relatively easy to design interfaces between components that (for example) are written in different .NET languages. That's an important qualifier though -- while it's easy to support components in different .NET languages, supporting a component in a language that isn't directly supported on .NET is doubly difficult.

Java Beans are obviously Java-only. At least in my opinion, they're a lousy design anyway.

Linux D-Bus has pretty much the same problems as COM except that its native system has a much smaller market share.

One of the big problems (as I see it) is that nearly every time somebody tries to define something like this, they look at data being marshalled between components, and realize that this data could just about as easily be sent over a network connection as directly from one component to another. Even though the observation is true in itself, I will posit that this is a mistake for at least two reasons.

The first reason is that even though it's true that you can send the data over a network connection just about as easily as between components on one machine, supporting such a distributed system in a meaningful way requires a lot more than just that. Suddenly, you add requirements for authentication/security, locating a server for a particular service, etc. Solving these, in turn, adds still more problems. Just for example, security often involves keys that are only valid for a limited period of time. That, in turn, requires synchronizing all the clocks in the network.

The second reason is that even though from a purely conceptual viewpoint, it doesn't matter whether a function is being called on machine A or machine B, from a practical viewpoint it often does. A network connection imposes much greater overhead than a normal function call -- enough that you nearly always need to take the difference into account when designing a system. If you're going to invoke a service on some machine that may be halfway across the country, connected by a dozen hops with an overhead in the range of tens or hundreds of milliseconds, you need to ensure that you make a *few* function calls, each of which accomplishes a great deal. Even via some neutral interface, if the caller and callee are on the same computer, we can expect the overhead of calling a "remote" function to be *much* lower -- nanoseconds to (possibly) low microseconds, not tens to quite possibly hundreds of milliseconds.
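That overhead difference is why an interface that might cross a network has to be coarse-grained. Here's a minimal sketch (all names and the latency figure are invented for illustration) contrasting a "chatty" interface, which pays one round trip per item, with a "chunky" one that batches everything into a single call:

```python
ROUND_TRIP_MS = 50  # assumed per-call network latency, for illustration

class RemoteStore:
    """Stands in for a service that might be reached over the network."""
    def __init__(self):
        self.calls = 0   # each call would be one network round trip
        self.data = {}

    def put(self, key, value):
        # Fine-grained: one round trip per item stored.
        self.calls += 1
        self.data[key] = value

    def put_many(self, items):
        # Coarse-grained: one round trip for the whole batch.
        self.calls += 1
        self.data.update(items)

items = {f"file{i}": f"contents{i}" for i in range(100)}

chatty = RemoteStore()
for k, v in items.items():
    chatty.put(k, v)

chunky = RemoteStore()
chunky.put_many(items)

# Same end state, wildly different simulated latency cost.
print(chatty.calls * ROUND_TRIP_MS)  # 5000 (ms)
print(chunky.calls * ROUND_TRIP_MS)  # 50 (ms)
```

At local-call latencies the hundred extra calls would be unmeasurable; at network latencies they dominate everything, which is exactly why the two cases can't share one interface design.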

These existing systems are poorly suited to the relatively small, simple, types of systems I have in mind. I do think something better could be done (or maybe just popularized -- D-Bus is actually quite good). I'm a lot less certain that anybody's putting much (if any) real effort into doing it though. Bottom line: for most practical purposes, you can't do it now, and probably won't be able to very soon either.


  1. I think there's a third reason the network transparency thing doesn't work: error handling. Adding networking to an application adds a whole new class of errors, ones that can't even be handled with the same kinds of responses. To stick with the example you mentioned (network clock synchronization), getting an error of that type probably means you can't attempt further operations at all.

    Handling network errors suitably well is much easier when you're dealing with serialization and communication explicitly.
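    As a sketch of that point (the function name and protocol here are hypothetical): when communication is explicit, network failures surface as distinct exception types the caller can catch and react to deliberately, instead of leaking out of something that looks like an ordinary local call:

```python
import socket

def fetch_remote(host, port, request, timeout=2.0):
    # Explicit communication: the network failure modes (timeout,
    # refused connection, unreachable host) are visible and catchable
    # right here, not hidden behind a "transparent" local-looking call.
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(request)
            return s.recv(4096)
    except socket.timeout:
        return None  # caller can retry, back off, or degrade gracefully
    except OSError:
        return None  # refused/unreachable: a different decision again

# Nothing is listening on this port, so the call fails fast and cleanly.
print(fetch_remote("127.0.0.1", 1, b"ping", timeout=0.5))
```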

  2. Also, I don't believe in technology unification at the API level. You need to keep bindings to several different programming languages, compilers, interpreters, operating systems, etc. in sync. People already have a hard time migrating from one platform _version_ to another. Multiply all these problems by the number of versions of your "unifying" technology.

    Obligatory XKCD reference:

    Adding yet another technology to the mix to try to "bring them all, and in the darkness bind them" is probably a bad move -- at least at the API level. In practice, I've had better results when using the same communication protocol explicitly from the different entities. For example, I use a simple and efficient web server (nginx) and delegate page generation over FastCGI to a bunch of isolated processes (php4, php5, python, ruby, clojure, etc.). This allows each of the applications to run with its own technology. Perhaps what version control needs is a set of communication standards. One example that comes to mind is the Unified Diff Format. You can quite easily ship that over the network when necessary -- until one of the tools in the lot decides to do fancier diffs.

    I think the real reason there hasn't been any unification in version control tools (or programming languages and environments) is that people can't agree on anything. I mean, after all these years, people can't even agree on the same text editor.
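    For what it's worth, the Unified Diff Format mentioned above is already producible straight from Python's standard library, which shows how low the bar for that kind of interchange standard is (the file names here are just illustrative):

```python
import difflib

old = ["line one\n", "line two\n", "line three\n"]
new = ["line one\n", "line 2\n", "line three\n"]

# unified_diff yields the familiar ---/+++ headers and @@ hunk markers,
# a format essentially every version control tool can consume.
diff = "".join(difflib.unified_diff(
    old, new, fromfile="a/notes.txt", tofile="b/notes.txt"))
print(diff)
```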

  3. Good points. Yes, it's certainly true that dealing with errors over the network is usually quite different from dealing with them locally.

    As far as editors go, however, I have to disagree a bit: I'd expect editors to be the last area where we'd see agreement. Asking for people to agree on what constitutes the perfect editor is like asking people to agree on what makes the perfect lover (and for many of the same reasons).