Tuesday, May 24, 2011

Some months ago, there was a bit of hoopla over a blog post by Eric White showing (or at least purporting to show) how C#'s LINQ produced much shorter code than you'd get from writing procedural code. The example used was writing a minimal hex dump -- take some input from a file, convert the bytes to hexadecimal, and write them out 16 bytes per line.

He says he searched for existing code to do the job, and found that it took around 30 lines of (procedural) code. He then gives the following piece of code using LINQ:


byte[] ba = File.ReadAllBytes("test.xml");
int bytesPerLine = 16;
string hexDump = ba.Select((c, i) => new { Char = c, Chunk = i / bytesPerLine })
                   .GroupBy(c => c.Chunk)
                   .Select(g => g.Select(c => String.Format("{0:X2} ", c.Char))
                                 .Aggregate((s, i) => s + i))
                   .Select((s, i) => String.Format("{0:d6}: {1}", i * bytesPerLine, s))
                   .Aggregate("", (s, i) => s + i + Environment.NewLine);
Console.WriteLine(hexDump);


Perhaps it's just showing my lack of expertise in LINQ, but this does not strike me as particularly transparent code (though, I suppose it may be an improvement on what he found -- it's essentially impossible to say without seeing that code).
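For what it's worth, here's my reading of what that pipeline actually does, restated step by step in Python (my own paraphrase, using itertools.groupby to stand in for LINQ's GroupBy):

```python
from itertools import groupby

def hex_dump_linq_style(data, bytes_per_line=16):
    # Select with index: pair each byte with its chunk number.
    indexed = [(b, i // bytes_per_line) for i, b in enumerate(data)]
    # GroupBy: collect the bytes of each chunk together.
    chunks = [[b for b, _ in grp] for _, grp in groupby(indexed, key=lambda t: t[1])]
    # Inner Select + Aggregate: format each byte as "XX " and concatenate.
    rows = ["".join("%02X " % b for b in chunk) for chunk in chunks]
    # Outer Select: prefix each row with its six-digit decimal offset.
    lines = ["%06d: %s" % (i * bytes_per_line, row) for i, row in enumerate(rows)]
    # Final Aggregate: append a newline after every line and concatenate.
    return "".join(line + "\n" for line in lines)
```

Spelled out this way, the structure is simple enough; what makes the LINQ version hard to read, at least for me, is packing all five steps into a single expression.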

As far as line count goes, I guess I'm a bit uncertain. The original blog post makes rather a point of this being only 7 lines of code (in fact "7-lines-of-code" is even part of its URI). At least as it's formatted in the original post, it looks like 9 lines to me though. Maybe this is due to reformatting, or maybe he's not counting the two lines of variable definitions at the beginning (although they do seem to be needed for the code to work).

Unless I'm badly mistaken, you also need to add a few more lines before a C# compiler will accept this code. In particular, I believe you need to put the code into a method (a member function, if you prefer) which must, itself, be part of a class. Despite the "7 lines" claim, by the time you make it actually work, you're probably up around a dozen lines of code or so. I point this out not to impute any dishonesty to Eric, or anything like that, but only to try to set a fair standard for comparison -- if you're going to write a complete program that does this, the standard of comparison isn't 7 lines, but somewhere around 12-14.

Now, it seems like an obvious reaction to this would be to wonder exactly what could be done in procedural code. Using something like Perl (and possibly Python) I'd almost expect this to be a one-liner (a couple of lines at very most). To try to keep things a bit more fair, let's stick to the baseline, "canonical" procedural language: C. Let's also skip any "code golf" tricks, and see what we get just writing the most obvious code we can for the job:


#include <stdio.h>

int main(int argc, char **argv) {
    FILE *f = fopen(argv[1], "rb");
    int ch, offset = 0, bytes_per_line = 16;
    while ((ch = getc(f)) != EOF) {
        if (offset % bytes_per_line == 0)
            printf("\n%6.6d: ", offset);
        ++offset;
        printf("%2.2x ", ch);
    }
}


As it stands, this is 12 lines of code (including one blank line) rather than 7. On the other hand, it is a complete, compilable, working program (and even accepts the file name to dump as a command line argument instead of hard-coding it, so it's semi-usable as-is, not purely demo code). Its obvious shortcoming is that it makes no attempt at verifying that there is an argv[1] before trying to use it.

If you go purely by lines of code, this is about the same length as the LINQ version. If you do like code golfers and count the number of characters in the source code rather than the number of lines, it's about half as long as the LINQ code.

Maybe it's just bias on my part, but to me the C version also seems substantially easier to read. I'd expect that even a programmer who only knew a language with rather different syntax (Ada, for example) would probably be able to figure out exactly how it works with very little difficulty. I'm a lot less certain about that with the LINQ version -- it's not drastically worse, but I'm not at all sure it's as good, and I have difficulty imagining anybody thinking it's any better.
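As an aside, the scripting-language guess from earlier seems about right. A quick sketch of my own in Python (not code from any of the posts involved), written plainly rather than golfed:

```python
def hex_dump(data, bytes_per_line=16):
    # Group the bytes 16 per line, prefix each line with its decimal
    # offset, and format each byte as two hex digits.
    lines = []
    for offset in range(0, len(data), bytes_per_line):
        chunk = data[offset:offset + bytes_per_line]
        hexed = " ".join("%2.2x" % b for b in chunk)
        lines.append("%6.6d: %s" % (offset, hexed))
    return "\n".join(lines)
```

Reading the file and printing the result adds a line or two, which still leaves it in roughly the same range as the C version; a true one-liner would take some golfing.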

I was a bit curious about the 30 lines Eric quoted for comparison. With a little bit of work, I did come up with a C++ version in roughly the 30-line range he cited:


#include <iostream>
#include <algorithm>
#include <string>
#include <iomanip>
#include <fstream>
#include <iterator>
#include <ios>

void show_hex(char ch) {
    static int offset = 0;
    static const int bytes_per_line = 16;
    if (offset % bytes_per_line == 0)
        std::cout << "\n"
                  << std::setw(4)
                  << std::setprecision(4)
                  << std::setfill('0')
                  << std::hex
                  << offset
                  << ":";
    ++offset;
    std::cout << " "
              << std::hex
              << std::setw(2)
              << std::setprecision(2)
              << std::setfill('0')
              << (int)(unsigned char)ch;
}

int main(int argc, char **argv) {
    std::ifstream in(argv[1]);
    std::noskipws(in);
    std::for_each(std::istream_iterator<char>(in),
                  std::istream_iterator<char>(),
                  show_hex);
    return 0;
}


This seems to me like it's stretching the point a bit though: to get to the 30-line mark, I've had to format each manipulator on its own line of code, and count the #include lines. I'm the first to admit (or more often, complain) that iostreams formatting is excessively verbose, but dedicating a line to each manipulator is still working pretty hard at inflating the line count.

When I first saw it, the claim of 30 lines of code for a procedural version seemed a bit high, but I didn't give it much more thought than that. Thinking about it since, I've become somewhat intrigued with trying to figure out how you'd write something this simple to take up that many lines of code in a reasonably meaningful way.

One thought that occurred to me was doing my own hex conversion instead of using something from the standard library. Given the verbosity of iostreams manipulators, however, it turns out that code to do the conversion yourself ends up a few lines shorter than the code to tell iostreams how you want it done (at least if you put one manipulator per line, as I did above):


#include <iostream>
#include <algorithm>
#include <string>
#include <fstream>
#include <iterator>

std::string to_hex(unsigned input, unsigned width) {
    std::string ret(width, '0');
    static const char vals[] = "0123456789abcdef";
    while (input != 0 && width != 0) {
        ret[--width] = vals[input % 16];
        input /= 16;
    }
    return ret;
}

void show_hex(char ch) {
    static int offset = 0;
    static const int bytes_per_line = 16;
    if (offset % bytes_per_line == 0)
        std::cout << "\n" << to_hex(offset, 4) << ":";
    ++offset;
    std::cout << " " << to_hex((unsigned char)ch, 2);
}

int main(int argc, char **argv) {
    std::ifstream in(argv[1]);
    std::noskipws(in);
    std::for_each(std::istream_iterator<char>(in),
                  std::istream_iterator<char>(),
                  show_hex);
    return 0;
}


Straying from the original point for a moment, I think it is reasonable to call something "verbose" when setting up the parameters to get it to do something takes more code than just doing that work without its help. This, perhaps, is one of the reasons code reuse ends up failing so often: it takes extremely careful design to make code for simple tasks easier to reuse than to duplicate.
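Python's format mini-language makes a useful contrast here: it's one of the comparatively rare cases where the reusable form really is shorter than doing the work by hand. A quick sketch (the manual function below mirrors the C++ to_hex above):

```python
def to_hex(value, width):
    # Manual conversion, mirroring the C++ to_hex above: fill the
    # result from the right with hex digits, leaving leading zeros.
    digits = "0123456789abcdef"
    out = ["0"] * width
    while value and width:
        width -= 1
        out[width] = digits[value % 16]
        value //= 16
    return "".join(out)

# The built-in format spec does the same job in one short expression.
assert to_hex(255, 4) == format(255, "04x")  # both give "00ff"
```

When the reusable version wins that decisively, people use it; when it takes more setup than the manual version (as with the manipulators above), they don't.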

To summarize: I think this case really favors C over the alternatives. Compiling the C version with a C++ compiler works fine, so using C++ doesn't (necessarily) lose anything, but doesn't seem to gain much either. Getting iostreams involved strikes me as a net loss. While manipulators are reasonably readable individually, in a case like this, the value you're writing gets lost in the mass of manipulators to get it written as you want it.

The LINQ version seems to me to combine the worst of both: conversion specifications that are nearly as cryptic as C's, with overall syntax nearly as verbose as the worst part of the worst C++ version, all (apparently) in an attempt at imitating the parts of SQL that I (at least) find the most irritating. I found the LINQ version unimpressive to start with; with more thought, it seems even more underwhelming (though I hasten to add that this is not intended as a condemnation of LINQ in general).