Friday, December 7, 2012

Heaps and heapsort

Like most programmers, quite some time ago I'd read about heaps and the heapsort. Being a dutiful student, I could quote the standard line, saying that a heap was an implicit tree where node N was less than (or greater than, depending on the type of heap) nodes 2N and 2N+1. Like (I think) most, I'd also implemented a heapsort -- but to be honest, most of really came down to a line by line translation from the description in Knuth Volume 3 into the syntax of the programming language I was using at the time (Pascal, the first time around, and later the same again in C).

For years, given that I'd written a Heap sort that worked, I considered my knowledge of heaps and the heap sort entirely adequate. To be entirely honest, however, during most of that time I never really fully understood heaps. In particular, I never really understood the 2N and 2N+1 -- why those particular numbers/nodes? What was the significance of 2N and 2N+1? Why not 3N or 1.5N+65 or something? In short, that "implicit tree" had never really gelled for me -- I'd read the words, but never truly understood what they meant or how the "tree" worked.

For some reason a few months ago my lack of understanding on that particular subject bubbled to the top of the heap (I know: puns are the death of real humor) and I decided to implement a heap (and heap sort) entirely on my own, based only on my recollection about 2N and 2N+1. Instead of going to the book and translating syntax, I sat down with a supply of paper and pencils (the most underrated programming tools ever) and re-invented the heap and the algorithms to create and maintain a heap from first principles.

So, what did I do to finally and truly understand a heap? It ultimately came down to one piece of paper:

And from there came true understanding at last. If we arranged the nodes by layers, it "just happened" that the children of every node N were nodes 2N and 2N+1. Better still, after three layers of the tree, it was pretty apparent that as/if I added more to the tree, the same relationships would hold. Only 30 years (or so) after I'd first written a heap sort, I finally understood what I was doing!

I finally also understood why people who understood heaps thought they were cool: since the "links" in the tree are all based on (relatively simple) math, it's a tree you can walk almost any direction you'd care to think about by simply doing the right math. Just for example, in a normal binary tree, the only way to get from a node to its parent is to traverse down the tree, and remember the parent on the way there so when you need it back, you can recall it from wherever you have it stored. Likewise, getting to the sibling of a node requires that you get back to the parent (see above) and then take the other branch downward to get to the sibling.

With a heap, however, both of these are simply trivial math: the children of node N are nodes 2N and 2N+1, so the parent of node M is node M/2 (using truncating/integer math). Likewise, the sibling of node N is either node N+1 or N-1 -- if N is even, the sibling is N+1, and if N is odd, the sibling is N-1.

So, for any given node, we can walk through the tree up, down, or sideways (literally) with a single, simple math operation. Better still, all the nodes are contiguous in memory, with no space used up on links, so it's also pretty cache-friendly in general.

There was one other point I ran into repeatedly as I worked at understanding what I was doing well enough to produce working code: the math all depends on 1-based indexing. Making it work right with 0-based indexing like most currently languages use was not trivial. In fact, I finally decided that for this purpose it was better to maintain the illusion of 1-based addressing for most of the code, and since I was writing this in C++, have a truly trivial class to handle the translation from 1-base to 0-based addressing (i.e., subtract one from every index I used). As simple of an operation as that seems like it should be, keeping it in one place still simplified the rest of the code tremendously (and made it much easier to see the relationship between the code and that fundamental 2N/2N+1 relationship that defines the heap).

Anyway, in case anybody cares to look at the resulting code, I've decided that trying to post significant amounts of code directly here is too painful to be worthwhile, so here is a link to it on ideone.com. I think this is long enough for now, but sometime soon I'll probably write a post going through that code in more detail, explaining how it works and why I did things the way I did.