Thursday, February 4, 2016

Raw loops vs. standard algorithms revisited

I was recently reading a post on Meeting C++ about raw loops vs. STL algorithms.

I didn't think his solution using STL algorithms took full advantage of what the library really has to offer. He did use std::mismatch to find the point at which items mismatched, which is clearly a good thing. The rest of the code, however, looked to me like he was thinking primarily in terms of loops, and merely changed the syntax from a for loop to use std::for_each instead.

At least in my opinion, this doesn't really gain much. In my observation, using `std::for_each` is most often a mistake. It's still little more than a generic loop. The name of the algorithm used should usually give me a good idea of what some code is doing, but `std::for_each` really doesn't--I have to look at the content of the "loop" to get any idea of what it's really doing.

In this case, the basic idea (simplifying a tiny bit) is to take the paths to two files, and find a relative path from one to the other. The two paths have already been parsed into components held as strings in a vector, so our job is to generate a path with "../" for all the leading components that are identical, then copy the components from one of the vectors into the output after that (and finally append a file name on the end).

Just going from that description, some choices seem fairly obvious, at least to me. To generate the leading part of the path, the obvious first thought is probably `std::generate_n`. Then to copy the remaining elements, we probably want to use `std::copy`. Although it may not be quite as obvious just from the wording, `std::fill_n` actually works out a little better than `std::generate_n`, since every leading component is the same literal "..".

Our result is going to be in the form of a string. We want to append strings together to create the whole output. One obvious way to do that is to use an `std::stringstream`. So, let's start by doing a tiny bit of fudging to turn this into a standalone example (I don't think this has a significant effect on what we're doing). Rather than extract the vectors from a cache as he did, let's start with a couple of fixed vectors:

    std::vector<std::string> a = { "dir1", "dir2", "dir2_1", "p" };
    std::vector<std::string> b = { "dir1", "dir2", "dir2_2", "p" };

Then we do the step that (I think) he got right:

    auto pos_p = std::mismatch(a.begin(), a.end(), b.begin(), b.end());

From there, we can define our stringstream, and an iterator to give us access to it in a way that's friendly to standard algorithms:

    std::ostringstream path;

    std::ostream_iterator<std::string> out(path, "/");

Then we want to generate the leading part, then copy the trailing part of the result path:

    std::fill_n(out, pos_p.first - a.begin(), "..");
    std::copy(pos_p.second, b.end(), out);

At least in my opinion, this makes it much more apparent that the first part of the path is composed of instances of ".." separated by "/", and the second part is composed of elements from the second vector (again, separated by "/").

This hasn't shortened the code a lot (maybe not at all), but it has (at least in my opinion) made the actual intent much clearer.

For better or worse, I've seen a fair number of arguments/claims that using a stringstream and an ostream_iterator can produce substantially slower code than doing the job by appending data directly to a string. I haven't tried to run a benchmark to verify this, but for the moment, let's assume you were working with a compiler for which it really was true (and true to an extent that the slow-down was a serious concern).

If so, we could still use roughly the same structure--we just have to define our own iterator to handle appending things to a string instead of inserting them into a stringstream. We lose some versatility (we can only insert things that are already strings, rather than handling all sorts of conversions to strings like a stream can), but maybe we gain enough speed for that to be worthwhile. So, if we decide this is really worth doing, implementing it is pretty straightforward (more so than most people seem to think).

Our iterator to append to a string (with a little front-end function to avoid having to pass template parameters explicitly) can look something like this:


    template <class String>
    class appender_t : public std::iterator<std::output_iterator_tag, String> {
        String *s;
        String sep;
    public:
        appender_t(String &s, String sep) : s(&s), sep(sep) { }

        appender_t &operator=(String const &add) {
            *s += add + sep;
            return *this;
        }

        appender_t &operator++() { return *this; }

        appender_t &operator*() { return *this; }
    };

    template <class String>
    appender_t<String> appender(String &s, String sep) {
        return appender_t<String>(s, sep);
    }
Most of this is boilerplate--the only parts that really implement any logic at all are the constructor and `operator=`. The constructor is still pretty trivial: it just saves a pointer to the destination string and a copy of the separator used to initialize it. The assignment operator isn't much more complex--it just says that when a string is written through the iterator, we append that string plus the specified separator to the string we stored.

Using this, the rest of the code barely changes at all. We define our output and iterator a tiny bit differently:

    std::string path;

    auto out = appender(path, "/"s);

From there, however, virtually nothing has changed (which is pretty much the point of using iterators to start with--they isolate the algorithm from the container, so algorithms can work with different kinds of containers).

So anyway, using that we can eliminate the pesky `std::stringstream` intermediary, so we're just appending directly to a string, just like we were to start with--but now doing it with algorithms that actually fit the task at hand.
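
Just to pull the pieces together, here's a standalone version assembled from the snippets above (a sketch, so minor details may differ from the code linked below):

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>

    // appender_t/appender as defined above, compacted a little.
    template <class String>
    class appender_t : public std::iterator<std::output_iterator_tag, String> {
        String *s;
        String sep;
    public:
        appender_t(String &s, String sep) : s(&s), sep(sep) { }
        appender_t &operator=(String const &add) { *s += add + sep; return *this; }
        appender_t &operator++() { return *this; }
        appender_t &operator*() { return *this; }
    };

    template <class String>
    appender_t<String> appender(String &s, String sep) {
        return appender_t<String>(s, sep);
    }

    int main() {
        using namespace std::string_literals;

        std::vector<std::string> a = { "dir1", "dir2", "dir2_1", "p" };
        std::vector<std::string> b = { "dir1", "dir2", "dir2_2", "p" };

        auto pos_p = std::mismatch(a.begin(), a.end(), b.begin(), b.end());

        std::string path;
        auto out = appender(path, "/"s);

        std::fill_n(out, pos_p.first - a.begin(), ".."s);
        std::copy(pos_p.second, b.end(), out);

        std::cout << path << "\n";    // should print "../../dir2_2/p/"
    }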

A complete, working copy of the final code is available on Coliru.

I should probably add that this code is purely an attempt at producing exactly the same result as the code in the original article. As Luigi Ballabio was kind enough to point out in a comment, that's not really the right thing to do.

Saturday, August 9, 2014

Do Quora users actually have brains at all?

Right now I'm somewhat ashamed to admit it, but I have an account on Quora. Every week (or so--I don't pay very close attention) they send me an email digest of popular questions and answers. It might be an exaggeration (but if so, only a fairly small one) to say I was shocked by the question and answer that were most highly featured in their most recent email.

The question was: "Why do Americans seem to be so scared of a European/Canadian style of healthcare system?" The top-voted answer (at least as I read this) is: "In a word - fear" (yes, I cut and pasted that--if you think I must be making this up, I'd consider that entirely forgivable). Now, it's true that the author does go into more detail about what he thinks drives that fear, but "fear" is what he gives as the actual answer. So, "Why do Americans seem to be so scared" is answered with "fear", and (at least as I write this) that has received literally thousands of up-votes.

Unfortunately, the "answer" goes downhill from there. The author (a "Dan Munro", who gives an unsupported claim of "knows some healthcare stuff"). He gives four supposed reasons for his claim of "fear":

  1. A false assumption (with big political support) that a system based on universal coverage is the same thing as a single payer system.
  2. A fear of "rationing" - which was set ablaze by Sarah Palin and her cavalier remarks about "death panels."
  3. An attitude and culture of what's loosely known as American Exceptionalism.
  4. A fierce independence that has a really dark side (which he goes on to explain as an attitude that when/if anybody "fails", it's considered their own fault).

To put it simply, I've yet to see any real support for any of these. The only one that seems to have even the slightest basis in reality is the second. It is true that a few (Sarah Palin being the most obvious) have tried to generate fear on this basis. It may even be true that it has succeeded in at least some tiny number of cases. The vast majority of people who oppose (further) government regulation of healthcare, however, seem to find it nearly laughable.

The most ridiculous, by far, is the claim of "American Exceptionalism". While frequently advanced (relative to an almost surprising range of subjects), this seems to be a pure straw-man at least with respect to this question. I have quite literally never heard anybody dismiss an argument on the basis that "it came from Canada, Europe, or outside the US in general, and we can't possibly learn anything from 'them'." At least in my experience, it simply doesn't happen. I obviously can't possibly know exactly what every single person in the US believes or feels, but I've yet to see or hear anything that gives even the slightest shred of support for this particular belief.

He then goes on to quote some of the usual statistics about how much of the US GDP is spent on healthcare, and a study about deaths from preventable medical errors in US hospitals.

Unfortunately, the numbers he quotes (210,000 to 440,000 annually) seem to be open to a great deal of question. Depending on which study you prefer to trust, the number might be as low as 32,500 annually (and no other study I can find gives a number any higher than about 100,000 annually). Despite this, the largest number he can find is quoted as if it were an absolute fact that's not open to any question at all.

Worse, however, is a much simpler fact: he makes no attempt at comparing this result to the numbers for other countries, and (perhaps worst of all) he makes absolutely no attempt at telling us how the change(s) he apparently advocates would improve this in any way. So, even if we assume he's gotten the factual part right, we have absolutely *no* reason to believe any particular plan will improve it in any way.

Although I can't claim to speak for the US in general (or anybody else at all) in this regard, that leads directly toward a large part of the reason I have personally found it impossible to generate any enthusiasm for the plans that have been advanced to change healthcare in the US.

The usual argument seems to run something like this. The advocate starts by pointing to US citizens paying far more than others for healthcare, and having shorter average life spans. He then uses that to support his claim that we need to pass some particular health care plan he supports.

Unfortunately, nobody making such arguments seems to (even try to) "connect the dots" to show exactly how or why *their* plan will improve the problems they ascribe to the current system. Virtually none can provide any real breakdown of US healthcare costs vs. costs elsewhere, to indicate exactly what is driving the higher costs in the US. Absolutely none (that I've seen) takes any next step toward showing how their plan will fix those problems.

When I've participated in a discussion, it usually runs like this:
Them: Our current system is broken. We need to pass bill X to fix it.
Me: What will X fix, and how will it fix it?
Them: Didn't you listen? It's really broken!
Me: Okay. What will X fix and how will it fix it?
Them: I'm telling you, it's seriously broken!
Me: Yes, I hear that. Now, can you tell me what X will fix and how it will fix it?
Them: [usually starting to get pretty loud by now] Damn it! What will it take to get it through your head that it's broken? Are you a complete idiot?

This seems to go on about as long as I'm willing to participate. I've yet to hear a single advocate of a single system actually answer a single real question about what it will fix, how it will fix it, what costs will be reduced, how much they will be reduced, etc. No matter how often you ask, even about relatively simple details of any proposed program, nobody seems to have a single answer about what they think it will accomplish, not to mention providing any reason for me to believe that it really will accomplish that.

If you'll forgive a (mildly) medically-oriented metaphor, Quora seems to be infected with a similar culture. Questions pre-suppose a given answer (in this case, that resistance to changes in health-care stems from fear rather than things like lack of information). This is certainly far from the only answer that seems to do little beyond echoing back the question, with only straw man arguments to support it, followed by a rather disjointed attempt at denigrating what the author dislikes--without even an attempt at claiming that his "solution" would really fix the supposed problem(s) he cites, not to mention providing anything resembling real evidence of improvement.

So yes, although Quora users undoubtedly do have brains, it certainly appears to me that failure to actually put them to use is somewhere between common and rampant.

Considering questions more specifically about health care: I, for one, am fairly open to the possibility of reforming how healthcare is run in the US. So far, however, I've yet to hear anybody advance a coherent, understandable argument in favor of any proposed plan. Most simply seem convinced that the current system is *so* badly broken that absolutely any change would have to be an improvement. Even the slightest study of history, however, indicates that there is no such thing as a situation so bad that it really can't get worse.

Monday, December 2, 2013

Inheritance, templates, and database models

Some years ago (around 1991, as a matter of fact) I observed that object oriented programming, and C++ in particular, was following a path similar to that followed by databases decades before.

At the time, multiple inheritance had recently been added to C++. I noted that single inheritance corresponded closely to the hierarchical database model, and multiple inheritance corresponded closely to the network model.

As I'd guess most people reading this are well aware, most databases have long since switched to (something at least intended to be similar to) the relational model, and there are formal proofs that the relational model is strictly more powerful than either the hierarchical or the network model. That led to an obvious question: what sort of program organization would correspond to the relational model?

I would posit that C++ templates (especially function templates) correspond closely to the relational model. Rather than being restricted to situations like "X is derived from Y", we can instantiate a template over any type satisfying some relation--though in the case of a template, the relation is basically a set of operations that must be supported by the type rather than a set of values that must be stored in the database.
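
As a trivial sketch of what I mean (nothing from any real code base), a function template like the following will happily work with any type that supplies the one operation it needs--no inheritance relationship required:

    // The only "relation" required of T is that it support operator<.
    template <class T>
    T const &smaller(T const &a, T const &b) {
        return b < a ? b : a;
    }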

The major problem with templates (by themselves) is that we have very little control over how that matching is done. The matching criteria are difficult for most people to understand, and even harder to control. Even at best, the control currently available (e.g., via static_assert) is clumsy to use and hard to read.

Concepts refine that capability considerably. Concepts are similar to defining foreign keys in a database. Right now, what we have is similar to a bunch of tables, with some fields (or sets of fields) in some of those tables used as foreign keys in other tables--but no declarations of what is supposed to be used as a foreign key, and therefore no way for the compiler to (a C++ equivalent of) enforce referential integrity.

As I'm sure most programmers are already well aware, a key (no pun intended) capability of a relational database is defining relationships between tables so the database manager can enforce referential integrity. That is similar to what C++ Concepts add--something at least roughly equivalent to foreign key declarations in a database, defining what relations are intended and should be enforced by the compiler.
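
To make the analogy a bit more concrete, here's a sketch using the syntax concepts eventually got in C++20 (well after this post was written). The requirement is now declared up front for the compiler to enforce, rather than being implied by whatever the template body happens to do--much like declaring a foreign key instead of merely using a column as one:

    #include <concepts>

    // The "foreign key" declaration: any T used with smaller() must be
    // comparable with operator<, and the compiler enforces that.
    template <class T>
    concept LessThanComparable = requires(T a, T b) {
        { a < b } -> std::convertible_to<bool>;
    };

    template <LessThanComparable T>
    T const &smaller(T const &a, T const &b) {
        return b < a ? b : a;
    }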

Now don't get me wrong: I'm not trying to write a sales-pitch for Concepts, or anything like that. Rather the contrary, I think quite a few people have probably done a better job of explaining them and how they improve C++ than I've even attempted. What I find much more interesting is seeing (or maybe just imagining--others may disagree with my ideas) a fairly clear similarity between two types of programming that (I think) many probably see as quite unrelated or perhaps even directly opposed.

Sunday, August 11, 2013

When is a question about a puzzle, and when is it really about programming?

The following question was recently asked on Stack Overflow:
This is the problem: There are two players A and B. Player A starts and chooses a number in the range 1 to 10. Players take turn and in each step add a number in the range 1 to 10. The player who reaches 100 first wins.

In the program, player A is the user and player B is the computer. Besides that the computer must force a win whenever possible. For instance, if the player A reaches 88 at some point, the computer must choose 89, as this is the only way to force a win.

A user then voted to close the question, with the comment that it was about a logic or math problem rather than about programming.

Personally, I disagree with that. In fact, I think it's an unusually good example of a problem for learning to program.

Allow me to digress for a moment. The first step in learning to program is learning to take some algorithm, and transcribe it into the syntax of a chosen programming language. Now, it's certainly true that being able to express an algorithm in a chosen programming language is important. It's not just about syntax either. Just for example, expressing an algorithm well often involves knowing a great deal about the accepted idioms of that language as well. Nonetheless, the whole business of simply writing code is only the last step in a larger, usually much more involved process.

That larger, more involved process deals primarily with finding solutions to problems. Only once we've found a way to solve the problem can we write code in our chosen language to carry out that solution. Now, it's certainly true that in many cases, we want to use an iterative approach to a solution -- we don't expect the first code we write to be the final and ultimate solution to a real problem for real users. Nonetheless, when we write code, we need to write it to do something, and we need to decide on that something before we can write code to do it.

Now, the question already notes that 89 is a winning "position". That immediately leads us to two questions: 1) why is it a winning position? 2) Based on that, can we force a score of 89?

The answer to the first is simple: 89 is a winning position, because from there our opponent can't reach the goal of 100, but no matter what move he makes, in the subsequent move we can reach 100. That is so because 100-89 = 11. The smallest move the opponent can make is 1 and the largest is 10, so no matter what he chooses, the remainder will be between 1 and 10, and we can win.

That leads fairly directly to the next step: we can force a score of 89 if we can reach a score of 78 -- and we can force a score of 78 if we can reach a score of 67 (and so on, counting backwards by 11 until we reach 12 and finally 1).

Therefore, the first player can force a win if and only if he chooses 1 in his first turn. If he chooses anything larger, the second player can choose a number that gives a result of 12, guaranteeing himself the win.
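
In code, that analysis boils down to almost nothing. Here's a sketch (not the asker's code, just an illustration) of how the computer can pick its move: the winning totals are exactly the ones that leave a remainder of 1 when divided by 11 (1, 12, 23, ..., 89, 100), so it simply looks for a move that lands on one of them:

    // Move to the next total congruent to 1 mod 11 whenever one is
    // reachable; otherwise no forcing move exists, so play anything legal.
    int computer_move(int total) {
        for (int n = 1; n <= 10; ++n)
            if ((total + n) % 11 == 1)
                return n;
        return 1;
    }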

This is a large part of why many of the better computer science programs have classically used (more or less) functional languages like Scheme, and emphasized recursion as a basic primitive. It's not really about recursion as a flow control primitive or anything like that. Yes, we've known for years that a for loop, while loop, etc. can be expressed via recursion (and likewise that anything we can do with recursion can also be done via other flow control primitives).

What's important isn't that recursion allows arbitrary flow control. What's important here is the way of thinking about and approaching the problem. Recursion just gives a simple and direct way to approach the problem: instead of solving for 100, if we know that 89 is a winning position we can solve for 89 instead. Note: we could just about as easily think of it mathematically. Instead of thinking of this as a case of recursion, we could think of it as an inductive proof. The steps I've outlined above are essentially the same as those for an inductive proof.
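
If it helps to see that way of thinking spelled out, a recursive sketch might look like the following (again, just an illustration, memoized so each total is only solved once). Asking can_win(0) confirms the analysis above: the player to move at a total of 0 can force a win, and the only first move that does so is 1.

    #include <array>

    // Can the player whose turn it is force a win from this running total?
    // A total is losing for the player to move exactly when every possible
    // move hands the opponent a winning position.
    bool can_win(int total) {
        static std::array<int, 101> memo{};    // 0 = unknown, 1 = win, -1 = loss
        if (100 - total <= 10)                 // we can reach 100 right now
            return true;
        if (memo[total] != 0)
            return memo[total] == 1;
        for (int n = 1; n <= 10; ++n)
            if (!can_win(total + n)) {         // this move leaves the opponent stuck
                memo[total] = 1;
                return true;
            }
        memo[total] = -1;
        return false;
    }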

Now, it's certainly true that this approach isn't suitable for every programming problem. If you're dealing with something like: "why is this text field being computed as 134 pixels wide when it should be 135?", chances are recursion isn't going to enter into your thinking, and might not be a lot of help if it did.

Nonetheless, if you're trying to solve a problem, and can define one step toward a solution, chances are very good that an inductive/recursive approach will be useful in solving the whole problem. It's most obvious in a case like this where the problem and solution are mathematical, so we have a well-defined line from problem to solution, but it can also apply to quite a few other cases as well (including the previous problem about text width, which it's sometimes useful to think of in terms of "some arbitrary string followed by one character", and continue with shorter and shorter strings until we reach only one or two characters).

Thursday, January 24, 2013

Word frequency counting

Eric Battalio recently posted about Jumping into C++. Like many people who've written C++ at times, I found his post quite interesting and entertaining. A fair number of people posted their own ideas about how to accomplish the same task -- some concentrating on minimal source code, others on maximum speed, and so on.

After seeing his follow-up After the Jump, I decided one point could stand being addressed. In his followup, he says: "About reading a file using iostream. Don't. It is slower and difficult to manage for complicated file formats. This application was trivial but I will avoid in the future."

On this subject, I have to disagree. Iostreams do have some shortcomings, and some implementations are also (un-)fairly slow. Nonetheless, they also have some pretty major advantages if (a big if, here) you use them well. I hope he won't feel insulted if I point out that (at least in my opinion) there are better ways to use iostreams than the way he did it.

For those who haven't read his posts, the task he set was quite simple: count the unique words in a file. He decided to do a bit of parsing, however, so that something like "one,two" would not include the comma. Unfortunately, he didn't (IMO) do this very well, so that would be counted as "onetwo" instead of treating "one" and "two" as separate words.

That brings up the obvious question: can we read those correctly, without doing extra parsing to patch things up after the fact? Of course, there's not a whole lot of suspense -- pretty much anybody with an IQ higher than a mosquito's can probably figure out that I wouldn't ask the question in the first place unless the answer was yes.

The way to do this (or the way I'm going to advocate in this post, anyway) depends on part of how iostreams are designed. When you read a string from a stream, any leading whitespace characters in the stream are ignored, then non-whitespace characters are read into the string, and reading stops at the first whitespace.

About now, you're probably thinking: "Yes Jerry, we already got past the 'IQ higher than a mosquito's' question; I already knew that!" What you may not know, or at least may not have seriously considered (though a few of you probably have) is exactly how the iostream determines what is or is not whitespace. The answer to that is a little complex: the iostream has an associated locale. The locale, in turn, contains a ctype facet -- and the ctype facet is what determines classes of characters.

So, to ignore those commas (and hyphens, etc.) we simply need a ctype facet that classifies them as white space, tell the iostream to use a locale containing that ctype facet, and we can just read words. Since it seems reasonable to me that (for example) "Jerry's blog" should be read as two words, not three, I'll classify upper and lower case letters and apostrophes as "alphabetic" and everything else as "white space".
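
A sketch of what that facet might look like (not necessarily identical to the code from the original post, and it assumes a character set where 'a'..'z' and 'A'..'Z' are contiguous, e.g. ASCII):

    #include <iostream>
    #include <locale>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Classify letters and apostrophes as alphabetic and everything else
    // as white space, so operator>> splits words for us.
    class word_ctype : public std::ctype<char> {
        static mask const *make_table() {
            static std::vector<mask> table(table_size, space);
            for (int c = 'a'; c <= 'z'; ++c) table[c] = alpha;
            for (int c = 'A'; c <= 'Z'; ++c) table[c] = alpha;
            table['\''] = alpha;
            return table.data();
        }
    public:
        word_ctype(std::size_t refs = 0)
            : std::ctype<char>(make_table(), false, refs) {}
    };

    int main() {
        std::istringstream input("one,two -- and Jerry's blog, too");
        input.imbue(std::locale(input.getloc(), new word_ctype));

        std::map<std::string, int> counts;
        for (std::string word; input >> word; )
            ++counts[word];

        for (auto const &w : counts)
            std::cout << w.first << ": " << w.second << "\n";
    }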

Using that, we get code that's fairly simple, readable, and as a bonus, even a bit faster (though that wasn't really the intent).

Friday, December 7, 2012

Heaps and heapsort

Like most programmers, quite some time ago I'd read about heaps and the heapsort. Being a dutiful student, I could quote the standard line, saying that a heap was an implicit tree where node N was less than (or greater than, depending on the type of heap) nodes 2N and 2N+1. Like (I think) most, I'd also implemented a heapsort -- but to be honest, most of it really came down to a line-by-line translation from the description in Knuth Volume 3 into the syntax of the programming language I was using at the time (Pascal, the first time around, and later the same again in C).

For years, given that I'd written a heap sort that worked, I considered my knowledge of heaps and the heap sort entirely adequate. To be entirely honest, however, during most of that time I never really fully understood heaps. In particular, I never really understood the 2N and 2N+1 -- why those particular numbers/nodes? What was the significance of 2N and 2N+1? Why not 3N or 1.5N+65 or something? In short, that "implicit tree" had never really gelled for me -- I'd read the words, but never truly understood what they meant or how the "tree" worked.

For some reason a few months ago my lack of understanding on that particular subject bubbled to the top of the heap (I know: puns are the death of real humor) and I decided to implement a heap (and heap sort) entirely on my own, based only on my recollection about 2N and 2N+1. Instead of going to the book and translating syntax, I sat down with a supply of paper and pencils (the most underrated programming tools ever) and re-invented the heap and the algorithms to create and maintain a heap from first principles.

So, what did I do to finally and truly understand a heap? It ultimately came down to one piece of paper:

And from there came true understanding at last. If we arranged the nodes by layers, it "just happened" that the children of every node N were nodes 2N and 2N+1. Better still, after three layers of the tree, it was pretty apparent that as/if I added more to the tree, the same relationships would hold. Only 30 years (or so) after I'd first written a heap sort, I finally understood what I was doing!

I finally also understood why people who understood heaps thought they were cool: since the "links" in the tree are all based on (relatively simple) math, it's a tree you can walk almost any direction you'd care to think about by simply doing the right math. Just for example, in a normal binary tree, the only way to get from a node to its parent is to traverse down the tree, and remember the parent on the way there so when you need it back, you can recall it from wherever you have it stored. Likewise, getting to the sibling of a node requires that you get back to the parent (see above) and then take the other branch downward to get to the sibling.

With a heap, however, both of these are simply trivial math: the children of node N are nodes 2N and 2N+1, so the parent of node M is node M/2 (using truncating/integer math). Likewise, the sibling of node N is either node N+1 or N-1 -- if N is even, the sibling is N+1, and if N is odd, the sibling is N-1.
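
Expressed as code, the whole navigation scheme is just this handful of one-liners (a sketch, using the 1-based numbering discussed below):

    #include <cstddef>

    // Navigation in a 1-based implicit heap laid out by layers:
    //
    //              1
    //          2       3
    //        4   5   6   7
    //       8 9 10 11 ...
    //
    std::size_t left_child(std::size_t n)  { return 2 * n; }
    std::size_t right_child(std::size_t n) { return 2 * n + 1; }
    std::size_t parent(std::size_t n)      { return n / 2; }    // truncating division
    std::size_t sibling(std::size_t n)     { return n % 2 == 0 ? n + 1 : n - 1; }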

So, for any given node, we can walk through the tree up, down, or sideways (literally) with a single, simple math operation. Better still, all the nodes are contiguous in memory, with no space used up on links, so it's also pretty cache-friendly in general.

There was one other point I ran into repeatedly as I worked at understanding what I was doing well enough to produce working code: the math all depends on 1-based indexing. Making it work right with 0-based indexing like most current languages use was not trivial. In fact, I finally decided that for this purpose it was better to maintain the illusion of 1-based addressing for most of the code, and since I was writing this in C++, have a truly trivial class to handle the translation from 1-based to 0-based addressing (i.e., subtract one from every index I used). As simple an operation as that seems like it should be, keeping it in one place still simplified the rest of the code tremendously (and made it much easier to see the relationship between the code and that fundamental 2N/2N+1 relationship that defines the heap). A sketch of that class (not the exact one from the linked code) is shown below.
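
    #include <cstddef>
    #include <vector>

    // Present a 0-based std::vector as if it were 1-based, so the heap
    // code can keep using the 2N/2N+1 math directly.
    template <class T>
    class one_based {
        std::vector<T> &data;
    public:
        explicit one_based(std::vector<T> &d) : data(d) {}
        T &operator[](std::size_t i) { return data[i - 1]; }
        T const &operator[](std::size_t i) const { return data[i - 1]; }
        std::size_t size() const { return data.size(); }
    };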

Anyway, in case anybody cares to look at the resulting code, I've decided that trying to post significant amounts of code directly here is too painful to be worthwhile, so here is a link to it instead. I think this is long enough for now, but sometime soon I'll probably write a post going through that code in more detail, explaining how it works and why I did things the way I did.

Saturday, November 24, 2012

Adventures in monitors

Friday I broke with my usual habit of staying home on the big shopping days, and went to Best Buy. I had some store credit, and was (and should still be) working on a project where I'd get a pretty major benefit from having another monitor attached to my computer. Besides, while my current monitor was quite high-end when it was new (a LaCie 321), it was getting quite old -- something like 7 or 8 years, if I recall correctly. Over the last few years, nearly anytime somebody heard about my ancient monitor, they'd tell me about how much better monitors had gotten since then, so even a really cheap, entry-level monitor would be a huge improvement.

In terms of specs, almost any new monitor should be a pretty serious upgrade. Just for one obvious example, the LaCie is only rated at a 500:1 contrast ratio. Definitely pretty unimpressive compared to the ratios claimed by nearly all new monitors (anymore, the bare minimum seems to be at least a few million to one, and five or even ten million to one isn't particularly hard to find). Granted, those are "dynamic" contrast ratios, but the static ratios are still better than my ancient LaCie's.

After looking around a bit, I found an HP W2071d, a 20" wide-screen monitor, marked down to only $79.95 (from a list price of $149.95, for whatever that's worth). Seemed like a decent enough deal, so I picked one up, brought it home, and connected it up. As usual for me, just about as soon as it was connected up, I pulled out my trusty Eye-one Display 2 and calibrated it so the brightness, contrast, color (etc.) should match up with the LaCie (in case anybody cares: 5200K whitepoint, 110 cd/m2 luminance, gamma of 2.2).

Anyway, let's get to the meat of things: just how big of an improvement did the new monitor bring? Well, let me show you a couple of pictures. The HP is on the left, the LaCie on the right:

Here, the two don't look drastically different -- the background in both is reasonably black (though in person, the background of the LaCie is definitely blacker). The picture on the HP is a bit dark, but not too terrible. But now look what happens when I quit crouching down quite as far, and aim the camera down at the screen at something like a 30 degree angle:

Now, especially toward the bottom right corner, the HP's background isn't even close to black. Despite a higher claimed contrast ratio (600:1 static, 5000000:1 dynamic) the contrast between the picture and the background on the HP is looking pretty minimal. Worse, while the picture looks pretty much the same from either angle on the LaCie, on the HP it's gone from a bit dark and shadowy to completely washed out -- to the point that the stripes on the sweater have almost completely disappeared!

My conclusion: both the specifications and people claiming current entry-level monitors have surpassed high-end monitors from yesteryear are basically full of crap1. The new monitor undoubtedly works for spreadsheets or word processing, but the LaCie retains a significant edge in contrast, color accuracy and ability to view from different angles.

1 No, I'm not accusing HP of lying in the specifications -- I'm sure under the right circumstances, it meets its specification. Nonetheless, even after years of regular use, the blacks on the LaCie look deep and rich, while those of the brand new HP look weak and wimpy at best.