Thursday, January 24, 2013

Word frequency counting

Eric Battalio recently posted about Jumping into C++. Like many people who've written C++ at times, I found his post quite interesting and entertaining. A fair number of people posted their own ideas about how to accomplish the same task -- some concentrating on minimal source code, others on maximum speed, and so on.

After seeing his follow-up After the Jump, I decided one point could stand being addressed. In his followup, he says: "About reading a file using iostream. Don't. It is slower and difficult to manage for complicated file formats. This application was trivial but I will avoid in the future."

On this subject, I have to disagree. Iostreams do have some shortcomings, and some implementations are also (un-)fairly slow. Nonetheless, they also have some pretty major advantages if (a big if, here) you use them well. I hope he won't feel insulted if I point out that (at least in my opinion) there are better ways to use iostreams than the way he did it.

For those who haven't read his posts, the task he set was quite simple: count the unique words in a file. He decided to do a bit of parsing, however, so that something like "one,two" would not include the comma. Unfortunately, he didn't (IMO) do this very well, so that would be counted as "onetwo" instead of treating "one" and "two" as separate words.

That brings up the obvious question: can we read those correctly, without doing extra parsing to patch things up after the fact? Of course, there's not a whole lot of suspense -- pretty much anybody with an IQ higher than a mosquito's can probably figure out that I wouldn't ask the question in the first place unless the answer was yes.

The way to do this (or the way I'm going to advocate in this post, anyway) depends on part of how iostreams are designed. When you read a string from a stream, any leading whitespace characters in the stream are ignored, then non-whitespace characters are read into the string, and reading stops at the first whitespace.

About now, you're probably thinking: "Yes Jerry, we already got past the 'IQ higher than a mosquito's' question; I already knew that!" What you may not know, or at least may not have seriously considered (though a few of you probably have) is exactly how the iostream determines what is or is not whitespace. The answer to that is a little complex: the iostream has an associated locale. The locale, in turn, contains a ctype facet -- and the ctype facet is what determines classes of characters.

So, to ignore those commas (and hyphens, etc.) we simply need a ctype facet that classifies them as white space, tell the iostream to use a locale containing that ctype facet, and we can just read words. Since it seems reasonable to me that (for example) "Jerry's blog" should be read as two words, not three, I'll classify upper and lower case letters and apostrophes as "alphabetic" and everything else as "white space".

Using that, the code we get code that's fairly simple, readable, and as a bonus, even a bit faster (though that wasn't really the intent).

No comments:

Post a Comment