How Many Words in That Text?

One of the projects I have used for years is a letter counter program. The idea is to count the occurrences of each individual letter. It’s a nice project that includes arrays, loops, and some string manipulation. This is the sort of thing that does have some real world utility. Cryptanalysts use letter counts to try to crack substitution ciphers. Linguists use them to study languages. And those are just two uses that come to mind.
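For anyone who hasn't seen the project, here is a minimal sketch of a letter counter in Python. The function name `count_letters` and the choice to ignore case and anything outside a-z are my assumptions for illustration; a classroom version could reasonably make different choices.

```python
def count_letters(text):
    """Return a list of 26 counts, one per letter a-z, ignoring case."""
    counts = [0] * 26  # index 0 is 'a', index 25 is 'z'
    for ch in text.lower():
        if "a" <= ch <= "z":  # skip digits, punctuation, and white space
            counts[ord(ch) - ord("a")] += 1
    return counts


if __name__ == "__main__":
    counts = count_letters("Hello, World!")
    for i, n in enumerate(counts):
        if n > 0:
            print(chr(ord("a") + i), n)
```

The array of 26 counters indexed by letter is the part that makes this a good exercise: it uses an array, a loop, and a little character arithmetic all at once.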

The next logical (to me anyway) step is to count words. I’ve been thinking about adding this for a while. It is actually something I was assigned as a project many years ago when I was an undergraduate. It’s not as simple as counting letters. The most obvious method involves counting spaces. But what happens if someone is old school and places two spaces after every period? Well, that is something to take into consideration. And what about other white space like tabs or line feeds? Or special characters?
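To make the problem concrete, here is a small sketch comparing the counting-spaces idea with splitting on any run of white space. The function names and the sample sentence are made up for illustration, and neither function is meant as the "right" answer.

```python
def count_by_spaces(text):
    """Naive count: assume every single space separates two words."""
    if not text:
        return 0
    return text.count(" ") + 1


def count_by_whitespace(text):
    """Split on runs of any white space (spaces, tabs, line feeds)."""
    # str.split() with no arguments treats consecutive white space
    # as one separator and ignores leading/trailing white space.
    return len(text.split())


if __name__ == "__main__":
    sample = "Old school.  Two spaces.\tA tab\nand a line feed."
    print(count_by_spaces(sample))      # 9
    print(count_by_whitespace(sample))  # 10
```

The sample has ten words. The naive version gets nine because the double space adds a phantom word while the tab and the line feed each hide a real one. The errors happen to partly cancel here, which is arguably worse than being obviously wrong.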

Doug Peterson related in a recent post (About words) that two different programs gave him two different word counts for the same piece of text. The counts were off by 3 on a text of about 486 words. Not a huge percentage, but on a book-length text that could make a difference. Some magazine articles are paid by the word, so getting the count right means money.
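I don't know what rules the programs Doug used actually follow, but it is easy to see how reasonable definitions of a "word" can disagree. The sketch below applies three hypothetical rules to the same sentence; the function names, the regular expressions, and the sample text are all my own assumptions.

```python
import re


def words_by_whitespace(text):
    """Anything separated by white space is a word, even a lone dash."""
    return text.split()


def words_letters_only(text):
    """Every unbroken run of letters is a word."""
    return re.findall(r"[A-Za-z]+", text)


def words_with_joiners(text):
    """Runs of letters, allowing single hyphens or apostrophes inside."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)


if __name__ == "__main__":
    sample = "A well-known state-of-the-art word counter isn't perfect - not yet."
    for rule in (words_by_whitespace, words_letters_only, words_with_joiners):
        print(rule.__name__, len(rule(sample)))
    # words_by_whitespace 10
    # words_letters_only 14
    # words_with_joiners 9
```

One short sentence, three answers. Hyphenated words, contractions, and stray punctuation are exactly the places where two programs could drift a few words apart over several hundred words of text.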

Now, people can count words with greater accuracy, though I don’t want to do it myself. At some point someone is going to feed a lot of data into some artificial intelligence. Long sections of text that have accurate (human counted perhaps) word counts will be fed in and the AI will learn what words are and how to count them. It’s not going to happen until someone decides that developing this is worth the time and money. I wonder if it will be an academic or an industry researcher?

For the time being I think this will make an interesting conversation in class. Maybe we’ll have a contest to see who can come up with the most accurate algorithm?
