Creating a Word Iterator

By James M. Curran

Originally published in The C/C++ Users Journal, Aug 1998
Copyright © 1998, Miller Freeman, Inc. >This is as it was originally proposed. The final editted version was slightly different

One of the interesting effects of the introduction of the Standard Template Library in the C++ Standard is that it makes us reconsider what are collections and what is being collected. For example, the common view of a string is that it is a collection of characters, and the Standard library includes an iterator to step through a string one character at a time. However, at a higher level, many applications would consider a string holding a sentence, to be a collection of words. What we need is an iterator to step through a string one word at a time. I designed WordIter for just that purpose.

The first step was to decide what kind of iterator I’d want it to be. The standard defines several different types, based on their powers and limitations. The decision here didn’t take very long. I saw that the vast majority of what I wanted to with the iterator could be done just be reading the string, and that trying to modify the string via the iterator would be more of a hassle then it was worth. So, I decided to disallow assignment (*Iter on the left side of equals). In the parlance of the Standard, this is called a “Input Iterator”.

This then defines the list of member functions we must include to have a proper iterator, namely, operator* (on the right hand side), operator++, and operator==. (WordIter.h, listing 1) operator-> should also be on that list, except I could find no meaningful way of representing it. It would have to return the address within the container of the string representing the current word. However, no such string exists as the container itself is a single string. I could only return a copy of the current word, packaged as a string, but that is what operator* is for, and therefore, an function that would accept a copy, would be using operator* instead of operator->. I have not found a case that needed this missing function.

We must also announce to the compiler that this is a Input Iterator. The Standard handles this through the iterator<> template, from which WordIter is derived. The main reason for this, is so if a library has several implementations of an algorithm, a generic one which uses Input Iterators and optimized one which needs Random Access Iterators for example, the compiler can choose the correct overload. Iterator<> also defines some typedefs used in the standard algorithms.

The implementation of the class is fairly straightforward. Pointers to constant chars point to the beginning and end of the current word. Since no resources are allocated, the default destructor and copy constructor will be sufficient. The heavy lifting is done by the private member function findword(), which steps through the string, setting the beginning and ending pointers. (WordIter.cpp, Listing 2). It is in this function that our concept of a “word” is defined. The findword() given here considers any series of non-whitespace character between whitespaces characters to be a word. This was functional for my purposes, but you may want to reconsider this is if you use WordIter in you work, as you considers punctuation as part of the word.

Two of the member functions do require special note. Operator* returns a string which is a copy of the current word. It makes use a little known constructor of the string class, which takes a char* and a integer length to create a string exactly that size.

Operator== requires extra thought, since we must consider what it means to have two WordIters equal. Must they point to the same spot in the same string, or can they be separate (but equal) strings? Actually, the answer to that question doesn’t matter that much, since the question operator== is usually used to ask is "Have we reached the end of the container?" This mean it’s first duty is the return true if one of it operands is the beyond-the-end iterator. Which mean we must first define one of those.

Normally, the container provides such an iterator value, with it’s end() member function. However, in this case, our container is a string, and while string does have a end() function, it returns a value of type char*. We need an object of type WordIter so that the beginning and ending iterator matched. So I had to define one of my own.

The convention used for occasions like this is to have the default constructor of an iterator equal a universal beyond-the-end value, so that any uninitialized iterator can be used for the purpose. Normally, this is done by call the ctor directly, creating an unnamed temporary (for_each(wi, WordIter(), PrintString); ) However, I also wanted to make sure that WordIter created from the pointer returned by string::end() would function correctly.

I decided that, in general, a beyond-the-end iterator would “point” to a byte of zero (as would a string’s beyond-the-end iterator), and for the generic one, I had it point to a particular static byte stored as a class private constant member. Then, in the actual operator function, I first check if the inital character of each word pointed to is the same. This would be true for both iterators which have reached the end, and for those that were equal by either of criteria given above. That quickly gets rid of most comparisons that will test false. Next, it does specific checks for the zero byte or if the two start pointers are the same. This implements the first choice above, chosen mainly because it would have a faster execution then the second option, which could be achieved by replacing the if()s with "if (strcmp(s, rhs.s) ==0)"

I also added a public static WordIter named EOS, which is equivalent to a beyond-to-end iterator, and can be used in it’s place (while ( wi != WordIter::EOS)). This prevents the need of constructing a new empty iterator each time a loop. Of course, this is needed only for the pedantic programmer, as the default constructor is trivial, but before I became a C programmer, I wrote in Assembler, where every byte is sacred.

Despite this only implementing the most limited of the iterator types, it can still be used with a large number of the Standard Library algorithms, such as copy(), count(), find(), for_each() and transform(). In addition, I’ve also found some standard functions, such as max_element() and min_element() which "officially" require a (more powerfully) Forward Iterator, but apparently do not actually use that extra power, and will work with an Input Iterator like WordIter. (See WordIterTest.cpp, Listing 3)

In cases where you really want to modify the structure of the sentence, it's best to use WordIter to copy the words into a more robust container. The Standard Library allows you to do this in one line using copy() and a back_inserter() as shown toward the end of Listing 3, so there is little need to enhance WordIter with the ability to alter the original string. (I have therefore left that "as an exercise for the student").

Note also that, like any Standard Library iterator, any change to the container into which an iterator points (in this case, the string holding the full sentence) invalidated the iterator, and any use of the iterator afterwards is undefined behavior.

For those using MFC, conversion from using strings to CStrings is a trivial matter, most of which can be done with your text editor's "replace all" command. The only manual editing needed is in the "create from string" ctor.