
How to accurately count words in a Memo?


I'm trying to make a Notepad clone, with the added feature of a running word count in the status bar. The word count is inaccurate, counting repeated spaces or carriage returns as new words. Here's what I've tried for the word count feature:

procedure TFormMain.Memo1Change(Sender: TObject);
var
  wordSeparatorSet: Set of Char;
  count: integer;
  i: integer;
  s: string;
  inWord: Boolean;
begin
  wordSeparatorSet := [#13, #32]; // CR, space
  count := 0;
  s := Memo1.Text;
  inWord := False;

  for i := 1 to Length(s) do
  begin
    // if the char is a CR or space, you're at the end of a word; increase the count
    if (s[i] in wordSeparatorSet) and (inWord=True) then
    begin
      Inc(count);
      inWord := False;
    end
    else
    // the char is not a delimiter, so you're in a word
    begin
      inWord := True;
    end;
  end;
  // OK, all done counting. If you're still inside a word, don't forget to count it too
  if inWord then
    Inc(count);

  StatusBar1.Panels[0].Text := 'Words: ' + IntToStr(count);
end;

Of course, I'm open to any alternatives or improvements. I really don't understand why this code increases the word count (count) with every space and carriage return. I would think after the user hits the space bar (incrementing count), the variable inWord should now be False, so if (s[i] in wordSeparatorSet) and (inWord=True) should resolve to False if the user hits the space bar or Enter key a second time. But that's not what happens.


Solution

  • I really don't understand why this code increases the word count (count) with every space and carriage return.

    At the first space after a word, you do indeed set inWord to False. But if the next character is also a space, the condition is false (because inWord is now False), so the else branch runs and (erroneously) sets inWord := True. If the next (third) character is again a space, the condition is true once more and you (erroneously) do Inc(count).

    You can also notice that the negation of (s[i] in wordSeparatorSet) and (inWord=True) does NOT imply that "the char is not a delimiter" because of the conjunction with inWord. The negation of (s[i] in wordSeparatorSet) and (inWord=True) is, by De Morgan, not (s[i] in wordSeparatorSet) or not (inWord=True), which is NOT the same thing as not (s[i] in wordSeparatorSet).
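
    To see the effect concretely, here is a small console sketch that lifts the question's loop into a standalone function. The names BuggyCountDemo and BuggyWordCount are made up for this illustration; the loop logic itself is meant to be the same as in Memo1Change.

    ```pascal
    program BuggyCountDemo;

    {$APPTYPE CONSOLE}

    // Hypothetical helper: the loop from the question, moved out of Memo1Change.
    function BuggyWordCount(const s: string): Integer;
    var
      i: Integer;
      inWord: Boolean;
    begin
      Result := 0;
      inWord := False;
      for i := 1 to Length(s) do
        if (s[i] in [#13, #32]) and inWord then
        begin
          Inc(Result);
          inWord := False;
        end
        else
          // Reached for non-separators, but ALSO for a separator
          // when inWord is False. That second case is the bug.
          inWord := True;
      if inWord then
        Inc(Result);
    end;

    begin
      Writeln(BuggyWordCount('a b'));   // 2 (correct)
      Writeln(BuggyWordCount('a   b')); // 3, although there are only two words
    end.
    ```

    With three spaces, the second space flips inWord back to True, so the third space is counted as the end of a word that never existed. In a Unicode Delphi, the "in" test on a Char may also produce a compiler warning; CharInSet from SysUtils is the usual way to avoid it.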

    A fixed version would look more like

    function WordCount(const AText: string): Integer;
    var
      InWord: Boolean;
      i: Integer;
    begin
      Result := 0;
      InWord := False;
      for i := 1 to Length(AText) do
        if InWord then
        begin
          if IsWordSep(AText[i]) then
            InWord := False;
        end
        else
        begin
          if not IsWordSep(AText[i]) then
          begin
            InWord := True;
            Inc(Result);
          end;
        end;
    end;
    

    where IsWordSep(chr) is defined as something like chr.IsWhiteSpace. There are many subtleties, though, as I discuss at length on my web site.
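
    As a minimal sketch of how the pieces could fit together, the program below defines IsWordSep purely in terms of whitespace and runs WordCount on a few strings. It assumes a Delphi version in which the System.Character unit provides the Char helper with IsWhiteSpace, and it assumes that a "word" is simply any run of non-whitespace characters, which ignores the subtleties mentioned above (punctuation, hyphens, and so on). The program name WordCountDemo is made up for the example.

    ```pascal
    program WordCountDemo;

    {$APPTYPE CONSOLE}

    uses
      System.Character; // Char helper with IsWhiteSpace (assumed available)

    // Assumption: only whitespace (space, tab, CR, LF, ...) separates words.
    function IsWordSep(c: Char): Boolean;
    begin
      Result := c.IsWhiteSpace;
    end;

    // Counts word starts: a transition from "outside a word" to a non-separator.
    function WordCount(const AText: string): Integer;
    var
      InWord: Boolean;
      i: Integer;
    begin
      Result := 0;
      InWord := False;
      for i := 1 to Length(AText) do
        if InWord then
        begin
          if IsWordSep(AText[i]) then
            InWord := False;
        end
        else if not IsWordSep(AText[i]) then
        begin
          InWord := True;
          Inc(Result);
        end;
    end;

    begin
      Writeln(WordCount('a   b'));              // 2
      Writeln(WordCount('  one'#13#10'two  ')); // 2
      Writeln(WordCount(''));                   // 0
    end.
    ```

    In the form from the question, the body of Memo1Change would then reduce to a single line: StatusBar1.Panels[0].Text := 'Words: ' + IntToStr(WordCount(Memo1.Text));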