Search code examples
stringdelphitokenizedelphi-10.4-sydney

Extract string-token objects from string?


Does Delphi (10.4) have a string-tokenizer that extracts string-token-objects from a string in a similar way as below?

MyPhrase := 'I have a simple word and a complex Word: A lot of WORDS.';

MyTokens := MyTokenize(MyPhrase, 'word');

for i := 0 to MyTokens.Count - 1 do
  Memo1.Lines.Add(IntToStr(MyTokens[i].Pos) + ': ' + MyTokens[i].String);

Gives this result in Memo1:

16: word  
35: Word  
50: WORD

Searching for "tokenize string" in the Delphi documentation did not get any useful results for this purpose.

Of course, writing such a function is trivial, but I wonder if there already is a procedure for this in the existing huge Delphi code treasure.

EDIT: I am experimenting with a wordlist that should have the required features:

program MyTokenize;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  CodeSiteLogging,
  System.RegularExpressions,
  System.Types,
  System.Classes,
  System.StrUtils,
  System.SysUtils;

type
  PWordRec = ^TWordRec;

  TWordRec = record
    WordStr: string;
    WordPos: Integer;
  end;

  TWordList = class(TList)
  private
    function Get(Index: Integer): PWordRec;
  public
    destructor Destroy; override;
    function Add(Value: PWordRec): Integer;
    property Items[Index: Integer]: PWordRec read Get; default;
  end;

function TWordList.Add(Value: PWordRec): Integer;
begin
  Result := inherited Add(Value);
end;

destructor TWordList.Destroy;
var
  i: Integer;
begin
  for i := 0 to Count - 1 do
    FreeMem(Items[i]);
  inherited;
end;

function TWordList.Get(Index: Integer): PWordRec;
begin
  Result := PWordRec(inherited Get(Index));
end;

var
  WordList: TWordList;
  WordRec: PWordRec;
  i: Integer;

begin
  try
    //MyPhrase := 'A crossword contains words but not WORD';

    WordList := TWordList.Create;
    try
      // AV only at the THIRD loop!!!
      for i := 0 to 2 do
      begin
        GetMem(WordRec, SizeOf(TWordRec));
        WordRec.WordPos := i;
        WordRec.WordStr := IntToStr(i);
        WordList.Add(WordRec);
      end;

      for i := 0 to WordList.Count - 1 do
        Writeln('WordPos: ', WordList[i].WordPos, ' WordStr: ', WordList[i].WordStr);

      WriteLn('  Press Enter to free the list');
      ReadLn;
    finally
      WordList.Free;
    end;

  except
    on E: Exception do
    begin
      Writeln(E.ClassName, ': ', E.Message);
      ReadLn;
    end;
  end;
end.

Unfortunately, it has a strange bug: It gets an AV exactly at the THIRD for loop!

EDIT2: It seems that the AV happens only when the project's Build Configuration is set to Debug. When the project's Build Configuration is set to Release then there is no AV. Has this to do with the MemoryManager?


Solution

  • On request, here is how I would do this myself:

    Screenshot of application

    First, I want to create a function that performs this operation, so it can be reused every time we need to do this.

    I could have this function return or populate a TList<TWordRec>, but then it would be tiresome to work with it, because the user of the function would then need to add try..finally blocks every time the function is used. Instead, I let it return a TArray<TWordRec>. By definition, this is simply array of TWordRec, that is, a dynamic array of TWordRecs.

    But how to efficiently populate such an array? We all know you shouldn't increase the length of a dynamic array one element at a time; besides, that requires a lot of code. Instead, I populate a local TList<TWordRec> and then, as a last step, create an array from it:

    type
      TPhraseMatch = record
        Position: Integer;
        Text: string;
      end;
    
    function GetPhraseMatches(const AText, APhrase: string): TArray<TPhraseMatch>;
    begin
    
      var TextLower := AText.ToLower;
      var PhraseLower := APhrase.ToLower;
    
      var List := TList<TPhraseMatch>.Create;
      try
    
        var p := 0;
        repeat
          p := Pos(PhraseLower, TextLower, p + 1);
          if p <> 0 then
          begin
            var Match: TPhraseMatch;
            Match.Position := p - 1 {since the OP wants 0-based string indexing};
            Match.Text := Copy(AText, p, APhrase.Length);
            List.Add(Match);
          end;
        until p = 0;
    
        Result := List.ToArray;
    
      finally
        List.Free;
      end;
    
    end;
    

    Notice that I chose an alternative to the regular expression approach, just for educational reasons. I also believe this approach is faster. Also notice how easy it is to work with the TList<TWordRec>: it's just like a TStringList but with word records instead of strings!

    Now, let's use this function:

    procedure TWordFinderForm.ePhraseChange(Sender: TObject);
    begin
    
      lbMatches.Items.BeginUpdate;
      try
        lbMatches.Items.Clear;
        for var Match in GetPhraseMatches(mText.Text, ePhrase.Text) do
          lbMatches.Items.Add(Match.Position.ToString + ':'#32 + Match.Text)
      finally
        lbMatches.Items.EndUpdate;
      end;
    
    end;
    

    Had I not chosen to use a function, but placed all code in one block, I could have iterated over the TList<TWordRec> in exactly the same way:

    for var Match in List do