Search code examples
delphidelphi-2009phrasephrases

Delphi Phrase Count / Keyword Density


Does anyone know how to or have some code on counting the number of unique phrases in a document? (Single word, two word phrases, three word phrases).

Thanks

Example of what I'm looking for: What I mean is I have a text document, and i need to see what the most popular word phrases are. Example text

I took the car to the car wash.

I : 1
took : 1
the : 2
car: 2
to : 1
wash : 1
I took : 1
took the : 1
the car : 2
car to : 1
to the : 1
car wash : 1
I took the : 1
took the car : 1
the car to : 1
car to the : 1
to the car : 1
the car wash : 1
I took the car to : 1
took the car to the : 1
the car to the car : 1
car to the car wash : 1

I need the phrase, and the count that it shows up.

Any help would be appreciated. The closet thing I found to this was a PHP script from http://tools.seobook.com/general/keyword-density/source.php

I used to have some code for this, but I cannot find it.


Solution

  • Here is some initial code that solves your problem.

    function CountWordSequences(const s:string; Counts:TStrings = nil):TStrings;
    var
      words, seqs : TStrings;
      nw,i,j:integer;
      t :string;
    begin
      if Counts=nil then Counts:=TStringList.Create;
      words:=TStringList.Create;        // build a list of all words
      words.DelimitedText:=s;
      seqs:=TStringList.Create;
      for nw:=1 to words.Count do       // build a list of all word sequences
       begin
        for i:=0 to words.Count-nw do
         begin
          t:='';
          for j:=0 to nw-1 do
           begin
            t:=t+words[i+j];
            if j<>nw-1 then t:=t+' ';
           end;
          seqs.Add(t);
         end;
       end;
      words.Destroy;
      for i:=0 to seqs.Count-1 do         // count repeated sequences
       begin
        j:=Counts.IndexOf(seqs.Strings[i]);
        if j=-1 then
          Counts.AddObject(seqs.Strings[i],TObject(1))
        else
          Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
       end;
      seqs.Destroy;
      result:=Counts;
    end;
    

    You will need to elaborate this code for real world production, for example, by recognizing more word delimiters (not only blanks), and by implementing some sort of case insensitivity.

    To test it, put a Button, an EntryField and a Memo in a Form, and add the following code.

    procedure TForm1.Button1Click(Sender: TObject);
    var i:integer; l:TStrings;
     begin
      l:=CountWordSequences(edit1.Text,TStringList.Create);
      for i:=1 to l.count do
        memo1.Lines.Add('"'+l.Strings[i-1]+'": '+inttostr(Integer(l.Objects[i-1])));
     end;
    

    I first try with I took the car to the car wash

    gives

    "I": 1
    "took": 1
    "the": 2
    "car": 2
    "to": 1
    "wash.": 1
    "I took": 1
    "took the": 1
    "the car": 2
    "car to": 1
    "to the": 1
    "car wash.": 1
    "I took the": 1
    "took the car": 1
    "the car to": 1
    "car to the": 1
    "to the car": 1
    "the car wash.": 1
    "I took the car": 1
    "took the car to": 1
    "the car to the": 1
    "car to the car": 1
    "to the car wash.": 1
    "I took the car to": 1
    "took the car to the": 1
    "the car to the car": 1
    "car to the car wash.": 1
    "I took the car to the": 1
    "took the car to the car": 1
    "the car to the car wash.": 1
    "I took the car to the car": 1
    "took the car to the car wash.": 1
    "I took the car to the car wash.": 1