Does anyone know how to or have some code on counting the number of unique phrases in a document? (Single word, two word phrases, three word phrases).
Thanks
Example of what I'm looking for:
What I mean is I have a text document, and i need to see what the most popular word phrases are. Example text
I took the car to the car wash.
I : 1 took : 1 the : 2 car: 2 to : 1 wash : 1 I took : 1 took the : 1 the car : 2 car to : 1 to the : 1 car wash : 1 I took the : 1 took the car : 1 the car to : 1 car to the : 1 to the car : 1 the car wash : 1 I took the car to : 1 took the car to the : 1 the car to the car : 1 car to the car wash : 1
I need the phrase, and the count that it shows up.
Any help would be appreciated. The closet thing I found to this was a PHP script from http://tools.seobook.com/general/keyword-density/source.php
I used to have some code for this, but I cannot find it.
Here is some initial code that solves your problem.
function CountWordSequences(const s:string; Counts:TStrings = nil):TStrings;
var
words, seqs : TStrings;
nw,i,j:integer;
t :string;
begin
if Counts=nil then Counts:=TStringList.Create;
words:=TStringList.Create; // build a list of all words
words.DelimitedText:=s;
seqs:=TStringList.Create;
for nw:=1 to words.Count do // build a list of all word sequences
begin
for i:=0 to words.Count-nw do
begin
t:='';
for j:=0 to nw-1 do
begin
t:=t+words[i+j];
if j<>nw-1 then t:=t+' ';
end;
seqs.Add(t);
end;
end;
words.Destroy;
for i:=0 to seqs.Count-1 do // count repeated sequences
begin
j:=Counts.IndexOf(seqs.Strings[i]);
if j=-1 then
Counts.AddObject(seqs.Strings[i],TObject(1))
else
Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
end;
seqs.Destroy;
result:=Counts;
end;
You will need to elaborate this code for real world production, for example, by recognizing more word delimiters (not only blanks), and by implementing some sort of case insensitivity.
To test it, put a Button, an EntryField and a Memo in a Form, and add the following code.
procedure TForm1.Button1Click(Sender: TObject);
var i:integer; l:TStrings;
begin
l:=CountWordSequences(edit1.Text,TStringList.Create);
for i:=1 to l.count do
memo1.Lines.Add('"'+l.Strings[i-1]+'": '+inttostr(Integer(l.Objects[i-1])));
end;
I first try with I took the car to the car wash
gives
"I": 1
"took": 1
"the": 2
"car": 2
"to": 1
"wash.": 1
"I took": 1
"took the": 1
"the car": 2
"car to": 1
"to the": 1
"car wash.": 1
"I took the": 1
"took the car": 1
"the car to": 1
"car to the": 1
"to the car": 1
"the car wash.": 1
"I took the car": 1
"took the car to": 1
"the car to the": 1
"car to the car": 1
"to the car wash.": 1
"I took the car to": 1
"took the car to the": 1
"the car to the car": 1
"car to the car wash.": 1
"I took the car to the": 1
"took the car to the car": 1
"the car to the car wash.": 1
"I took the car to the car": 1
"took the car to the car wash.": 1
"I took the car to the car wash.": 1