I have a function that counts the frequencies of trigrams in text. No knowledge of computational linguistics is required; I just need help with the Perl code.
This is the function:
sub extract_frequencies {
    for( my $i=0; $i<=$#tag; $i++ ) {
        $wordtagfreq{"$word[$i]\t$tag[$i]"}++;
        $tagfreq{$tag[$i]}++;
    }
    # count tag trigram frequencies
    my @start = ("<s>","<s>");
    unshift @tag, @start; # corrected
    push @tag, "<s>";
    for( my $i=2; $i<=$#tag; $i++ ) {
        $ngramfreq[3]{"$tag[$i-2]\t$tag[$i-1]\t$tag[$i]"}++;
    }
}
The particular points in the code that I do not understand are the following:
1) $ngramfreq[3]
What does the index on the hash mean here? Do I count for each tag separately? Is it the length of the key? What is my final key (three different tag keys)?
2) $i<=$#tag
What does $# mean in Perl?
Haven't used Perl in a while, so I hope some Perl Monks will help me.
An index such as [0] (or [3] in your code) is an array index; it has nothing to do with a hash. This implies that @ngramfreq is actually an array of hashes:
my @ngramfreq = (
    { tag => 1, fish => 3 },
    { anothertag => 4 }
);
Thus $ngramfreq[0] gets you the first anonymous hash, and $ngramfreq[0]{tag} then accesses the value stored under the key tag.
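To tie this back to the code in the question, here is a minimal sketch of how $ngramfreq[3] fills up. The tag sequence (DET, NOUN, VERB) is made up for illustration and is not from the original data. The index 3 looks like nothing more than the n-gram length used as the array slot, so slots 0 to 2 simply stay undefined. Each key is one string made of three tags joined by tabs, not three separate keys.

```perl
use strict;
use warnings;

# Hypothetical tag sequence, padded the same way as in the question.
my @tag = ('<s>', '<s>', 'DET', 'NOUN', 'VERB', '<s>');

my @ngramfreq;    # array whose slot 3 will hold a hash reference

for my $i (2 .. $#tag) {
    # One key per trigram: three consecutive tags joined by tabs.
    my $key = join "\t", @tag[ $i - 2 .. $i ];
    $ngramfreq[3]{$key}++;    # autovivifies the hash in slot 3 on first use
}

# Print every trigram with its count (tabs shown as spaces for readability).
for my $key (sort keys %{ $ngramfreq[3] }) {
    (my $shown = $key) =~ s/\t/ /g;
    print "$shown => $ngramfreq[3]{$key}\n";
}
```

With this input there are four distinct trigrams, each counted once. If the same code also counted bigrams, it would presumably store them under $ngramfreq[2], but that is an assumption about the author's intent.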
$#tag is the last index in the array @tag. So with 3 elements it would be 2, because the array indices are 0, 1, 2.
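A small sketch of the difference between the last index and the element count, using made-up sample tags. It also shows that the C-style loop from the question visits the same indices as a range loop over 0 .. $#tag.

```perl
use strict;
use warnings;

my @tag = ('DET', 'NOUN', 'VERB');    # hypothetical sample tags

my $last_index = $#tag;          # index of the last element
my $count      = scalar @tag;    # number of elements, always $#tag + 1

print "last index: $last_index, count: $count\n";

# The C-style loop from the question and a range loop visit the same indices.
my @c_style;
for ( my $i = 0; $i <= $#tag; $i++ ) { push @c_style, $i }
my @range = ( 0 .. $#tag );

# For an empty array $#array is -1, so neither kind of loop runs at all.
my @empty;
my $empty_last = $#empty;
```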
Data::Dumper is a good way of visualising a structure, to give you an idea of how it's laid out.
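For instance, a minimal sketch that dumps the example array of hashes from above. Sortkeys and Indent are optional Data::Dumper settings, used here only to make the output stable and compact.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # print hash keys in sorted order
$Data::Dumper::Indent   = 1;    # use a compact indentation style

my @ngramfreq = (
    { tag => 1, fish => 3 },
    { anothertag => 4 },
);

# Pass a reference so the whole array is dumped as one nested structure.
my $dump = Dumper( \@ngramfreq );
print $dump;
```

The output shows square brackets for the outer array and curly braces for each inner hash, which makes the nesting easy to see.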
perldoc perldsc is worth a read, as it expands on data structures.