This website offers the "Schinke Latin stemming algorithm" for download, for use in the Snowball stemming system.
I want to use this algorithm, but I don't want to use Snowball.
The good thing: there's some pseudocode on that page which can be translated into a PHP function. This is what I've tried:
<?php
function stemLatin($word) {
// output = array(NOUN-BASED STEM, VERB-BASED STEM)
// DEFINE CLASSES BEGIN
$queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque');
$suffixesA = array('ibus, 'ius, 'ae, 'am, 'as, 'em', 'es', ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u');
$suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't');
// DEFINE CLASSES END
$word = strtolower(trim($word)); // make string lowercase + remove white spaces before and behind
$word = str_replace('j', 'i', $word); // replace all <j> by <i>
$word = str_replace('v', 'u', $word); // replace all <v> by <u>
if (substr($word, -3) == 'que') { // if word ends with -que
if (in_array($word, $queWords)) { // if word is a queWord
return array($word, $word); // output queWord as both noun-based and verb-based stem
}
else {
$word = substr($word, 0, -3); // remove the -que
}
}
foreach ($suffixesA as $suffixA) { // remove suffixes for noun-based forms (list A)
if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
$word = substr($word, 0, -strlen($suffixA)); // remove the suffix
break; // remove only one suffix
}
}
if (strlen($word) >= 2) { $nounBased = $word; } else { $nounBased = ''; } // add only if word contains two or more characters
foreach ($suffixesB as $suffixB) { // remove suffixes for verb-based forms (list B)
if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
switch ($suffixB) {
case 'iuntur', 'erunt', 'untur', 'iunt', 'unt': $word = substr($word, 0, -strlen($suffixB)).'i'; break; // replace suffix by <i>
case 'beris', 'bor', 'bo': $word = substr($word, 0, -strlen($suffixB)).'bi'; break; // replace suffix by <bi>
case 'ero': $word = substr($word, 0, -strlen($suffixB)).'eri'; break; // replace suffix by <eri>
default: $word = substr($word, 0, -strlen($suffixB)); break; // remove the suffix
}
break; // remove only one suffix
}
}
if (strlen($word) >= 2) { $verbBased = $word; } else { $verbBased = ''; } // add only if word contains two or more characters
return array($nounBased, $verbBased);
}
?>
My questions:
1) Will this code work correctly? Does it follow the algorithm's rules?
2) How could you improve the code (performance)?
Thank you very much in advance!
No, your function will not work: it contains syntax errors. For example, you have unclosed quotes and you use invalid switch syntax.
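To illustrate those two points, here is a minimal sketch of the valid forms. The helper name replacementFor is hypothetical and exists only for this example; it is not part of the algorithm's description.

```php
<?php
// Hypothetical helper, for illustration only: maps a verb suffix to its replacement.
function replacementFor($suffix) {
    switch ($suffix) {
        // PHP does not accept comma-separated case lists;
        // stack the case labels so they fall through to the same branch.
        case 'iuntur':
        case 'erunt':
        case 'untur':
        case 'iunt':
        case 'unt':
            return 'i';
        case 'beris':
        case 'bor':
        case 'bo':
            return 'bi';
        case 'ero':
            return 'eri';
        default:
            return ''; // plain removal
    }
}

// Every string literal needs both its opening and its closing quote.
$suffixes = array('ibus', 'ius', 'ae', 'am', 'as');

var_dump(replacementFor('bor')); // string(2) "bi"
```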
Here is my rewrite of the function. As the pseudocode on that page isn't really precise, I had to do some interpreting. I interpreted it in such a way that the examples mentioned in the article work.
I also did some optimizations. The first one is that I define the word and suffix arrays as static. Thus all calls to this function share the same arrays, which should be good for performance ;)
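As a small sketch of what static means inside a function (the function name countCalls is made up for this example): a static variable is initialized on the first call and keeps its value across later calls. How much this actually saves for constant arrays depends on the PHP version, so treat it as an assumption worth measuring.

```php
<?php
// Hypothetical function, for illustration only.
function countCalls() {
    static $calls = 0;                          // initialized once, on the first call
    static $table = array('a' => 1, 'b' => 2);  // also kept between calls
    $calls++;
    return $calls + $table['a'] - 1;            // equals $calls; just shows $table is usable
}

countCalls();
countCalls();
var_dump(countCalls()); // int(3)
```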
Furthermore, I adjusted the arrays so they can be used more efficiently. I changed the $queWords array so it can be used for a fast hash-table lookup with isset, not a slow in_array scan. I also saved the suffix lengths in the arrays, so they don't have to be computed with strlen for every suffix on every call. I may have made more minor optimizations.
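A short sketch of the two lookup styles, using a shortened word list. The variable names $list and $set are only for this example; array_flip is one way to build the keyed array if you prefer not to write the keys out by hand.

```php
<?php
$list = array('atque', 'quoque', 'neque');   // plain list, as in the question

// in_array() walks the list element by element.
var_dump(in_array('neque', $list, true));    // bool(true)

// array_flip() turns the values into keys, so isset() does a hash-table lookup.
$set = array_flip($list);                    // array('atque' => 0, 'quoque' => 1, 'neque' => 2)
var_dump(isset($set['neque']));              // bool(true)
var_dump(isset($set['portaque']));           // bool(false)

// With the length stored next to the suffix, no strlen() call is needed in the loop.
$suffix = 'ibus';
$length = 4;
var_dump(substr('portibus', -$length) === $suffix); // bool(true)
```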
I haven't measured how much faster this code is, but it should be noticeably faster. It also now works on the examples provided.
Here is the code:
<?php
function stemLatin($word) {
static $queWords = array(
'atque' => 1,
'quoque' => 1,
'neque' => 1,
'itaque' => 1,
'absque' => 1,
'apsque' => 1,
'abusque' => 1,
'adaeque' => 1,
'adusque' => 1,
'denique' => 1,
'deque' => 1,
'susque' => 1,
'oblique' => 1,
'peraeque' => 1,
'plenisque' => 1,
'quandoque' => 1,
'quisque' => 1,
'quaeque' => 1,
'cuiusque' => 1,
'cuique' => 1,
'quemque' => 1,
'quamque' => 1,
'quaque' => 1,
'quique' => 1,
'quorumque' => 1,
'quarumque' => 1,
'quibusque' => 1,
'quosque' => 1,
'quasque' => 1,
'quotusquisque' => 1,
'quousque' => 1,
'ubique' => 1,
'undique' => 1,
'usque' => 1,
'uterque' => 1,
'utique' => 1,
'utroque' => 1,
'utribique' => 1,
'torque' => 1,
'coque' => 1,
'concoque' => 1,
'contorque' => 1,
'detorque' => 1,
'decoque' => 1,
'excoque' => 1,
'extorque' => 1,
'obtorque' => 1,
'optorque' => 1,
'retorque' => 1,
'recoque' => 1,
'attorque' => 1,
'incoque' => 1,
'intorque' => 1,
'praetorque' => 1,
);
static $suffixesNoun = array(
'ibus' => 4,
'ius' => 3,
'ae' => 2,
'am' => 2,
'as' => 2,
'em' => 2,
'es' => 2,
'ia' => 2,
'is' => 2,
'nt' => 2,
'os' => 2,
'ud' => 2,
'um' => 2,
'us' => 2,
'a' => 1,
'e' => 1,
'i' => 1,
'o' => 1,
'u' => 1,
);
static $suffixesVerb = array(
'iuntur' => 6,
'beris' => 5,
'erunt' => 5,
'untur' => 5,
'iunt' => 4,
'mini' => 4,
'ntur' => 4,
'stis' => 4,
'bor' => 3,
'ero' => 3,
'mur' => 3,
'mus' => 3,
'ris' => 3,
'sti' => 3,
'tis' => 3,
'tur' => 3,
'unt' => 3,
'bo' => 2,
'ns' => 2,
'nt' => 2,
'ri' => 2,
'm' => 1,
'r' => 1,
's' => 1,
't' => 1,
);
$word = strtr(strtolower(trim($word)), 'jv', 'iu'); // trim, lowercase and j => i, v => u
if (substr($word, -3) == 'que') {
if (isset($queWords[$word])) {
return array($word, $word);
}
$word = substr($word, 0, -3);
}
// default both stems to the normalized word (after -que removal), used when no suffix matches
$stems = array($word, $word);
foreach ($suffixesNoun as $suffix => $length) {
if (substr($word, -$length) == $suffix) {
$tmp = substr($word, 0, -$length);
if (isset($tmp[1]))
$stems[0] = $tmp;
break;
}
}
foreach ($suffixesVerb as $suffix => $length) {
if (substr($word, -$length) == $suffix) {
switch ($suffix) {
case 'iuntur':
case 'erunt':
case 'untur':
case 'iunt':
case 'unt':
$tmp = substr_replace($word, 'i', -$length, $length);
break;
case 'beris':
case 'bor':
case 'bo':
$tmp = substr_replace($word, 'bi', -$length, $length);
break;
case 'ero':
$tmp = substr_replace($word, 'eri', -$length, $length);
break;
default:
$tmp = substr($word, 0, -$length);
}
if (isset($tmp[1]))
$stems[1] = $tmp;
break;
}
}
return $stems;
}
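var_dump(stemLatin('aquila')); // noun-based stem "aquil", verb-based stem "aquila"
var_dump(stemLatin('portat')); // noun-based stem "portat", verb-based stem "porta"
var_dump(stemLatin('portis')); // noun-based stem "port", verb-based stem "por"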