I have two tab-separated files. One contains lemmas and their stems; the other contains the suffixes needed to build the inflected forms.
File (lemmas and stems):
Lemma Stem Pos
ablakzár ablakz noun
adminisztrátorlány adminisztrátorl noun
...
....
File (suffix):
suffix
[r]as
[r][r]er
...
.....
Rules to follow and output:
Lemma Stem Suffix Output
ablakzár ablakz [r]as ablakzras
adminisztrátorlány adminisztrátorl [r][r]er adminisztrátorlnnyer
These are the inflected forms that I need to generate from the two lemmas:
ablakzras
adminisztrátorlnnyer
That is, if the suffix starts with one bracketed letter ([r]), I take the last consonant of the lemma and append it to the stem; if it starts with two bracketed letters ([r][r]), I append the doubled form of that consonant instead. In both cases, whatever follows the brackets is appended afterwards.
The table for doubling consonants:
Single: b c cs d dz dzs f g gy h j k l ly m n ny p q r s sz t ty v w x y z zs
Doubles: bb cc ccs dd ddz ddzs ff gg ggy hh jj kk ll lly mm nn nny pp qq rr ss ssz tt tty vv ww xx yy zz zzs
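The two rows above can be turned into a lookup table by splitting them in parallel instead of typing every pair by hand. The following is only a minimal sketch, assuming a POSIX awk and a lemma that ends in one of the listed consonants; the shell function name double_last is made up for illustration.

```sh
# double_last LEMMA: print the last consonant of LEMMA and its doubled form.
# Hypothetical helper, not part of the original post.
double_last() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      # Build single -> doubled from the two rows of the table.
      n = split("b c cs d dz dzs f g gy h j k l ly m n ny p q r s sz t ty v w x y z zs", single, " ")
      split("bb cc ccs dd ddz ddzs ff gg ggy hh jj kk ll lly mm nn nny pp qq rr ss ssz tt tty vv ww xx yy zz zzs", doubled, " ")
      for (i = 1; i <= n; i++) dbl[single[i]] = doubled[i]
    }
    {
      # Try the longest candidate first: trigraph (dzs), digraph (ny), single letter.
      for (k = 3; k >= 1; k--) {
        if (k > length($0)) continue
        tail = substr($0, length($0) - k + 1)
        if (tail in dbl) { print tail, dbl[tail]; next }
      }
      # Nothing is printed when the lemma does not end in a listed consonant.
    }'
}
```

For the two sample lemmas, `double_last ablakzár` prints `r rr` and `double_last adminisztrátorlány` prints `ny nny`.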
In the end I solved the problem myself. Here is my solution (it needs gawk, because it uses switch), in case it is useful to anyone else:
BEGIN {
    OFS=FS="\t";
    # file is the name of the suffix file; pass it with -v file=... on the command line
    while ((getline line < file ) > 0)
    {
        models[++c]=line;
    }
    # list of vowels (not used by the code below)
    v="a o u ö ü e i á ó ú ő ű é í";
    a1=split(v,vocals," ");
    # single consonant -> doubled consonant
    doubled_consonants["b"]="bb"; doubled_consonants["c"]="cc";
    doubled_consonants["cs"]="ccs"; doubled_consonants["d"]="dd";
    doubled_consonants["dz"]="ddz"; doubled_consonants["dzs"]="ddzs";
    doubled_consonants["f"]="ff"; doubled_consonants["g"]="gg";
    doubled_consonants["gy"]="ggy"; doubled_consonants["h"]="hh";
    doubled_consonants["j"]="jj"; doubled_consonants["k"]="kk";
    doubled_consonants["l"]="ll"; doubled_consonants["ly"]="lly";
    doubled_consonants["m"]="mm"; doubled_consonants["n"]="nn";
    doubled_consonants["ny"]="nny"; doubled_consonants["p"]="pp";
    doubled_consonants["q"]="qq"; doubled_consonants["r"]="rr";
    doubled_consonants["s"]="ss"; doubled_consonants["sz"]="ssz";
    doubled_consonants["t"]="tt"; doubled_consonants["ty"]="tty";
    doubled_consonants["v"]="vv"; doubled_consonants["w"]="ww";
    doubled_consonants["x"]="xx"; doubled_consonants["y"]="yy";
    doubled_consonants["z"]="zz"; doubled_consonants["zs"]="zzs";
}
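NR > 1 {    # NR > 1 skips the header line of the lemma file
    s1=split($1,lemma_letters,"")    # s1 = number of letters in the lemma
    for (i=2; i<=c; i++)    # models[1] is the header line of the suffix file
    {
        s2=split(models[i],model,"\t");
        s3=split(model[1],suffix_letters,"");    # the suffix file has a single column
        tp="";    # ending built for this suffix
        for (j=1; j<=s3; j++)
        {
            switch (suffix_letters[j]) {
            case "[":
                wz=extrac_consonant($1,s1,doubled_consonants)
                wa=double_single(j,s3,suffix_letters)
                if (wa == 0)
                {
                    tp=tp wz;
                    j+=2;    # skip the rest of "[r]"
                }
                else
                {
                    tp=tp doubled_consonants[wz];
                    j+=5;    # skip the rest of "[r][r]"
                }
                break;
            default:
                tp=tp suffix_letters[j];
                break;
            }
        }
        # Lemma, Stem, Suffix, Output
        print $1, $2, model[1], $2 tp;
    }
}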
function extrac_consonant(string,leng,double,    q1,q2,q3,cons)
{
    # string - lemma
    # leng - lemma length
    # double - array (doubled_consonants)
    # q1, q2, q3, cons - local variables
    q1=substr(string,(leng-2));    # last three letters (e.g. dzs)
    q2=substr(string,(leng-1));    # last two letters (e.g. ny)
    q3=substr(string,leng);        # last letter
    if (q1 in double)
    {
        cons=q1;
    }
    else if (q2 in double)
    {
        cons=q2;
    }
    else
    {
        cons=q3;
    }
    return cons;
}
function double_single(x5,x6,arr5,    g,flag)
{
    # x5 - j value in switch statement
    # x6 - suffix length
    # arr5 - array (suffix_letters)
    # g, flag - local variables
    flag=0;
    for (g=(x5+1); g<=x6; g++)
    {
        if (arr5[g] == "[")
        {
            flag=1;
        }
    }
    return flag; # 1 if the consonant has to be doubled ([r][r]), 0 if not ([r])
}
awk to the rescue! I've coded a prototype for you, which you can extend further:
$ awk 'NR>1{n=gsub("\\[","[",$1);
c=substr($2,length($2),1);
sub("\\[.*\\]","",$1);
for(i=1;i<=n;i++) $1=c""$1;
print $3 $1}' <(paste suffix lemma)
ablakzras
adminisztrátorlyyer
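Note that it picks the last character of the lemma without checking whether it is a vowel or a consonant, and it does not recognise digraphs such as ny, which is why the second form comes out as adminisztrátorlyyer instead of adminisztrátorlnnyer. I'm not sure what you want to do with the doubled consonants.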
The code should be easy to understand; if not, I can explain further.
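One possible way to extend the prototype so that it copes with digraphs and doubling is to try the longest trailing consonant first and to count the bracket groups in the suffix. This is only a sketch under assumptions: the function name inflect and its three arguments are invented here, it pairs one lemma with one suffix, and like the prototype it takes the end of the lemma as it is, without a vowel check. It also relies on the observation that every doubled form in the question's table is the single form with its first letter repeated (ny becomes nny, dzs becomes ddzs).

```sh
# inflect LEMMA STEM SUFFIX: print the inflected form.
# Hypothetical helper, not part of the original answer.
inflect() {
  awk -v lemma="$1" -v stem="$2" -v suffix="$3" '
    BEGIN {
      n = split("b c cs d dz dzs f g gy h j k l ly m n ny p q r s sz t ty v w x y z zs", single, " ")
      for (i = 1; i <= n; i++) known[single[i]] = 1

      # Longest trailing consonant: trigraph, then digraph, then single letter.
      cons = ""
      for (k = 3; k >= 1 && cons == ""; k--) {
        if (k > length(lemma)) continue
        tail = substr(lemma, length(lemma) - k + 1)
        if (tail in known) cons = tail
      }

      # Remove each "[...]" group from the suffix and count how many there were.
      groups = 0
      while ((o = index(suffix, "[")) > 0 && (c = index(suffix, "]")) > o) {
        suffix = substr(suffix, 1, o - 1) substr(suffix, c + 1)
        groups++
      }

      # Two groups mean doubling: repeat the first letter of the consonant.
      if (groups >= 2) cons = substr(cons, 1, 1) cons

      print stem cons suffix
    }'
}
```

With the sample rows, `inflect ablakzár ablakz '[r]as'` prints ablakzras and `inflect adminisztrátorlány adminisztrátorl '[r][r]er'` prints adminisztrátorlnnyer.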