Tags: regex, unix, awk, lemmatization

AWK - forming grammar forms


I have two tab-separated files. One contains lemmas and stems; the other contains the suffixes needed to build the grammatical forms.

File (lemmas and stems):

Lemma    Stem    Pos
ablakzár    ablakz  noun
adminisztrátorlány  adminisztrátorl noun
...
....

File (suffix):

suffix
[r]as
[r][r]er
...
.....

Rules to follow and output:

Lemma    Stem    Suffix    Output
ablakzár    ablakz    [r]as    ablakzras
adminisztrátorlány    adminisztrátorl    [r][r]er    adminisztrátorlnnyer


These are the grammatical forms I need to generate from the two lemmas:
ablakzras
adminisztrátorlnnyer

That is, if the suffix contains one bracketed letter, I take the last consonant of the lemma and append it to the stem; if it contains two bracketed letters, I double that consonant and append the doubled form to the stem. Whatever follows the brackets in the suffix is then appended as well.

The table for doubling consonants:

Single:  b c cs d dz dzs f g gy h j k l ly m n ny p q r s sz t ty v w x y z zs
Doubles: bb cc ccs dd ddz ddzs ff gg ggy hh jj kk ll lly mm nn nny pp qq rr ss ssz tt tty vv ww xx yy zz zzs
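
The lookup described above can be sketched roughly as follows. This is only an illustration, not the code from the post: the function name `last_consonant_doubled` is made up here, and the sketch assumes the lemma actually ends in one of the listed consonants. It builds the table from the two lists with `split` and picks the longest consonant that ends the lemma, so that "ny" is preferred over "y".

```sh
# Hypothetical helper, not from the post: prints "<consonant> <doubled form>"
# for the longest listed consonant that ends the lemma passed as $1.
last_consonant_doubled() {
  awk -v lemma="$1" 'BEGIN {
    # Build the lookup table from the two lists in the question.
    n = split("b c cs d dz dzs f g gy h j k l ly m n ny p q r s sz t ty v w x y z zs", single, " ")
    split("bb cc ccs dd ddz ddzs ff gg ggy hh jj kk ll lly mm nn nny pp qq rr ss ssz tt tty vv ww xx yy zz zzs", dbl, " ")
    for (i = 1; i <= n; i++) doubled[single[i]] = dbl[i]

    # Keep the longest key that matches the end of the lemma, so "ny" wins over "y".
    best = ""
    len = length(lemma)
    for (k in doubled) {
      kl = length(k)
      if (kl > length(best) && kl <= len && substr(lemma, len - kl + 1) == k) best = k
    }
    # Print nothing when the lemma does not end in a listed consonant.
    if (best != "") print best, doubled[best]
  }'
}

last_consonant_doubled "adminisztrátorlány"   # prints: ny nny
```

Comparing whole suffixes with `substr` rather than splitting the lemma into characters means the test does not depend on whether the awk in use counts bytes or characters, since all the keys are plain ASCII.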

In the end I solved the problem myself. I'm posting the solution in case it helps anyone else:

BEGIN {
    OFS=FS="\t";

    while ((getline line < file ) > 0)
    {
        models[++c]=line; 
    }

    v="a o u ö ü e i á ó ú ő ű é í";
    a1=split(v,vocals," ");

    doubled_consonants["b"]="bb";  doubled_consonants["c"]="cc";
    doubled_consonants["cs"]="ccs";  doubled_consonants["d"]="dd";
    doubled_consonants["dz"]="ddz";  doubled_consonants["dzs"]="ddzs"; 
    doubled_consonants["f"]="ff";  doubled_consonants["g"]="gg";
    doubled_consonants["gy"]="ggy";  doubled_consonants["h"]="hh";
    doubled_consonants["j"]="jj";  doubled_consonants["k"]="kk";
    doubled_consonants["l"]="ll";  doubled_consonants["ly"]="lly";
    doubled_consonants["m"]="mm";  doubled_consonants["n"]="nn";
    doubled_consonants["ny"]="nny";  doubled_consonants["p"]="pp";
    doubled_consonants["q"]="qq";  doubled_consonants["r"]="rr";
    doubled_consonants["s"]="ss";  doubled_consonants["sz"]="ssz";
    doubled_consonants["t"]="tt";  doubled_consonants["ty"]="tty";
    doubled_consonants["v"]="vv";  doubled_consonants["w"]="ww";
    doubled_consonants["x"]="xx";  doubled_consonants["y"]="yy";
    doubled_consonants["z"]="zz";  doubled_consonants["zs"]="zzs";

}
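{
    # gawk: an empty separator splits into single characters, so s1 is the lemma length
    s1=split($1,lemma_letters,"")
    for (i=1; i<=c; i++)
    {
        s2=split(models[i],model,"\t");
        # The suffix is column 1 of the suffix file shown above; adjust the index if your file differs
        s3=split(model[1],suffix_letters,"");
        tp="";   # ending built for this lemma/suffix pair

        for (j=1; j<=s3; j++)
        {
            switch (suffix_letters[j]) {   # switch is a gawk extension
            case "[":
                wz=extrac_consonant($1,s1,doubled_consonants)
                wa=double_single(j,s3,suffix_letters)

                if (wa == 0)
                {
                    tp=tp wz;
                    j+=2;   # skip the rest of "[r]"
                }
                else
                {
                    tp=tp doubled_consonants[wz];
                    j+=5;   # skip the rest of "[r][r]"
                }

                break;
            default:
                tp=tp suffix_letters[j];   # copy the literal part of the suffix
                break;
            }
        }
        print $1, $2, model[1], $2 tp;   # Lemma, Stem, Suffix, Output
    }
}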

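function extrac_consonant(string,leng,double,    q1,q2,q3,cons)
{
    # string - lemma
    # leng - lemma length
    # double - array (doubled_consonants)
    # q1,q2,q3,cons - local variables

    q1=substr(string,(leng-2));   # last three characters, e.g. "dzs"
    q2=substr(string,(leng-1));   # last two characters, e.g. "ny"
    q3=substr(string,leng);       # last character

    # Try the longest candidate first; "in" tests membership
    # without creating empty entries in the array
    if (q1 in double)
    {
        cons=q1;
    }
    else if (q2 in double)
    {
        cons=q2;
    }
    else 
    {
        cons=q3;
    }
    return cons;
}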
function double_single(x5,x6,arr5)
{
    # x5 - j value in switch statement
    # x6 - suffix length
    # arr5 - array (suffix_letters)

    flag=0;
    for (g=(x5+1); g<=x6; g++)
    {
        if (arr5[g] == "[")
        {
            flag=1;
        }
    }
    return flag; # 1 if the consonant must be doubled ([r][r]), 0 if not ([r])
}

Solution

  • awk to the rescue! I've coded a prototype for you which you can extend further

    $ awk 'NR>1{n=gsub("\\[","[",$1);
                c=substr($2,length($2),1);
                sub("\\[.*\\]","",$1);
                for(i=1;i<=n;i++) $1=c""$1;
                print $3 $1}' <(paste suffix lemma)
    
    ablakzras
    adminisztrátorlyyer
    

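    Note that it picks the last character of the lemma and doesn't check whether it is a vowel or a consonant, which is why the second form comes out with yy rather than nny. I'm not sure what you want to do with the double consonants.

    The code should be easy to understand; if not, I can explain further.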
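
One way the prototype might be extended to respect the consonant table is sketched below. It is a rough sketch under several assumptions, not a definitive implementation: the wrapper name `make_forms` is hypothetical, both files are assumed to be tab-separated with one header line and paired line by line (as `paste` does above), the bracket groups are assumed to sit at the start of the suffix, and lemmas that do not end in a listed consonant are simply skipped.

```sh
# Hypothetical wrapper, not from the post: $1 = suffix file, $2 = lemma file,
# both tab-separated with one header line, paired line by line as in the prototype.
make_forms() {
  paste "$1" "$2" | awk -F '\t' '
    BEGIN {
      # Doubling table taken from the two lists in the question.
      n = split("b c cs d dz dzs f g gy h j k l ly m n ny p q r s sz t ty v w x y z zs", single, " ")
      split("bb cc ccs dd ddz ddzs ff gg ggy hh jj kk ll lly mm nn nny pp qq rr ss ssz tt tty vv ww xx yy zz zzs", dbl, " ")
      for (i = 1; i <= n; i++) doubled[single[i]] = dbl[i]
    }
    NR > 1 {
      suffix = $1; lemma = $2; stem = $3
      groups = gsub(/\[/, "[", suffix)   # number of bracket groups: 1 = single, 2 = double
      sub(/\[.*\]/, "", suffix)          # keep only the text after the brackets

      # Longest listed consonant at the end of the lemma ("ny" rather than "y").
      cons = ""
      len = length(lemma)
      for (k in doubled) {
        kl = length(k)
        if (kl > length(cons) && kl <= len && substr(lemma, len - kl + 1) == k) cons = k
      }
      if (cons == "") next               # assumption: skip lemmas without a final consonant

      if (groups == 0) mid = ""
      else if (groups == 1) mid = cons
      else mid = doubled[cons]
      print stem mid suffix
    }'
}
```

Called as `make_forms suffix lemma` on the two sample files from the question, this should print `ablakzras` and `adminisztrátorlnnyer`.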