
Extracting the root of words with awk


I have an awk script that finds word frequencies.

{$0 = tolower($0)}  {gsub(/[[:punct:]]/, "")} {for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]} 

I work with Turkish text. Turkish words mostly appear with suffixes.

A sample of results from this script:

kadınlar       1
kadınlara      1
kadınlarımızın 1
kadınlarına    1
kadınlarının   1

Here the root is “kadın” ("woman" in English).

So, “kadınlar” is “women”. “Kadınlara” is “to women” and so on.

Can awk extract the root “kadın” from these 5 words? Do we need to check a dictionary for this?

Expected output:

These 5 words with the same root (kadın),

kadınlar       1
kadınlara      1
kadınlarımızın 1
kadınlarına    1
kadınlarının   1

should be counted together under their root, like this:

kadın 5

Solution

  • Rather than writing an awk script, it is probably simpler to use an existing tool.

    snowballstemmer appears to be available for Python.

    I don't know Python, but it's easy enough to write something that uses it:

    $ pip install snowballstemmer
    Defaulting to user installation because normal site-packages is not writeable
    Collecting snowballstemmer
      Downloading snowballstemmer-2.2.0-py2.py3-none-any.whl (93 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93.0/93.0 KB 2.6 MB/s eta 0:00:00
    Installing collected packages: snowballstemmer
    Successfully installed snowballstemmer-2.2.0
    $ cat >input <<'EOD'
    kadınlar
    kadınlara
    kadınlarımızın
    kadınlarına
    kadınlarının
    EOD
    $ cat >tstem <<'EOD'
    #!/usr/bin/python3
    
    import snowballstemmer
    stemmer = snowballstemmer.stemmer('turkish')
    
    for word in open('input','r').read().splitlines():
        print(word,"->",stemmer.stemWord(word))
    
    EOD
    $ chmod +x tstem
    $ ./tstem
    kadınlar -> kadın
    kadınlara -> kadın
    kadınlarımızın -> kadın
    kadınlarına -> kadın
    kadınlarının -> kadın
    $
    

    The most popular stemmer on GitHub seems to be Turkish Stemmer for Python:

    $ pip install TurkishStemmer
    Defaulting to user installation because normal site-packages is not writeable
    Collecting TurkishStemmer
      Downloading TurkishStemmer-1.3-py3-none-any.whl (20 kB)
    Installing collected packages: TurkishStemmer
    Successfully installed TurkishStemmer-1.3
    $ cat >tstem2 <<'EOD'
    #!/usr/bin/python3
    
    from TurkishStemmer import TurkishStemmer
    stemmer = TurkishStemmer()
    
    for word in open('input','r').read().splitlines():
        print(word,"->",stemmer.stem(word))
    
    EOD
    $ chmod +x tstem2
    $ ./tstem2
    kadınlar -> kat
    kadınlara -> kadın
    kadınlarımızın -> kadın
    kadınlarına -> kadın
    kadınlarının -> kadın
    $
    

    This gets one wrong: it stems kadınlar to kat. (But perhaps it gets right some words that snowballstemmer gets wrong?)
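    Whether the two stemmers disagree anywhere else can be checked mechanically rather than by eye. The following is only a sketch: the names disagreements, load_stemmers and main are made up for illustration, and it assumes both packages are installed as shown above and that the word list is a UTF-8 file with whitespace-separated words.

```python
#!/usr/bin/python3
import sys


def disagreements(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for every word where the stems differ."""
    rows = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            rows.append((word, a, b))
    return rows


def load_stemmers():
    """Build both stemmers the same way the two scripts above do (third-party packages)."""
    import snowballstemmer
    from TurkishStemmer import TurkishStemmer
    return snowballstemmer.stemmer('turkish').stemWord, TurkishStemmer().stem


def main(path):
    snowball, turkish = load_stemmers()
    with open(path, encoding='utf-8') as f:
        words = f.read().split()
    # One line per disagreement: word, snowballstemmer result, TurkishStemmer result
    for word, a, b in disagreements(words, snowball, turkish):
        print(word, a, b)


# Only runs when a word-list file is given on the command line
if __name__ == '__main__' and len(sys.argv) == 2:
    main(sys.argv[1])
```

    Saved under a name such as tcompare and run as ./tcompare input, this should list only kadınlar for the five sample words, going by the two runs above. Running it on a larger word list would show which stemmer suits the text better; deciding which answer is correct still needs a human (or a dictionary).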


    A sample complete implementation, which lowercases the text, replaces punctuation with spaces, stems each word and counts the stems:

    $ cat >tstem3 <<'EOD'
    #!/usr/bin/python3
    
    import sys
    import snowballstemmer
    stemmer = snowballstemmer.stemmer('turkish')
    
    for line in sys.stdin:
        for word in line.split():
            print(stemmer.stemWord(word))
    EOD
    $ chmod +x tstem3
    $ <original-input.txt tr '[:upper:]' '[:lower:]' |
      tr -s '[:punct:]' ' ' |
      ./tstem3 |
      sort |
      uniq -c
          5 kadın
    $
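    The same pipeline can also be kept in a single Python script, which may be handier if the counts are needed for further processing. This is a rough sketch, not a tested replacement: the names turkish_lower, stem_counts and main are invented here, and the stemmer is passed in as a plain function so either package can be plugged in. One thing it handles explicitly is that Python's str.lower() turns I into i rather than ı, so the two Turkish i pairs are mapped by hand first.

```python
#!/usr/bin/python3
import string
import sys
from collections import Counter

# Turn every ASCII punctuation character into a space (like tr '[:punct:]' ' ')
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def turkish_lower(text):
    """Lowercase text using the Turkish pairs I -> ı and İ -> i."""
    return text.replace('I', 'ı').replace('İ', 'i').lower()


def stem_counts(lines, stem):
    """Count words per stem; `stem` is any function mapping a word to its stem."""
    counts = Counter()
    for line in lines:
        for word in turkish_lower(line).translate(PUNCT_TO_SPACE).split():
            counts[stem(word)] += 1
    return counts


def main(path):
    import snowballstemmer  # third-party, installed with pip as shown earlier
    stemmer = snowballstemmer.stemmer('turkish')
    with open(path, encoding='utf-8') as f:
        counts = stem_counts(f, stemmer.stemWord)
    # Print "stem count" pairs, sorted by stem
    for stem, n in sorted(counts.items()):
        print(stem, n)


# Only runs when an input file is given on the command line
if __name__ == '__main__' and len(sys.argv) == 2:
    main(sys.argv[1])
```

    Saved as, for example, tstem4 and made executable, it would be run as ./tstem4 original-input.txt. Only ASCII punctuation is stripped, so typographic quotes and similar characters would still need separate handling.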