bash, sed, stemming, suffix

Stemming a text file to remove suffixes given linewise in another file using sed


I have a file, suffix.txt, that contains one suffix per line, for example:

ing
ness
es
ed
tion

I also have a text file, text.txt, that contains some text. It is given that text.txt consists only of lowercase letters, with no punctuation. For example:

the raining cloud answered the man all his interrogation and with all
questioned mind the princess responded
harness all goodness without getting irritated

I want to strip these suffixes from the words in text.txt, removing at most one suffix from each original word. Thus I expect the following output:

the rain cloud answer the man all his interroga and with all
question mind the princess respond
har all good without gett irritat

Note that tion was not removed from questioned, since the original word did not end in tion. It would be really helpful if someone could answer this with sed commands. I was using a naive script that doesn't do the job:

#!/bin/bash

while read p; do
  sed -i "s/$p / /g" text.txt;
  sed -i "s/$p$//g" text.txt;
done <suffix.txt

Solution

  • An awk solution:

    $ awk '
    NR==FNR {                   # first file: build a regex of suffixes
        s=s (s==""?"(":"|") $0  # builds (ing|ness|es|ed|tion
        next
    }
    FNR==1 {                    # first line of the second file:
        s=s ")$"                # close the group and anchor it with )$
    }
    {
        for(i=1;i<=NF;i++)      # iterate all the words and
            sub(s,"",$i)        # apply regex to each of them
    }1' suffix.txt text.txt     # print every (modified) line
    

    Output:

    the rain cloud answer the man all his interroga and with all
    question mind the princess respond
    har all good without gett irritat
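
  • A sed sketch, since the question asked for one. The loop in the question runs one sed pass per suffix, so a later pass can strip a second suffix from a word that an earlier pass already shortened (questioned becomes question, then ques). Joining all the suffixes into a single alternation and substituting in one pass avoids that, because sed does not rescan text it has already replaced. This is only a sketch under some assumptions: the function name stem_once is hypothetical, suffix.txt is assumed to contain no blank lines and no regex metacharacters, words are assumed to be separated by single spaces, and sed is assumed to support -E (GNU and BSD sed both do).

```shell
#!/bin/bash
# stem_once SUFFIX_FILE TEXT_FILE   (hypothetical helper, not from the answer above)
# Strips at most one listed suffix from the end of each word.
stem_once() {
  local alt
  # Join the suffix lines into one alternation: ing|ness|es|ed|tion
  alt=$(paste -sd'|' "$1")
  # Match a suffix only when it is followed by a space or the end of
  # the line, and put that delimiter back with \2. Replaced text is
  # not rescanned, so each word loses at most one suffix.
  sed -E "s/($alt)( |\$)/\2/g" "$2"
}
```

    Usage: stem_once suffix.txt text.txt prints the stemmed text to standard output and leaves text.txt untouched. Unlike the awk version, a word that consists of nothing but a suffix (for example a standalone ed) is reduced to an empty string here, so treat that case separately if it can occur.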