Tags: regex, sed, gnuwin32

Remove duplicate words in a line with sed GnuWin32


I'm trying to remove repeated words from a text. It is the same issue as described in these questions: "Remove duplicate words in a line with sed" and "Removing duplicate strings with SED". But those solutions do not work for me, maybe because I'm using GnuWin32.

Example of the result I need:

Input

One two three bird animal two bird

Output

One two three bird animal

Solution

  • sed is not really designed for this kind of work. It has only two forms of memory, the pattern space and the hold space, which are nothing more than two plain strings it can remember. Every time you operate on one of them, you have to rewrite and reanalyze the whole string. Awk, on the other hand, is more flexible here and makes it easier to manipulate the words of the line in question.

    awk '{delete s}
         {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
         {printf ORS}' file
    

    But since you work on a Windows machine, your files probably have CRLF line endings. This can cause a slight problem with the last word of a line. If the line reads:

    foo bar foo
    

    awk would read it as

    foo bar foo\r
    

    and thus the last foo will not match the first foo due to the CR.

    A correction would now read:

    awk 'BEGIN{RS=ORS="\r\n"}
         {delete s}
         {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
         {printf ORS}' file
    

    This works assuming the awk you run is GNU awk (gawk), as the GnuWin32 build should be, because it relies on the GNU extension that allows RS to be a regular expression or a multi-character value.
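
    If the awk at hand turns out not to be GNU awk, a multi-character RS is not guaranteed to work. A minimal sketch of an alternative, assuming a POSIX-style shell for the quoting: strip a trailing CR from each record before the words are compared, and write it back on output. The name dedupe_crlf is made up for illustration and is not part of the answer above.

    ```shell
    # dedupe_crlf is a hypothetical helper name used only for this sketch.
    dedupe_crlf() {
      awk '{
        cr = sub(/\r$/, "")      # remove a trailing CR; changing $0 re-splits the fields
        split("", seen)          # portable way to empty the array for each line
        sep = ""
        for (i = 1; i <= NF; ++i)
          if (!(seen[$i]++)) { printf "%s%s", sep, $i; sep = OFS }
        printf "%s\n", (cr ? "\r" : "")   # restore the CR only if the input line had one
      }' "$@"
    }

    printf 'foo bar foo\r\n' | dedupe_crlf   # prints: foo bar (CRLF ending preserved)
    ```

    Lines that end in a plain LF pass through unchanged, so the same function should cope with files that mix both kinds of line ending.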

    If you want case-insensitivity, you can replace s[$i] with s[tolower($i)].
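
    A small illustration of the tolower() change, assuming LF line endings and a POSIX-style shell; dedupe_nocase is a hypothetical name. Words are compared by their lowercase form, and the first spelling encountered is the one that is kept.

    ```shell
    # dedupe_nocase is a hypothetical helper name used only for this sketch.
    dedupe_nocase() {
      awk '{
        split("", seen)          # empty the array for each line
        sep = ""
        for (i = 1; i <= NF; ++i)
          if (!(seen[tolower($i)]++)) { printf "%s%s", sep, $i; sep = OFS }
        printf "%s", ORS
      }' "$@"
    }

    printf 'Bird animal bird BIRD\n' | dedupe_nocase   # prints: Bird animal
    ```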

    There are still issues with sentences like

    "There was a horse in the bar, it ran out of the bar."
    

    The word bar occurs twice here, but the attached , and . prevent the two occurrences from matching. This can be solved with:

    awk 'BEGIN{RS=ORS="\r\n"; ere="[,.?:;\042\047]+"}
         {delete s}
         {for(i=1;i<=NF;++i) {
            key=tolower($i); sub("^" ere,"",key); sub(ere "$","",key)
            if(!(s[key]++)) printf (i==1?"":OFS)"%s",$i
          }
         }
         {printf ORS}' file
    

    This essentially does the same, but removes the punctuation marks at the beginning and end of a word before comparing. The punctuation marks are listed in ere, where \042 and \047 are the octal escapes for the double and the single quote.
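
    To see what this does to the sample sentence, here is a sketch of the same idea wrapped in a shell function, assuming LF line endings and a POSIX-style shell. The name dedupe_punct is hypothetical, and the pattern carries a + so that a run of marks such as ." is stripped in one go.

    ```shell
    # dedupe_punct is a hypothetical helper name used only for this sketch.
    dedupe_punct() {
      awk 'BEGIN { ere = "[,.?:;\042\047]+" }   # \042 is a double quote, \047 an apostrophe
      {
        split("", seen)                # empty the array for each line
        sep = ""
        for (i = 1; i <= NF; ++i) {
          key = tolower($i)
          sub("^" ere, "", key)        # strip leading punctuation from the comparison key
          sub(ere "$", "", key)        # strip trailing punctuation from the comparison key
          if (!(seen[key]++)) { printf "%s%s", sep, $i; sep = OFS }
        }
        printf "%s", ORS
      }' "$@"
    }

    printf '"There was a horse in the bar, it ran out of the bar."\n' | dedupe_punct
    # prints: "There was a horse in the bar, it ran out of
    ```

    Note that a dropped word takes its attached punctuation with it, so the closing period and quote disappear together with the second bar.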