Remove duplicate words in a line with sed GnuWin32

I'm trying to remove repeated words in a text. The same issue is described in these questions: "Remove duplicate words in a line with sed" and "Removing duplicate strings with SED". But those variants do not work for me, maybe because I'm using GnuWin32.

Example of the result I need:

Input

One two three bird animal two bird

Output

One two three bird animal

Solution

sed is not really designed for this kind of work. It has only two forms of memory, the pattern space and the hold space, which are nothing more than two plain strings. Every time you operate on one of them, you have to rewrite the whole string and re-analyze it. Awk, on the other hand, gives you fields and associative arrays, which makes it much easier to manipulate the words of a line.

awk '{delete s}
     {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
     {printf ORS}' file
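
As a quick check, the same logic can be run on the sample input from the question. This is only a sketch: the function name dedup_words and the variable result are made up for illustration, and the input is piped in with LF line endings rather than read from a file.

```shell
# dedup_words is a hypothetical helper that wraps the awk program above.
dedup_words() {
  awk '{ delete s }                       # forget the words seen on the previous line
       { for (i = 1; i <= NF; ++i)
           if (!(s[$i]++))                # true only the first time a word appears
             printf "%s%s", (i == 1 ? "" : OFS), $i }
       { printf "%s", ORS }'
}

result=$(printf 'One two three bird animal two bird\n' | dedup_words)
echo "$result"    # One two three bird animal
```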

But since you work on a Windows machine, your files most likely have CRLF line endings. This can cause a subtle problem with the last word of each line. If the line reads:

foo bar foo

awk would read it as

foo bar foo\r

and thus the last foo will not match the first foo due to the CR.
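
The stray CR is invisible, so it can help to make it show up. A minimal sketch, assuming an awk that does not translate line endings itself (the variable name cr_check is only illustrative): it prints the length of the first and of the last field.

```shell
# Feed one CRLF-terminated line to awk and compare field lengths.
cr_check=$(printf 'foo bar foo\r\n' | awk '{ print length($1), length($NF) }')
echo "$cr_check"
```

On a typical Unix-like awk this prints 3 4: the last field is one character longer because the CR is still attached to it.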

A corrected version would read:

awk 'BEGIN{RS=ORS="\r\n"}
     {delete s}
     {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
     {printf ORS}' file

This assumes your awk is GNU awk (gawk), which is what GnuWin32 provides. As an extension, gawk allows RS to be a regular expression or a multi-character string.
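
If the awk at hand does not support a multi-character RS, one possible alternative is to leave RS alone and strip a trailing CR from each record before processing it. The sketch below is an assumption on my part, not part of the answer above. The names dedup_crlf and crlf_result are made up, and the output is written with plain LF line endings.

```shell
# dedup_crlf is a hypothetical helper: remove the CR first, then deduplicate.
dedup_crlf() {
  awk '{ sub(/\r$/, "") }                 # drop a trailing CR, if present; this also re-splits $0 into fields
       { delete s }
       { for (i = 1; i <= NF; ++i)
           if (!(s[$i]++))
             printf "%s%s", (i == 1 ? "" : OFS), $i }
       { printf "%s", ORS }'
}

crlf_result=$(printf 'foo bar foo\r\nOne two two\r\n' | dedup_crlf)
echo "$crlf_result"
```

With the CR removed, the last foo on the first line compares equal to the first one and is dropped.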

If you want case-insensitivity (so that Two and two count as the same word), you can replace s[$i] with s[tolower($i)].
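
A small sketch of that change, again with made-up names (dedup_nocase, nocase_result) and LF-terminated input for simplicity. Only the array key is lowercased, so the word is printed with the spelling of its first occurrence.

```shell
# dedup_nocase is a hypothetical helper: compare words case-insensitively.
dedup_nocase() {
  awk '{ delete s }
       { for (i = 1; i <= NF; ++i)
           if (!(s[tolower($i)]++))       # key is the lowercased word
             printf "%s%s", (i == 1 ? "" : OFS), $i }
       { printf "%s", ORS }'
}

nocase_result=$(printf 'One two ONE Two three\n' | dedup_nocase)
echo "$nocase_result"    # One two three
```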

There are still issues with sentences like

"There was a horse in the bar, it ran out of the bar."

The word bar appears twice here, but the attached , and . make the two occurrences differ (bar, versus bar.), so they do not match. This can be solved with:

awk 'BEGIN{RS=ORS="\r\n"; ere="[,.?:;\042\047]"}
     {delete s}
     {for(i=1;i<=NF;++i) {
        # comparison key: lowercased word with leading/trailing punctuation stripped
        key=tolower($i); sub("^" ere "+","",key); sub(ere "+$","",key)
        if(!(s[key]++)) printf (i==1?"":OFS)"%s",$i
      }
     }
     {printf ORS}' file

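This essentially does the same, but it strips punctuation marks from the beginning and end of each word before comparing, and it lowercases the key so the comparison is also case-insensitive. The punctuation marks to strip are listed in ere; \042 and \047 are the octal escapes for the double and single quote.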
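
To see the effect on the example sentence, here is a hedged sketch with hypothetical names (dedup_punct, punct_result). It assumes LF-terminated input, so RS and ORS are left at their defaults, and it strips runs of punctuation (the + after the bracket expression) rather than a single character.

```shell
# dedup_punct is a hypothetical helper: ignore case and surrounding punctuation.
dedup_punct() {
  awk 'BEGIN { ere = "[,.?:;\042\047]" }  # \042 and \047 are the double and single quote
       { delete s }
       { for (i = 1; i <= NF; ++i) {
           key = tolower($i)
           sub("^" ere "+", "", key)      # strip leading punctuation from the key
           sub(ere "+$", "", key)         # strip trailing punctuation from the key
           if (!(s[key]++))
             printf "%s%s", (i == 1 ? "" : OFS), $i
         }
       }
       { printf "%s", ORS }'
}

punct_result=$(printf 'There was a horse in the bar, it ran out of the bar.\n' | dedup_punct)
echo "$punct_result"    # There was a horse in the bar, it ran out of
```

Note that the final period disappears together with the dropped word, because punctuation is only stripped from the comparison key and stays attached to the word itself.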