
How to remove duplicate words from a plain text file using linux command


I have a plain text file with words separated by commas, for example:

word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly...


Solution

  • Assuming that the words are one per line, and the file is already sorted:

    uniq filename
    

    If the file's not sorted:

    sort filename | uniq
    

    If they're not one per line, and you don't mind them being one per line:

    tr -s '[:space:]' '\n' < filename | sort | uniq
    

    That doesn't remove punctuation, though, so maybe you want:

    tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
    

    But that also splits hyphenated words at the hyphen. See "man tr" for more options.
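Note that the question's input is comma-separated on a single line, and `sort | uniq` also reorders the words. A sketch that handles that case while preserving first-seen order, using the question's sample data (the file name `words.txt` is an assumption, and the stray "word 3" is read as "word3"):

    # Sample input from the question (file name words.txt is an assumption)
    printf 'word1, word2, word3, word2, word4, word5, word3, word6, word7, word3\n' > words.txt

    # Split on commas/spaces, dedupe preserving first-seen order, rejoin with ", "
    tr -s ', ' '\n' < words.txt | awk '!seen[$0]++' | paste -sd, - | sed 's/,/, /g'
    # → word1, word2, word3, word4, word5, word6, word7

Here `awk '!seen[$0]++'` prints each line only the first time it appears, so no sorting is needed.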