shell awk grep uniq wc

Get the count of unique words in a file using grep and wc


I need a command that counts the unique words in a file using grep.

I tried using grep along with uniq and sort, but I need a way to use only the grep and wc commands. These are the two approaches I have working so far, but I need to do it using only grep:

$ grep -oE '\w+' 'file.txt' | sort | uniq | wc -l
$ grep -oE '\w+' 'file.txt' > temp.txt && awk '!seen[$0]++' temp.txt | wc -l

Sample input file:

one two three four five
two four one six
eight three seven five

Output: unique word count: 8

Is it possible to first extract the words with grep -oE '\w+' file.txt, then grep for each word in a second file that starts out empty, appending the word to that file only if grep does not find it there? That way, only words not already present in the new file would be appended. Can this be done using grep?
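
A minimal sketch of the idea described above, for illustration only. The function name count_unique_words and the use of mktemp for the initially empty file are assumptions, not part of the question; the loop also relies on shell builtins (while, read, printf), so it is not strictly "grep and wc only".

```shell
# Hypothetical helper: count unique words by growing a "seen" file.
count_unique_words() {
    input=$1
    seen=$(mktemp)    # the initially empty file of words seen so far

    # Extract one word per line, then test each word against the seen file.
    grep -oE '\w+' "$input" | while IFS= read -r word; do
        # -x: match the whole line, -F: fixed string, -q: exit status only.
        if ! grep -qxF -e "$word" "$seen"; then
            printf '%s\n' "$word" >> "$seen"
        fi
    done

    # Every line in the seen file is one unique word.
    wc -l < "$seen"
    rm -f "$seen"
}
```

Calling count_unique_words file.txt on the sample input should print 8. Because grep is started once per word, this is likely to be slow on anything but small files.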


Solution

  • Since your grep has -o, I shall assume it also has -P and -z:

    grep -zPo '(?s)(\b\w+\b)(?!.*\b\1\b)' file.txt |
    grep -zc ^
    
    • use -z to make grep treat the entire file as a single "line" (since there should be no nulls in it)
    • use -P to enable Perl-compatible regular expressions (PCRE) which allow lookaround assertions
    • (?s) - tell PCRE that . should also match newlines
    • use a negative lookahead (?! ... ) to find the final occurrence of each word (i.e. a word that is not followed, anywhere later in the file, by another copy of itself)
      • \b\w+\b and \b\1\b exclude partial words
    • we use a lookahead so that the lookahead text is not consumed by the match and can be reused when looking for more final words
    • use -o to output each match on its own "line" (because of -z, nulls are used as the line ending character)
    • the second grep (-zc ^) takes the generated list of unique words and outputs the count of null-terminated "lines"

    This will be very slow on larger files, because the lookahead has to rescan the rest of the file for every word.
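
As a quick check, the pipeline above can be run against the sample input from the question. This is a sketch that assumes GNU grep built with PCRE support (-P); the temporary file and the variable names sample, re and count are illustrative only.

```shell
# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' \
    'one two three four five' \
    'two four one six' \
    'eight three seven five' > "$sample"

# The pattern from the answer: a whole word with no later copy of itself.
re='(?s)(\b\w+\b)(?!.*\b\1\b)'

# Step 1: list the final occurrence of each word. The matches are
# NUL-separated because of -z, so tr converts them to newlines for display.
grep -zPo "$re" "$sample" | tr '\0' '\n'

# Step 2: count the NUL-terminated records instead of printing them.
count=$(grep -zPo "$re" "$sample" | grep -zc ^)
echo "unique word count: $count"

rm -f "$sample"
```

For the sample, step 1 should list two, four, one, six, eight, three, seven and five (each word at its last position in the file), and step 2 should report a count of 8.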