How to count frequency of a word without counting compound words in bash?


I am using this command to count how often a word occurs in a text file with bash:

grep -ow -i "and" "$1" | wc -l

It counts every "and" in the file, including those inside hyphenated compounds such as jerry-and-jeorge. I want to ignore those and count only the standalone occurrences of "and".


Solution

  • With GNU grep, you can use the following command to count occurrences of "and" that have no hyphen immediately before or after them:

    grep -ioP '\b(?<!-)and\b(?!-)' "$1" | wc -l
    

    Details:

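    • -P enables PCRE (Perl-compatible) regex syntax, which is required for the lookarounds
    • -o prints each match on its own line, so wc -l counts matches rather than matching lines; -i makes the match case-insensitive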
    • \b(?<!-)and\b(?!-) matches
      • \b - a word boundary
      • (?<!-) - a negative lookbehind that fails the match if there is a hyphen immediately to the left of the current location
      • and - the literal text and (matched in any letter case because of -i)
      • \b - a word boundary
      • (?!-) - a negative lookahead that fails the match if there is a hyphen immediately to the right of the current location.
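To see what each lookaround contributes, here is a small sketch that runs the pattern with only the lookbehind, only the lookahead, and both. The sample words and variable names are made up for illustration, and it assumes a GNU grep built with PCRE support.

```shell
#!/bin/bash
# Illustrative sample (not from the question): one word per line.
sample=$'and-then\nnow-and\nthis-and-that\nand'

# Lookbehind only: rejects a hyphen on the left, so "and-then" and "and" count.
behind_only=$(grep -ioP '\b(?<!-)and\b' <<< "$sample" | wc -l)

# Lookahead only: rejects a hyphen on the right, so "now-and" and "and" count.
ahead_only=$(grep -ioP '\band\b(?!-)' <<< "$sample" | wc -l)

# Both lookarounds: only the standalone "and" counts.
both=$(grep -ioP '\b(?<!-)and\b(?!-)' <<< "$sample" | wc -l)

echo "$behind_only $ahead_only $both"
# prints: 2 2 1
```

Dropping either lookaround lets compounds with a hyphen on the unchecked side slip through, which is why the command above uses both.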

    A quick demo:

    #!/bin/bash
    s='jerry-and-jeorge, and, aNd, And.'
    grep -ioP '\b(?<!-)and\b(?!-)' <<< "$s" | wc -l
    # => 3 (not 4)
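If you need the same count for other words, the command can be wrapped in a function. This is only a sketch: count_standalone is a hypothetical name, it assumes GNU grep with PCRE support, and it assumes the word begins and ends with a letter, digit, or underscore so that \b behaves as described above.

```shell
#!/bin/bash
# count_standalone WORD [FILE...]
# Hypothetical helper: counts occurrences of WORD that have no hyphen directly
# before or after them. Reads standard input when no FILE is given.
count_standalone() {
    local word=$1
    shift
    # \Q...\E makes PCRE treat the word literally; -h drops file name prefixes
    # so that every output line is exactly one match.
    grep -iohP -- '\b(?<!-)\Q'"$word"'\E\b(?!-)' "$@" | wc -l
}

count_standalone and <<< 'jerry-and-jeorge, and, aNd, And.'
# prints: 3
```

The pattern is assembled from single-quoted pieces so that the ! in the lookarounds is never subject to history expansion if the function is pasted into an interactive shell.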