linux bash shell grep bsd

How to count the times a word appears in a file using a shell?


Given a file containing text, I would like to count the occurrences of a string, e.g. "ABCDXYZ".

$ cat file.txt 
foo
bar 
foo
bar
baz
baz
bug
bat
foo
bar
so 
on 
and
so 
on
foo

Let's count foo!


Solution

  • Many times I see people using the following to count words:

    $ grep -o 'foo' file.txt | wc -l
    

    Here are a few examples: 1, 2, 3 and even this YouTube video.
    This is really a bad way to do it, for a few reasons:

    1. It shows you never read man grep, whether for BSD grep (NetBSD, OpenBSD, FreeBSD) or GNU grep.
    2. All of these implementations offer an option to count things, -c. The NetBSD man page describes this option very clearly:
       -c, --count
              Suppress  normal output; instead print a count of matching lines
              for each input file.  With the -v,  --invert-match  option  (see
              below), count non-matching lines.
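
To see what that excerpt means in practice, here is a minimal sketch. It recreates the sample from the question in a temporary file (the mktemp path and the variable names are my own assumptions, not part of the question) and runs -c with and without -v:

```shell
# Recreate the 16-line sample from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' foo bar foo bar baz baz bug bat foo bar so on and so on foo > "$sample"

# -c prints the number of lines that match the pattern.
matching=$(grep -c foo "$sample")

# -v -c prints the number of lines that do NOT match the pattern.
non_matching=$(grep -v -c foo "$sample")

echo "matching: $matching, non-matching: $non_matching"
rm -f "$sample"
```

The two numbers add up to the 16 lines of the file, since every line either matches or does not.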
    

    So you can use just one command:

     $ grep foo -c file.txt 
    

    Not only could you, you should, and you'll save yourself a lot of time searching by reading the man pages and understanding the tools you have at hand!

    Speed bonus: you can also make your greps faster, because pipes are quite expensive. On the short file shown above, the pipe is 2 times slower than using the option -c:

    $ time grep foo -c file.txt 
    4
    
    real    0m0.001s
    user    0m0.000s
    sys     0m0.001s
    $ time grep -o 'foo' file.txt | wc -l
    4
    
    real    0m0.002s
    user    0m0.000s
    sys     0m0.003s
    

    On large files this can be even more significant. Here I appended my file to a larger file in a loop set for 300,000 iterations, which I interrupted with ^C:

    $ for i in `seq 1 300000`; do cat file.txt >> largefile.txt; done
    ^C
    $ wc -l largefile.txt 
    1111744 largefile.txt
    

    Now here is how slow the pipe is:

    $ time grep -o foo largefile.txt | wc -l
    277936
    
    real    0m0.216s
    user    0m0.214s
    sys     0m0.010s
    

    And here is how fast it is using only grep:

     $ time grep -c foo largefile.txt 
    277936
    
    real    0m0.032s
    user    0m0.028s
    sys     0m0.004s
    

    These benchmarks were done on a machine with a Core i5 and plenty of RAM; it would have been significantly slower on an embedded device with little RAM and CPU resources.
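
If you want to repeat a comparison like this on your own machine, the following is a rough sketch rather than the exact procedure used above. The temporary paths, the variable names and the 1000-copy size are all assumptions chosen for illustration:

```shell
# Hypothetical temporary paths; the transcript above used file.txt and largefile.txt.
sample=$(mktemp)
large=$(mktemp)
printf '%s\n' foo bar foo bar baz baz bug bat foo bar so on and so on foo > "$sample"

# Append the sample 1000 times. Redirecting once, outside the loop,
# avoids reopening the output file on every iteration.
i=0
while [ "$i" -lt 1000 ]; do
    cat "$sample"
    i=$((i + 1))
done > "$large"

# Both approaches report the same number here, because the sample
# has at most one "foo" per line.
count_c=$(grep -c foo "$large")
count_pipe=$(grep -o foo "$large" | wc -l | tr -d ' ')
total_lines=$(wc -l < "$large" | tr -d ' ')

echo "lines: $total_lines, grep -c: $count_c, grep -o | wc -l: $count_pipe"
rm -f "$sample" "$large"
```

Prefix either grep command with time to compare them; the timings will differ from the ones shown above depending on the machine and the grep implementation.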

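    To sum up, don't use pipes where you don't need them. UNIX tools often have overlapping functionality. Know your tools, and read how to use them!

    To count the occurrences of a word in a file, it's enough to use the following (note that -c counts matching lines, so this equals the number of occurrences when the word appears at most once per line, as in file.txt):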

    $ grep -c <word> <filename>
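
One caveat is worth a small illustration. As the man page excerpt says, -c counts matching lines, not individual matches, so it only agrees with grep -o | wc -l when the word appears at most once per line. A minimal sketch with a made-up two-line input (the file contents and variable names are hypothetical):

```shell
# Two-line demo input: "foo" twice on the first line, "food" on the second.
demo=$(mktemp)
printf '%s\n' 'foo foo bar' 'food baz' > "$demo"

# Number of lines containing "foo" anywhere (this also matches "food").
lines=$(grep -c foo "$demo")

# Number of individual matches of "foo", one per output line of -o.
# tr strips the padding that some wc implementations add.
matches=$(grep -o foo "$demo" | wc -l | tr -d ' ')

# Number of lines containing "foo" as a whole word (-w).
word_lines=$(grep -c -w foo "$demo")

echo "lines: $lines, matches: $matches, whole-word lines: $word_lines"
rm -f "$demo"
```

On this input the three numbers differ (2, 3 and 1), so which command is right depends on whether you want lines, matches, or whole words.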