Given a file containing text, I would like to count the occurrences of a string "ABCDXYZ"?
$ cat file.txt
foo
bar
foo
bar
baz
baz
bug
bat
foo
bar
so
on
and
so
on
foo
Let's count foo!
Many times I see people using the following to count words:
$ grep -o 'foo' file.txt | wc -l
Here are a few examples: 1, 2, 3 and even this YouTube video.
This is really a bad way, for a few reasons:
If you read man grep for either BSD grep (NetBSD, OpenBSD, FreeBSD) or GNU grep, you will find the option -c.
The NetBSD man page describes this option very clearly:
-c, --count Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines.
So you can use just one command:
$ grep foo -c file.txt
Not only could you, you should, and you'll save yourself a lot of searching time by reading man pages and understanding the tools you have at hand!
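One detail of the man page text is worth seeing in action: -c counts matching lines, while grep -o prints each match on its own line, so the pipe counts individual matches. The two agree on file.txt because no line there contains foo more than once. The following is a minimal sketch of the difference; the sample content is made up for illustration and is not the file used in this post.

```shell
# Throwaway sample file (hypothetical content, not the post's file.txt).
sample=$(mktemp)
printf '%s\n' 'foo' 'bar' 'foo foo' 'baz' > "$sample"

# -c counts matching LINES: "foo" and "foo foo" are two lines.
lines=$(grep -c 'foo' "$sample")

# -o prints every match on its own line, so wc -l counts OCCURRENCES.
# tr strips the padding some wc implementations add around the number.
occurrences=$(grep -o 'foo' "$sample" | wc -l | tr -d ' ')

echo "matching lines: $lines"
echo "occurrences:    $occurrences"

rm -f "$sample"
```

With this sample the first number is 2 and the second is 3. When every line holds at most one match, as in file.txt, both commands print the same number and -c is the simpler choice.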
Speed bonus
You can also make your greps faster, because pipes are quite expensive.
On the short file shown above, the pipe is 2 times slower compared to using the option -c:
$ time grep foo -c file.txt
4
real 0m0.001s
user 0m0.000s
sys 0m0.001s
$ time grep -o 'foo' file.txt | wc -l
4
real 0m0.002s
user 0m0.000s
sys 0m0.003s
On large files this can be even more significant. Here I appended my file to a larger file tens of thousands of times (interrupting the loop with ^C):
$ for i in `seq 1 300000`; do cat file.txt >> largefile.txt; done
^C
$ wc -l largefile.txt
1111744 largefile.txt
Now here is how slow using a pipe is:
$ time grep -o foo largefile.txt | wc -l
277936
real 0m0.216s
user 0m0.214s
sys 0m0.010s
And here is how fast using only grep is:
$ time grep -c foo largefile.txt
277936
real 0m0.032s
user 0m0.028s
sys 0m0.004s
These benchmarks were done on a machine with a Core i5 and plenty of RAM; the pipe would have been significantly slower on an embedded device with little RAM and CPU resources.
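To repeat the comparison on your own machine, it helps to build the large file with a fixed number of copies instead of interrupting a loop by hand, and to check that both commands agree before timing them. The following is a sketch under assumptions: the temporary directory, the variable names and the copy count of 1000 are all illustrative choices, not what was used for the numbers above.

```shell
# Work in a throwaway directory (paths and names are illustrative).
workdir=$(mktemp -d)
small="$workdir/file.txt"
large="$workdir/largefile.txt"

# Recreate the 16-line sample file shown at the top of the post.
printf '%s\n' foo bar foo bar baz baz bug bat foo bar so on and so on foo > "$small"

# Append the small file a fixed number of times; raise 1000 for a heavier test.
i=0
while [ "$i" -lt 1000 ]; do
    cat "$small"
    i=$((i + 1))
done > "$large"

# Both approaches should report the same count on this input,
# because no line contains foo more than once.
count_c=$(grep -c foo "$large")
count_pipe=$(grep -o foo "$large" | wc -l | tr -d ' ')
echo "grep -c:         $count_c"
echo "grep -o | wc -l: $count_pipe"

rm -rf "$workdir"
```

To measure, keep the large file around (drop the final rm) and prefix each of the two commands with time in an interactive shell, as in the transcripts above.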
To sum up, don't use pipes where you don't need them. Often UNIX tools have overlapping functionality. Know your tools, and read how to use them!
To count the occurrences of a word in a file (one per matching line), it's enough to use:
$ grep -c <word> <filename>
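When using this in a script, keep in mind that grep exits with status 1 when nothing matches, even though -c still prints 0. The following is a minimal sketch of a wrapper that treats "no matches" as success; count_lines is a hypothetical helper name introduced here for illustration, not a standard utility.

```shell
# count_lines PATTERN FILE
# Hypothetical helper: prints the number of lines in FILE matching PATTERN.
count_lines() {
    rc=0
    # "--" ends option parsing, so a pattern starting with "-" is not read as a flag.
    grep -c -- "$1" "$2" || rc=$?
    # grep exit status: 0 = matches found, 1 = none (0 is still printed), >1 = error.
    [ "$rc" -le 1 ]
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf '%s\n' foo bar foo > "$demo"
found=$(count_lines foo "$demo")
missing=$(count_lines qux "$demo")
echo "foo: $found, qux: $missing"
rm -f "$demo"
```

Here found is 2 and missing is 0, and the helper only reports failure when grep itself hits an error such as an unreadable file.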