
Awk counting occurrences strange behaviour


I need to count the occurrences of the elements in the second column across a large number of files. The script I'm using is this:

{
    el[$2]++
}
END {
    for (i in el) {
        print i, el[i] >> "rank.txt"
    }
}
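
For reference, on a small made-up input (the file name sample.txt and its contents are just an illustration) the script behaves like this:

$ cat sample.txt
id1 apple
id2 banana
id3 apple
$ awk -f script.awk sample.txt
$ cat rank.txt
apple 2
banana 1

(The order of the lines in rank.txt may vary, since for (i in el) iterates over the array in an unspecified order.)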

To run it over a large number of files I'm using find | xargs like this:

find . -name "*.txt" | xargs awk -f script.awk

The problem is that if I count the number of lines of the output file rank.txt (with wc -l rank.txt), the number I get (for example 7600) is bigger than the number of unique elements in the second column (for example 7300), which I obtain with:

find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq | wc -l

In fact, running:

awk '{print $1}' rank.txt | sort | uniq | wc -l

I obtain the right number of elements (following the example, I get 7300). So it means that the elements in the first column of the output file are not unique. But this shouldn't happen!


Solution

  • This is probably a combination of the fact that the input files (*.txt) contain non-unique elements and the way xargs works. Remember that when there is a large number of files, xargs invokes the command repeatedly, each time with a different subset of the arguments. This means that in the first example, if there are enough files, not all of them are processed in a single awk run; each run appends its own set of "unique" elements to rank.txt, so the same element can appear more than once, which results in a higher line count. A demonstration of the splitting is sketched after the suggested fix below.

    You could try this:

    find . -name "*.txt" | xargs cat | awk -f script.awk
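
    To see the splitting effect directly, you can force xargs to batch the arguments with -n (the file names below are just placeholders); each batch becomes a separate awk invocation, each appending its own "unique" keys to rank.txt:

    # xargs builds two command lines here, so script.awk would run twice
    printf '%s\n' a.txt b.txt c.txt d.txt | xargs -n 2 echo awk -f script.awk
    # awk -f script.awk a.txt b.txt
    # awk -f script.awk c.txt d.txt

    If you prefer to avoid xargs entirely, a similar POSIX alternative is to let find do the batching and pipe everything into a single awk:

    find . -name "*.txt" -exec cat {} + | awk -f script.awk

    Either way, one awk process sees all the input, so each element is printed to rank.txt only once.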