Tags: linux, shell, awk, frequency-analysis, word-frequency

Awk: Word frequency from one text file, how to output into myFile.txt?


Given a .txt file with space-separated words such as:

But where is Esope the holly Bastard
But where is

And the following pipeline:

cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}'

I get the following output in my console:

1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where

How do I get this printed into myFile.txt? I actually have 300,000 lines and nearly 2 million words, so it would be better to write the result to a file.


EDIT: The answer I used (by @Sudo_O):

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt

Solution

  • Your pipeline isn't very efficient; you should do the whole thing in awk instead:

    awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file > myfile
    

    If you want the output in sorted order:

    awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort > myfile
    

    The actual output given by your pipeline is:

    $ tr ' ' '\n' < file | sort | uniq -c | awk '{print $2"@"$1}'
    Bastard@1
    But@2
    Esope@1
    holly@1
    is@2
    the@1
    where@2
    

    Note: using cat is useless here; we can just redirect the input with <. The awk script doesn't make sense either: it just swaps the word and its frequency and separates them with an @. If we drop the awk script, the output is closer to the desired output (notice, however, the leading spaces, and that it is sorted by word rather than by count):

    $ tr ' ' '\n' < file | sort | uniq -c 
          1 Bastard
          2 But
          1 Esope
          1 holly
          2 is
          1 the
          2 where
    

    We could sort again and remove the leading spaces with sed:

    $ tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^\s*//'
    1 Bastard
    1 Esope
    1 holly
    1 the
    2 But
    2 is
    2 where
    

    But, as I mentioned at the start, let awk handle it:

    $ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort
    1 Bastard
    1 Esope
    1 holly
    1 the
    2 But
    2 is
    2 where
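
A possible variation on the awk approach above, offered as a sketch rather than as the original answer's method. A regular expression in RS is an extension (gawk and mawk support it, POSIX awk does not guarantee it), and a plain sort compares the counts as text, so 10 would land before 2 on a large file. The version below loops over the fields instead and sorts numerically. The helper name word_freq and the scratch file names are made up for illustration.

```shell
#!/bin/sh
# word_freq is a hypothetical helper name, not from the original answer.
# It prints "count word" lines, lowest count first, for the given files
# (or for standard input when no file is given).
#
# Looping over the fields lets awk's default whitespace splitting handle
# spaces, tabs, repeated spaces, and blank lines without a regex RS.
# sort -k1,1n compares the counts as numbers; -k2,2 breaks ties by word.
# LC_ALL=C makes the tie-breaking order byte-wise and reproducible.
word_freq() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print count[w], w }' "$@" |
        LC_ALL=C sort -k1,1n -k2,2
}

# Demo on the sample text from the question, using a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'But where is Esope the holly Bastard' 'But where is' > "$tmpdir/file.txt"
word_freq "$tmpdir/file.txt" > "$tmpdir/myfile.txt"
cat "$tmpdir/myfile.txt"
rm -rf "$tmpdir"
```

On the two sample lines this prints the same seven lines as the sorted output above. For the real data, something like word_freq /pathway/to/your/file.txt > myFile.txt should write the counts straight to the file.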