bash loops awk histogram

Creating histograms in bash


EDIT

I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. However, if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. That is, if that solution tells me that the frequency of 0.45 is 2 and 0.44 is 1 (for my input data), I still have to group those two frequencies into a total of 3 for the range 0.4-0.5.

END EDIT

QUESTION-

I have a long column of data with values between 0 and 1. This will be of the type-

0.34
0.45
0.44
0.12
0.45
0.98
.
.
.

A long column of decimal values with repetitions allowed.

I'm trying to change it into a histogram sort of output such as (for the input shown above)-

0.0-0.1  0
0.1-0.2  1
0.2-0.3  0
0.3-0.4  1 
0.4-0.5  3
0.5-0.6  0
0.6-0.7  0
0.7-0.8  0
0.8-0.9  0
0.9-1.0  1

Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.

I wrote it (badly) as-

for i in $(seq 0 0.1 0.9)
do 
    awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l; 
done

Which basically does a wc -l of the entries it finds in each range.

Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also, please note that the bin size should be a variable, like in my proposed solution.

I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?


Solution

  • Following the same algorithm as my previous answer, I wrote a script in awk which is extremely fast (look at the picture). [image]

    The script is the following:

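    #!/usr/bin/awk -f

    BEGIN {
        bin_width = 0.1    # width of each bin
    }
    {
        # The small offset moves values that sit exactly on a bin edge
        # (such as 1.0) into the bin below.
        bin = int(($1 - 0.0001) / bin_width)
        hist[bin]++        # unset array elements start at 0
    }
    END {
        # "for (h in hist)" visits bins in unspecified order and only
        # sees bins that received at least one value.
        for (h in hist)
            printf " * > %2.2f  ->  %i \n", h * bin_width, hist[h]
    }
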

    bin_width is the width of each bin (channel). To use the script, copy it into a file, make it executable (with chmod +x <namefile>), and run it with ./<namefile> <name_of_data_file>.
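
If you also want the empty bins listed and the rows in a fixed order (a for (h in hist) loop visits keys in unspecified order and only sees bins that received a value), a variant could look like the sketch below. The histogram function name, the w variable, and the half-open [lo, hi) binning with 1.0 folded into the last bin are my own assumptions, not part of the script above. It also assumes that every value lies in [0, 1] and that the bin width divides 1 evenly.

```shell
#!/bin/sh
# histogram FILE [BIN_WIDTH]
# Hypothetical helper: one awk pass over FILE, no shell loop.
# Assumes values in [0, 1] and a bin width that divides 1 evenly.
histogram() {
    awk -v w="${2:-0.1}" '
        BEGIN { nbins = int(1 / w + 0.5) }    # number of bins covering [0, 1]
        NF {                                  # skip blank lines
            b = int($1 / w + 1e-9)            # bin index; epsilon absorbs rounding error
            if (b >= nbins) b = nbins - 1     # fold 1.0 into the last bin
            count[b]++
        }
        END {
            # Walk the bins by index so the order is fixed and
            # empty bins are printed with a count of 0.
            for (b = 0; b < nbins; b++)
                printf "%.2f-%.2f  %d\n", b * w, (b + 1) * w, count[b]
        }
    ' "$1"
}
```

After defining (or sourcing) the function, run it as histogram input for the default width of 0.1, or histogram input 0.05 for narrower bins. Since the bin index is computed directly from each value, the file is read only once regardless of how many bins there are.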