Use Perl to loop over files and calculate the mean of each column


I'm new to Perl and I would like to learn how to use loops with it. I have multiple directories, and each directory contains a file named data.txt. The data.txt file has several columns. I need a loop that calculates the mean of each column for each data.txt file.

I have this command that does the job for one single file:

perl -lane 'for $c (0..$#F){$t[$c] += $F[$c]}; END{for $c (0..$#t){print $t[$c]/$.}}' data.txt

I wish to write a script where I visit every directory, read every file that's in it and apply the command.

Example:

data.txt:

-79.2335  0.4041    71.9143  1.3392    -0.7687  0.0212    -8.0934  1.1425 
-74.4163  0.6188    60.0468  1.8782    -0.8540  0.0305    -15.1574  1.4755 
-74.4118  0.6046    62.1771  1.8058    -0.9143  0.0304    -13.2272  1.3408 
-74.3895  0.5935    66.4264  1.6532    -0.8509  0.0223    -8.8819  1.2670 
-74.3192  0.5589    67.1619  1.4763    -0.9656  0.0274    -8.1090  1.1450 
-73.8272  0.6274    61.6632  1.7554    -0.8840  0.0256    -13.0435  1.3641 
-73.3525  0.5856    60.6622  1.7872    -0.8489  0.0222    -13.5014  1.3947 
-73.3206  0.6275    53.3129  2.2961    -0.7962  0.0337    -20.8195  1.8538 
-72.5461  0.5212    62.0359  1.4267    -0.9378  0.0240    -11.4203  1.0295 
-72.3058  0.7225    56.2304  2.1480    -0.7539  0.0293    -16.7954  1.5952 
-72.1180  0.6460    51.7954  2.0845    -0.8479  0.0265    -21.0355  1.4630 
-72.0690  0.4905    58.8372  1.3918    -0.9866  0.0333    -14.1823  1.1045 
-71.7949  0.5799    55.6006  1.9189    -0.8541  0.0313    -17.0112  1.4530 
-71.3074  0.4482    45.9271  2.1135    -0.6637  0.0354    -25.9309  1.8761 
-71.2542  0.4879    57.3196  1.5406    -0.9523  0.0281    -14.9113  1.2705 
-71.2421  0.5480    47.9065  2.2445    -0.8107  0.0352    -24.2489  1.7997 
-70.3751  0.5278    49.5489  1.8395    -0.8208  0.0371    -21.5205  1.4994 
-69.2181  0.4823    54.8234  1.0645    -0.9897  0.0246    -15.3506  0.9369 
-69.0456  0.4650    40.3798  2.0117    -0.6476  0.0360    -29.3403  1.7013 
-66.5402  0.5006    42.1805  1.7872    -0.7692  0.0356    -25.1431  1.4522

Output:

-72.354355   0.552015   56.297505   1.77814   -0.845845   0.029485   -16.88618   1.408235

Solution

  • Your comments imply a simple directory structure: one main directory called mean with hundreds of subdirectories, each containing a file called data.txt. In that case the list of files can be compiled easily with a glob, and the math is fairly straightforward. Here is one way to do it.

    I would not use $. to calculate the average: it counts every line read, including blank ones, so a stray blank line in a file would skew the result. Instead, keep a count variable for each file and count only the non-blank lines.

    use strict;
    use warnings;
    use feature 'say';
    
    for my $data (glob "mean/*/data.txt") {    # get list of files
        open my $fh, '<', $data or die "Cannot open file '$data': $!";
        my @sum;
        my $count = 0;
        while (<$fh>) {
            $count++ if /\S/;                  # count non-blank lines
            my @fields = split;                # split on whitespace
            for (0 .. $#fields) {
                $sum[$_] += $fields[$_];       # sum columns
            }
        }
        say $data;                             # file name
        say join "\t",            # 3. ...join them with tab and print
            map $_/$count,        # 2. ...for each sum, divide by count
            @sum;                 # 1. Take list of sums...
    }
    

    Output:

    mean/A/data.txt
    -72.354355      0.552015        56.297505       1.77814 -0.845845       0.029485        -16.88618       1.408235
    mean/B/data.txt
    -142.354355     0.552015        56.297505       1.77814 -0.845845       0.029485        -16.88618       1.408235
    mean/C/data.txt
    -72.354355      17.152015       56.297505       1.77814 -0.845845       0.029485        -16.88618       1.408235
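
To illustrate why the answer counts non-blank lines instead of relying on $., here is a minimal sketch. The subroutine name column_means and the in-memory sample data are assumptions made for this illustration and are not part of the answer above. It reads two data rows followed by a blank line, then shows that $. reports three lines while only two rows contribute to the sums.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): returns the per-column
# means of whitespace-separated numeric data read from a filehandle.
sub column_means {
    my ($fh) = @_;
    my @sum;
    my $count = 0;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;                   # skip blank lines
        my @fields = split ' ', $line;               # split on whitespace
        $sum[$_] += $fields[$_] for 0 .. $#fields;   # sum columns
        $count++;                                    # count data rows only
    }
    return unless $count;                            # no data: empty list
    return map { $_ / $count } @sum;
}

# Made-up sample: two data rows followed by one blank line.
my $text = "1 10\n3 30\n\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @means      = column_means($fh);
my $lines_read = $.;                # $. also counted the blank line
close $fh;

print join("\t", @means), "\n";             # means over the 2 data rows
print "lines read (\$.): $lines_read\n";    # 3, one more than the row count
```

With this input, dividing the sums by $. would give 4/3 and 40/3 instead of 2 and 20. The same subroutine could be called inside the glob loop from the answer, passing each opened data.txt filehandle.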