Recursive subdirectory grep


I'm trying to grep the string Distance: from each pairsAngles.txt file within over 2,000 subdirectories; the names of the subdirectories are obtained from a CSV file.

Each subdirectory contains one pairsAngles.txt, within which there is only one line that contains Distance: . However, my current foreach and while loops give me eight Distance values for each subdirectory.

In addition, each subsequent subdirectory gets all the distances from the previous subdirectories.

Like this:

A text version of the picture (row #4, column #2 has 4*8 = 32 entries of Distance)

All the pairsAngles.txt files are in subdirectories, and each subdirectory has a unique name.

I first read all the subdirectory names from the CSV file and split them into an array, then retrieve every element from that array to get into a subdirectory so that I can grep.

clst1.csv has only one column, which is the subfolder names:

oligomerAngle-1h2s-000_001-0003_0025_A-0034_0056_A-B004A012
oligomerAngle-5ax0-000_001-0010_0036_A-0042_0064_A-B004A013
oligomerAngle-4qnd-004_005-0046_0065_A-0069_0091_A-A004B006
oligomerAngle-2j8c-003_004-0171_0196_L-0226_0250_L-B011A001
oligomerAngle-2j8c-003_004-0171_0196_L-0226_0250_L-B011A001

Distance: 7.98675 
Distance: 7.98675 
Distance: 7.98675 
Distance: 7.98675 
Distance: 7.98675 
Distance: 7.98675 
Distance: 7.98675 
Distance: 7.98675
Distance: 7.95099 
Distance: 7.95099 
Distance: 7.95099 
Distance: 7.95099 
Distance: 7.95099 
Distance: 7.95099 
Distance: 7.95099
Distance: 7.95099
Distance: 7.87554 
Distance: 7.87554 
Distance: 7.87554 
Distance: 7.87554 
Distance: 7.87554 
Distance: 7.87554
Distance: 7.87554 
Distance: 7.87554 
Distance: 7.69417 
Distance: 7.69417 
Distance: 7.69417 
Distance: 7.69417 
Distance: 7.69417
Distance: 7.69417 
Distance: 7.69417 
Distance: 7.69417

But the actual value should just be "Distance: 7.69417". I'm not sure what went wrong. Here's the code:

use File::Find;
use Text::CSV_XS;

my @pairs  = ();
my @result = ();
my $in;
my $out;
my $c1;
my $dist = "";
my $dir  = "/home/oligomerAngle";

my $cluster = "clst1.csv";
open( $in, $cluster ) || die "cannot open \"$cluster\": $!";

my $cU = "clst1Updated.csv";
open( $out, ">$cU" ) || die "cannot open '$cU' $!";

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );

while ( $c1 = <$in> ) {
    chomp $c1;
    @pairs = split( ' ', $c1 );

    foreach my $pair (@pairs) {

        find( \&Matches, "$dir/$c1" );

        sub Matches {
            open( my $subdir, "pairsAngles.txt" ) or die "$!";

            while ( $dist = <$subdir> ) {

                if ( $dist =~ m/Distance:/ ) {
                    push( @result, "$dist" );
                }
            }
        }

        chdir "..";
        $csv->say( $out, [ "@pairs", "@result" ] );
    }
}

if ( not $csv->eof ) {
    $csv->error_diag();
}

close $out or die "$!";

Solution

  • Given the clarifications, the posted code seems to overcomplicate matters.

    The code below reads each subdirectory name from the $cluster file (via <$in>) and builds the file name from $dir and that name. It then iterates over the lines of the file to find the one that matches the pattern. Once that line is found, we print the result and move on to the next file (in the next subdirectory).

    Note that we don't really need @result unless more processing happens later.

    # Iterate over subdirectories that each have the file
    while ( $c1 = <$in> ) {
        chomp $c1;
    
        # Build the full file name in this subdirectory, open the file
        my $filename = "$dir/$c1/pairsAngles.txt";
        open my $fh_in, '<', $filename  or die "Cannot open '$filename': $!";
    
        # Iterate over lines in the file to find the pattern
        while ( my $line = <$fh_in> ) { 
            if ( $line =~ m/Distance:/ ) { 
                # Found our result, print output
                chomp($line);
                $csv->say($out, [$c1, $line]);
                push @result, $line;
                # No need to continue if we know there is exactly one
                last; 
            }   
        }   
    }
    # Do something else with @result if needed ...
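
As for why the original printed repeated values: find() calls its callback once for every entry it visits (the directory itself and each file in it), and the Matches callback in the question ignores $_ and opens pairsAngles.txt every time. In addition, @result is declared outside the loop and never cleared, so every row also carries the values from earlier subdirectories. The following is a minimal sketch of the first effect only. The temporary directory, the extra file names (other1.txt, other2.txt) and the helper first_distance are made up for illustration and are not part of the original setup.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical stand-in for one subdirectory: pairsAngles.txt plus two other files
my $subdir = tempdir( CLEANUP => 1 );
for my $name (qw(pairsAngles.txt other1.txt other2.txt)) {
    open my $fh, '>', "$subdir/$name" or die "Cannot write '$subdir/$name': $!";
    print {$fh} "Distance: 7.69417\n" if $name eq 'pairsAngles.txt';
    close $fh or die "Cannot close '$subdir/$name': $!";
}

# Return the first line containing "Distance:" (chomped), or nothing if absent
sub first_distance {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Cannot open '$filename': $!";
    while ( my $line = <$fh> ) {
        if ( $line =~ /Distance:/ ) {
            chomp $line;
            close $fh;
            return $line;
        }
    }
    close $fh;
    return;
}

# As in the question: the callback ignores $_ and always opens pairsAngles.txt,
# so it pushes one copy for every entry that find() visits
my @via_find;
find( sub { push @via_find, first_distance('pairsAngles.txt') }, $subdir );

# As in the answer: open the one known path directly
my @direct = ( first_distance("$subdir/pairsAngles.txt") );

printf "find():  %d copies\n", scalar @via_find;
printf "direct:  %d copy\n",   scalar @direct;
```

With three files in the directory, the callback runs four times (once for the directory itself, once per file), so find() collects four copies where the direct open collects one. Eight copies per subdirectory in the question would be consistent with seven files in each, though that is a guess about the data.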