Search code examples
perlperl-data-structures

Using Perl hashes to handle tab-delimited files


I have two files:

  • file_1 has three columns (Marker(SNP), Chromosome, and position)
  • file_2 has three columns (Chromosome, peak_start, and peak_end).

All columns are numeric except for the SNP column.

The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.

I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).

I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.

Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".

#!usr/bin/perl

use strict;
use warnings;

my (%peaks, %X81_05);
my @array;

# Open file or die

unless (open (FIRST_SAMPLE, "X81_05.txt")) {
    die "Could not open X81_05.txt";
}

# Split the tab-delimited file into respective fields

while (<FIRST_SAMPLE>) {

    chomp $_;
    next if (m/Chromosome/); # Skip the header

    @array = split("\t", $_);
    ($chr1, $pos, $sample) = @array;

    $X81_05{'$array[0]'} = (
        'position' =>'$array[1]'
    )
}

close (FIRST_SAMPLE);

# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
    die "could not open peaks.txt";
}

my ($chr, $peak_start, $peak_end);

while (<PEAKS>) {
    chomp $_;

    next  if (m/Chromosome/); # Skip header
    ($chr, $peak_start, $peak_end) = split(/\t/);
    $peaks{$chr}{'peak_start'} = $peak_start;
    $peaks{$chr}{'peak_end'}  = $peak_end;
}

close (PEAKS);

for my $chr1 (keys %X81_05) {
    my $val = $X81_05{$chr1}{'position'};

    for my $chr (keys %peaks) {
        my $min = $peaks{$chr}{'peak_start'};

        my $max = $peaks{$chr}{'peak_end'};

        if (($val > $min) and ($val < $max)) {
            #print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
        }
        else {
                #print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
        }
    }
}

More awesome code:

  1. https://i.sstatic.net/fzwRQ.png
  2. https://i.sstatic.net/2ryyI.png

Solution

  • The points raised by @David are good; try to incorporate those in your programs. (I have borrowed most of the code from @David's post.)

    One thing I didn't understand is that why load both peak values and position in hash, as loading one would suffice. As each chromosome has more than one record, use HoA. My solution is based on that. You might need to change the cols and their positions.

    use strict;
    use warnings;
    
    our $Sep = "\t";
    open (my $peak_fh, "<", "data/file2");
    my %chromosome_hash;
    
    while (my $line = <$peak_fh>) {
        chomp $line;
        next if $line =~ /Chromosome/; #Skip Header
        my ($chromosome) = (split($Sep, $line))[0];
        push @{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
    }
    close $peak_fh;
    
    open (my $position_fh, "<", "data/file1");
    
    while (my $line = <$position_fh>) {
        chomp $line;
        my ($chromosome, $snp, $position) = split ($Sep, $line);
        next unless exists $chromosome_hash{$chromosome};
    
        foreach my $peak_line (@{$chromosome_hash{$chromosome}}) {
            my ($start,$end) = (split($Sep, $line))[1,2];
    
            if ($position >= $start and $position <= $end) {
                print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
            }
            else {
                print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
            }
        }
    }
    close $position_fh;