arrays perl loops interpolation subroutine

Linear interpolation of missing data from table of values

Here is an excerpt from a data file, my starting point:

Marker      Distance_1  Distance_2  ID
.
.
.
30          13387412    34.80391242 seq-SN_FIRST
31          13387444    34.80391444 seq-SN_Second
31.1             
31.2             
32          13387555    39.80391    seq-SN_Third
.
.
.

This is a tab-delimited file of multiple rows of four elements each. First row is a header. After that, numerous rows of data. The vertical dots are not actually in the real file, but they are here just to represent that data similar to the actual rows shown occur before and after the example fo rows explicitly shown.

Some of the data rows are "full", that is, all four cell entries contain something. Other rows are "blank", with only a first actual entry but followed by 3 tab delimited single spaces. Those white spaces in the blank rows need to be "filled in." The filling in will be done by linear interpolation, using the corresponding cell entries of the immediately preceding and immediately succeeding rows. For example, missing Distance_1 values, in column 2, will be interpolated using the value 13387444 of the preceding row and the value 13387555 of the succeeding row. Similarly so for the column 3 values. Column 4 values are just ignored here.

The first goal of the script is to identify the blocks of data that require filling and their flanking "full" lines. Blank lines will contain 3 tabbed single spaces and will be ID'd that way. Once found, the consecutive sets of blank lines plus flanking full lines are sent to subroutine for interpolation.

#!/usr/bin/perl
use strict;
use warnings;

die "usage: [ map positions file post SAS ]\n\n" unless @ARGV == 1;

my @file = ();

while (my $line = <$ARGV[0]>) {
  chomp $line;
  push(@file, $line);
}

my @consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my @filled = ();    # my final array, hopefully correctly filled in

#####
# search for consecutive set of lines in @file
#####

for (my $i = 0; $i < $#file; $i++) {          #  $#file returns the index of the last element in @file

  if ($file[$i] !~ /(?:\t\s){3}/) {           # if not a blank line
                                              # but a "full line"
    push(@filled, $file[$i]);                 # push the header and full lines, until...
  }

  elsif ($file[$i] =~ /(?:\t\s){3}/) {        # ...a blank line is found

    push (@consecutive_lines, $file[$i - 1]); # push preceding full line into array

    while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) {  # keep pushing lines, so long as they are blank
                                                          # or end of file
      push(@consecutive_lines, $file[$i++]);
    }

    push(@consecutive_lines, $file[$i]) ;     # else we reach next full line, so push it into array

    my @fillme = linearInterpolation(@consecutive_lines); # send set of lines out for filling

    push(@filled, @fillme);                   # push filled in set of lines into the final array

    @consecutive_lines = ();                  # reset or undef array @consecutive_lines for next cycle

  }    # end of elsif

}    # end of for loop

Thanks to user @Kenosis for lots help with the above, which I have since modified (hopefully not mangled).

Next is the linear interpolation. It is here where I am trying to link the first phase of the script to the second phase. And it is not working well so far.

My goal is to hand off the array @incoming to the subroutine. This array is then split up, so that the actual cell entries are "visible" and can be indexed by the array, and so called upon. I have been trying to figure out how to do this for the column 2 values for Distance_1 first. I feel like this script gets close and it begins to stray at the point just after the interpolated values are calculated.

#####
# subroutine linear interpolation
#####

sub linearInterpolation {
  my @incoming = @_;    # array of consecutive set of lines

  my @splitup;                  # declare new array, will be a "split up" version of @incoming
  my ($A, $B, $C, $D, $E);      # variables for linear interpolation
  my @fillme;                   # declaring the "emtpy" array to be filled in
  my @dist_1_fills;             # array of interpolated values for dist_1

  for (my $i = 0;
    $i < scalar @incoming; $i++)     # loop to split up lines of @incoming
  {                                  # into indexed cell entries
    chomp $incoming[$i];             # and make new array of them
    my @entries = split('\t', $incoming[$i]);
    push(@splitup, @entries);
  }

  $A = $splitup[1];                   # cell entry in column 2 of preceding full line
  $B = $splitup[-3];                  # cell entry in column 2 of succeeding full line

  $C = $splitup[2];                   # cell entry in column 3 of preceding full line
  $D = $splitup[-2];                  # cell entry in column 3 of succeeding full line
  $E = scalar @incoming - 1;          # equals number of lines in the set minus 1

  for (my $i = 1; $i < $E; $i++) {    # need to start finding appropriate
                                      # number interpolated values, given number of
    my @dist_1_fills =
        interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines

    for ($i = 0; $i < scalar @splitup; $i += 4) {
      push(@fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
                                      # fourth column values will be ignored or filled with --.
                                      # "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
    }
  }
}

#########

sub interpvalues {                  # subroutine to find interpolated values
  my ($A, $B, $E, $i) = @_;
  my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
  return $dist_1_answers;
}

The code gets confused in the second part that deals with finding the interpolated values and sending them back to the first part of the code to finally fill in the data set. I think specifically my biggest (though probably not my only) problem is trying to fill in the blank lines with the proper values after they have been calculated in the second subroutine.

Any hints and clues are greatly appreciated!

Solution

This program will do what you need. It expects the inout file name as a parameter on the command line.

use strict;
use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

  # Pass hrough headers and junk
  unless ($fields[0] and $fields[0] =~ /\d/) {
    print "$_\n";
    next;
  }

  # Save x-value for records without a y-value
  if ($fields[1] !~ /\d/) {
    push @needed, $fields[0];
    next;
  }

  # We have a filled-out row. Calculate any intermediate missing ones
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
    }
    my ($a1, $b1) = solve_linear(@saved[0,1], @fields[0,1]);
    my ($a2, $b2) = solve_linear(@saved[0,2], @fields[0,2]);
    while (@needed) {
      my $x = shift @needed;
      my $y1 = $a1 * $x + $b1;
      my $y2 = $a2 * $x + $b2;
      print join("\t", $x, $y1, $y2), "\n";
    }
  }

  print "$_\n";
  @saved = @fields;
}

sub solve_linear {
  my ($x0, $y0, $x1, $y1) = @_;
  my ($dx, $dy) = ($x1 - $x0, $y1 - $y0);
  my $aa = $dy / $dx;
  my $bb = ($y0 * $dx - $x0 * $dy)  / $dx;
  return ($aa, $bb);
}

output

Marker  Distance_1  Distance_2  ID
.
.
.
30  13387412  34.80391242 seq-SN_FIRST
31  13387444  34.80391444 seq-SN_Second
31.1  13387455.1  35.303913996  --
31.2  13387466.2  35.803913552  --
32  13387555  39.80391  seq-SN_Third
.
.
.
Tool completed successfully