Here is an excerpt from a data file, my starting point:
Marker Distance_1 Distance_2 ID
.
.
.
30 13387412 34.80391242 seq-SN_FIRST
31 13387444 34.80391444 seq-SN_Second
31.1
31.2
32 13387555 39.80391 seq-SN_Third
.
.
.
This is a tab-delimited file of multiple rows of four elements each. First row is a header. After that, numerous rows of data. The vertical dots are not actually in the real file, but they are here just to represent that data similar to the actual rows shown occur before and after the example fo rows explicitly shown.
Some of the data rows are "full", that is, all four cell entries contain something. Other rows are "blank", with only a first actual entry but followed by 3 tab delimited single spaces. Those white spaces in the blank rows need to be "filled in." The filling in will be done by linear interpolation, using the corresponding cell entries of the immediately preceding and immediately succeeding rows. For example, missing Distance_1 values
, in column 2, will be interpolated using the value 13387444
of the preceding row and the value 13387555
of the succeeding row. Similarly so for the column 3 values. Column 4 values are just ignored here.
The first goal of the script is to identify the blocks of data that require filling and their flanking "full" lines. Blank lines will contain 3 tabbed single spaces and will be ID'd that way. Once found, the consecutive sets of blank lines plus flanking full lines are sent to subroutine for interpolation.
#!/usr/bin/perl
use strict;
use warnings;
die "usage: [ map positions file post SAS ]\n\n" unless @ARGV == 1;
my @file = ();
while (my $line = <$ARGV[0]>) {
chomp $line;
push(@file, $line);
}
my @consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my @filled = (); # my final array, hopefully correctly filled in
#####
# search for consecutive set of lines in @file
#####
for (my $i = 0; $i < $#file; $i++) { # $#file returns the index of the last element in @file
if ($file[$i] !~ /(?:\t\s){3}/) { # if not a blank line
# but a "full line"
push(@filled, $file[$i]); # push the header and full lines, until...
}
elsif ($file[$i] =~ /(?:\t\s){3}/) { # ...a blank line is found
push (@consecutive_lines, $file[$i - 1]); # push preceding full line into array
while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) { # keep pushing lines, so long as they are blank
# or end of file
push(@consecutive_lines, $file[$i++]);
}
push(@consecutive_lines, $file[$i]) ; # else we reach next full line, so push it into array
my @fillme = linearInterpolation(@consecutive_lines); # send set of lines out for filling
push(@filled, @fillme); # push filled in set of lines into the final array
@consecutive_lines = (); # reset or undef array @consecutive_lines for next cycle
} # end of elsif
} # end of for loop
Thanks to user @Kenosis for lots help with the above, which I have since modified (hopefully not mangled).
Next is the linear interpolation. It is here where I am trying to link the first phase of the script to the second phase. And it is not working well so far.
My goal is to hand off the array @incoming
to the subroutine. This array is then split up, so that the actual cell entries are "visible" and can be indexed by the array, and so called upon. I have been trying to figure out how to do this for the column 2 values for Distance_1
first. I feel like this script gets close and it begins to stray at the point just after the interpolated values are calculated.
#####
# subroutine linear interpolation
#####
sub linearInterpolation {
my @incoming = @_; # array of consecutive set of lines
my @splitup; # declare new array, will be a "split up" version of @incoming
my ($A, $B, $C, $D, $E); # variables for linear interpolation
my @fillme; # declaring the "emtpy" array to be filled in
my @dist_1_fills; # array of interpolated values for dist_1
for (my $i = 0;
$i < scalar @incoming; $i++) # loop to split up lines of @incoming
{ # into indexed cell entries
chomp $incoming[$i]; # and make new array of them
my @entries = split('\t', $incoming[$i]);
push(@splitup, @entries);
}
$A = $splitup[1]; # cell entry in column 2 of preceding full line
$B = $splitup[-3]; # cell entry in column 2 of succeeding full line
$C = $splitup[2]; # cell entry in column 3 of preceding full line
$D = $splitup[-2]; # cell entry in column 3 of succeeding full line
$E = scalar @incoming - 1; # equals number of lines in the set minus 1
for (my $i = 1; $i < $E; $i++) { # need to start finding appropriate
# number interpolated values, given number of
my @dist_1_fills =
interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines
for ($i = 0; $i < scalar @splitup; $i += 4) {
push(@fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
# fourth column values will be ignored or filled with --.
# "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
}
}
}
#########
sub interpvalues { # subroutine to find interpolated values
my ($A, $B, $E, $i) = @_;
my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
return $dist_1_answers;
}
The code gets confused in the second part that deals with finding the interpolated values and sending them back to the first part of the code to finally fill in the data set. I think specifically my biggest (though probably not my only) problem is trying to fill in the blank lines with the proper values after they have been calculated in the second subroutine.
Any hints and clues are greatly appreciated!
This program will do what you need. It expects the inout file name as a parameter on the command line.
use strict;
use warnings;
my @saved;
my @needed;
while (<>) {
chomp;
my @fields = split /\t/;
# Pass hrough headers and junk
unless ($fields[0] and $fields[0] =~ /\d/) {
print "$_\n";
next;
}
# Save x-value for records without a y-value
if ($fields[1] !~ /\d/) {
push @needed, $fields[0];
next;
}
# We have a filled-out row. Calculate any intermediate missing ones
if (@needed) {
if ($saved[0] == $fields[0]) {
die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
}
my ($a1, $b1) = solve_linear(@saved[0,1], @fields[0,1]);
my ($a2, $b2) = solve_linear(@saved[0,2], @fields[0,2]);
while (@needed) {
my $x = shift @needed;
my $y1 = $a1 * $x + $b1;
my $y2 = $a2 * $x + $b2;
print join("\t", $x, $y1, $y2), "\n";
}
}
print "$_\n";
@saved = @fields;
}
sub solve_linear {
my ($x0, $y0, $x1, $y1) = @_;
my ($dx, $dy) = ($x1 - $x0, $y1 - $y0);
my $aa = $dy / $dx;
my $bb = ($y0 * $dx - $x0 * $dy) / $dx;
return ($aa, $bb);
}
output
Marker Distance_1 Distance_2 ID
.
.
.
30 13387412 34.80391242 seq-SN_FIRST
31 13387444 34.80391444 seq-SN_Second
31.1 13387455.1 35.303913996 --
31.2 13387466.2 35.803913552 --
32 13387555 39.80391 seq-SN_Third
.
.
.
Tool completed successfully