Randomly pick a region and process it, a number of times


I have data like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK

I want to randomly pick a region of 10 letters from it and count the number of F's in that region. I want to do that a given number of times, for example 1000 times or even more.

As an example, I randomly pick

LVPSLTRYLT    0

then

ITNLRSFIHK    1

then again randomly pick 10 consecutive letters

AHSRIRKERP    0

This continues until the requested number of runs is reached. I want to store all randomly selected regions with their counts, because I then want to calculate how many times F is seen.

So I do the following

# first I remove the header 
grep -v ">" data.txt > out.txt

Then, to randomly get one region of 10 letters, I tried shuf, with no success:

shuf -n1000 data.txt 

Then I tried awk, which was not successful either:

awk 'BEGIN {srand()} !/^$/ { if (rand() == 10) print $0}'

Then I calculate the number of F's and save it in a file:

grep -i -e [F] |wc -l 

Note: the same region should not be picked twice.


Solution

  • I have to assume some things here, and accept some restrictions

    • The random regions to pick don't depend in any way on specific lines

    • Order doesn't matter; there need to be N regions spread out through the file

    • File can be a Gigabyte in size, so can't read it whole (would be much easier!)

    • There are unhandled (edge or unlikely) cases, discussed after code

    First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.

    use warnings;
    use strict;
    use feature 'say';
    
    use Getopt::Long;
    use List::MoreUtils qw(uniq);
    
    my ($region_len, $num_regions) = (10, 10);
    my $count_freq_for = 'F';
    #srand(10);
    
    GetOptions(
        'num-regions|n=i' => \$num_regions, 
        'region-len|l=i'  => \$region_len, 
        'char|c=s'        => \$count_freq_for,
    ) or usage();
    
    my $file = shift || usage();
    
    # List of (up to) $num_regions random numbers, spanning the file size
    # However, we skip all '>sp' lines so take more numbers (estimate)
    open my $fh, '<', $file  or die "Can't open $file: $!";
    $num_regions += int $num_regions * fraction_skipped($fh);
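    # Parenthesize so that the region length is subtracted from the file size
    # before rand is called; otherwise negative positions can be drawn
    my @rand = uniq sort { $a <=> $b } 
        map { int rand( (-s $file) - $region_len ) } 1..$num_regions;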
    say "Starting positions for regions: @rand";
    
    my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0); 
    
    my $region;
    
    while (my $line = <$fh>) { 
        chomp $line;
        # Total number of characters so far, up to this line and with this line
        $nchars_prev = $nchars;
        $nchars += length $line;
        next if $line =~ /^\s*>sp/;
    
        # Complete the region if there wasn't enough chars on the previous line 
        if ($chars_left > 0) {
            $region .= substr $line, 0, $chars_left;
            my $cnt = () = $region =~ /$count_freq_for/g;
            say "$region $cnt";
            $chars_left = -1; 
        };  
    
        # Random positions that happen to be on this line    
        # (>= so that a region may also start at the first character of a line)
        my @pos = grep { $_ >= $nchars_prev and $_ < $nchars } @rand;
        # say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;
    
        for (@pos) { 
            my $pos_in_line = $_ - $nchars_prev;
            $region = substr $line, $pos_in_line, $region_len; 
    
            # Don't print if there aren't enough chars left on this line
            last if ( $chars_left = 
                ($region_len - (length($line) - $pos_in_line)) ) > 0;
    
            my $cnt = () = $region =~ /$count_freq_for/g;
            say "$region $cnt";
        }   
    }
    
    
    sub fraction_skipped {
        my ($fh) = @_;
        # Start from 0 so that a file without '>sp' lines gives no warnings
        my ($skip_len, $data_len) = (0, 0);
        my $curr_pos = tell $fh;
        seek $fh, 0, 0  if $curr_pos != 0;
        while (<$fh>) {
            chomp;
            if (/^\s*>sp/) { $skip_len += length }
            else           { $data_len += length }
        }
        seek $fh, $curr_pos, 0;  # leave it as we found it
        my $total = $skip_len + $data_len;
        return $total ? $skip_len / $total : 0;  # empty file: nothing skipped
    }
    
    sub usage {
        say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
        exit;
    }
    

    Uncomment the srand line to get the same run every time, for testing. Notes follow.

    Some corner cases

    • If the 10-long window doesn't fit on the line from its random position, it is completed on the next line, but any further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125, then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than having one region fewer).

    • When an incomplete region is filled up on the next line (the first if in the while loop), that line may be shorter than the number of characters still needed ($chars_left). This is very unlikely; handling it needs an additional check there, which is left out.

    • Random numbers are pruned of duplicates. This skews the sequence, but so minutely that it should not matter here; we may also end up with fewer numbers than asked for, but only by very few.
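
    The check left out in the second corner case could look roughly like the sketch below. The helper complete_region and its callback argument are hypothetical names, not part of the program above; the idea is simply to keep taking characters from following data lines until the region is full or the input ends.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the program above): complete a partial
# region by taking characters from as many following data lines as needed.
# $next_line is a code ref returning the next chomped line, or undef at EOF.
sub complete_region {
    my ($region, $chars_left, $next_line) = @_;
    while ($chars_left > 0) {
        my $line = $next_line->();
        return if not defined $line;      # ran out of input: no region
        next if $line =~ /^\s*>sp/;       # skip header lines, as above
        my $piece = substr $line, 0, $chars_left;
        $region     .= $piece;
        $chars_left -= length $piece;     # stays > 0 if the line was short
    }
    return $region;
}

# Usage, with an in-memory list standing in for the file
my @lines = ('>sp|X|header', 'AB', 'CDEFGH');
my $got = complete_region('XYZ', 5, sub { shift @lines });
print "$got\n";   # XYZABCDE
```

    In the main loop this would replace the single substr in the first if block, with the callback reading from $fh; the character bookkeeping ($nchars) would then have to account for the lines consumed this way.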

    Handling of issues regarding randomness

    "Randomness" here is pretty basic, which seems suitable. We also need to consider the following.

    Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But the >sp lines are skipped, and any of our numbers that fall within those lines won't be used, so we may end up with fewer regions than numbers drawn. Those lines are shorter and thus less likely to have numbers on them, so not many numbers are lost, but in some runs I saw as many as 3 out of 10 numbers skipped, ending up with a random sample 70% of the desired size.

    If this is a bother, there are ways to approach it. To not skew the distribution even further they all should involve pre-processing the file.

    The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but one which should still produce a number of regions close to the desired one for large enough files.
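
    As a side note on the arithmetic: the program scales the number of draws by (1 + f), where f is the skipped fraction. If one assumes that positions land uniformly over all characters, a position survives with probability (1 - f), so about n / (1 - f) draws would be needed to expect n surviving regions. The sketch below only illustrates that estimate; draws_needed is a hypothetical name and this is not what the program above does.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Assumption: a uniformly drawn position falls on a skipped line with
# probability $f, so it survives with probability (1 - $f). To expect $n
# surviving positions we then need about $n / (1 - $f) draws.
sub draws_needed {
    my ($n, $f) = @_;
    die "fraction must be in [0, 1)\n" if $f < 0 or $f >= 1;
    return ceil( $n / (1 - $f) );
}

printf "%d\n", draws_needed(10, 0.3);   # 15
```

    For small f the two factors are close, which is consistent with the estimate working reasonably when header lines are short compared to the data.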

    More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines and then re-sample to account for that. This may still mess with the distribution, which arguably isn't an issue here, but more to the point it may simply be unneeded.
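
    A different way to lose no points at all, if the pre-processing pass is acceptable, is to draw directly from the set of valid start positions instead of from file offsets. The sketch below is not the re-sampling described above but a swapped-in alternative: it assumes regions must fit within a single data line, and it keeps the lines in memory for brevity (for a large file one would store only lengths and offsets from the first pass). The name sample_regions is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: draw unique start positions among valid ones only.
# A start is valid if a whole region fits on its (non-header) line.
sub sample_regions {
    my ($lines, $num_regions, $region_len) = @_;

    # Keep data lines that can hold a region, with the cumulative number
    # of valid start positions up to and including each line
    my (@data, @cum);
    my $total = 0;
    for my $line (@$lines) {
        next if $line =~ /^\s*>/;
        my $starts = length($line) - $region_len + 1;
        next if $starts < 1;
        $total += $starts;
        push @data, $line;
        push @cum,  $total;
    }
    $num_regions = $total if $num_regions > $total;  # can't exceed what exists

    # Draw until we have enough distinct positions; duplicates collapse
    my %seen;
    while (scalar(keys %seen) < $num_regions) {
        $seen{ int rand $total } = 1;
    }

    my @regions;
    for my $k (sort { $a <=> $b } keys %seen) {
        # Binary search for the first line whose cumulative count exceeds $k
        my ($lo, $hi) = (0, $#cum);
        while ($lo < $hi) {
            my $mid = int( ($lo + $hi) / 2 );
            if ($cum[$mid] > $k) { $hi = $mid } else { $lo = $mid + 1 }
        }
        my $offset = $k - ($lo ? $cum[$lo - 1] : 0);
        push @regions, substr $data[$lo], $offset, $region_len;
    }
    return @regions;
}

# Usage on a tiny in-memory example
my @lines = ('>sp|demo', 'ABCDEFGHIJKL', 'SHORT', 'MNOPQRSTUVWXYZ');
for my $r (sample_regions(\@lines, 5, 10)) {
    my $cnt = () = $r =~ /F/g;
    print "$r $cnt\n";
}
```

    Because each valid start is equally likely and start positions are unique, this also satisfies the "don't pick the same region twice" requirement by construction, at the price of the extra pass.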

    In all this you read the big file twice. The extra processing time should only be in the seconds, but if this is unacceptable, change the function fraction_skipped to read through only 10-20% of the file. With large files this should still provide a reasonable estimate.
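
    A minimal sketch of such a partial scan follows, assuming it is enough to stop once a given share of the file's bytes has been read. The name fraction_skipped_partial and its extra parameters are hypothetical, not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical variant of fraction_skipped: stop after roughly $fraction of
# the file's bytes have been read (0.1 to 0.2 is suggested above).
sub fraction_skipped_partial {
    my ($fh, $file_size, $fraction) = @_;
    my ($skip_len, $data_len) = (0, 0);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0;
    my $limit = $file_size * $fraction;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*>sp/) { $skip_len += length $line }
        else                    { $data_len += length $line }
        last if tell($fh) >= $limit;      # read enough for an estimate
    }
    seek $fh, $curr_pos, 0;               # leave the handle as we found it
    my $total = $skip_len + $data_len;
    return $total ? $skip_len / $total : 0;
}

# Usage, with an in-memory file standing in for the real one
my $text = ">sp|a\nAAAAAAAAAA\n>sp|b\nBBBBBBBBBB\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
printf "%.3f\n", fraction_skipped_partial($fh, length $text, 0.5);
close $fh;
```

    With a real file the call would be fraction_skipped_partial($fh, -s $file, 0.2). The estimate is only as good as the assumption that the start of the file is representative of the rest.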

    Note on a particular test case

    With srand(10) (the commented-out line near the beginning) we get random numbers such that on one line a region starts 8 characters before the end of the line. So that case does test the code that completes the region on the next line.


    Here is a simple driver to run the above a given number of times, for statistics.

    Doing this with builtin tools (system, qx) is altogether harder, and libraries (modules) help. I use IPC::Run here. There are quite a few other options.

    Adjust and add code to process as needed for statistics; output is in files.

    use warnings;
    use strict;
    use feature 'say';
    
    use Getopt::Long;
    use IPC::Run qw(run);
    
    my $outdir = 'rr_output';         # pick a directory name
    mkdir $outdir if not -d $outdir;    
    my $prog  = 'random_regions.pl';  # your name for the program
    my $input = 'data_file.txt';      # your name for input file     
    my $ch = 'F';
    
    my ($runs, $regions, $len) = (10, 10, 10);    
    GetOptions(
        'runs|n=i'  => \$runs, 
        'regions=i' => \$regions, 
        'length=i'  => \$len, 
        'char=s'    => \$ch, 
        'input=s'   => \$input
    ) or usage();
    
    my @cmd = ( $prog, $input, 
        '--num-regions', $regions, 
        '--region-len', $len, 
        '--char', $ch
    );    
    say "Run: @cmd, $runs times.";
    
    for my $n (1..$runs) {
        my $outfile = "$outdir/regions_r$n.txt";
        # $outfile already includes the directory name
        say "Run #$n, output in: $outfile";
        run \@cmd, '>', $outfile  or die "Error with @cmd: exit status $?";
    }    
    
    sub usage {
        say STDERR "Usage: $0 [options]", "\n\toptions: ...";
        exit;
    }
    

    Please expand on the error checking. See for instance this post and links on details.

    Simplest use: driver_random.pl -n 4, but you can give all of main program's parameters.

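    The called program (random_regions.pl above) must be executable, and the driver must be able to find it: give its path (for example ./random_regions.pl) if it is not on your PATH.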
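
    For the statistics step, a minimal post-processing sketch could sum the counts from the files the driver writes. It assumes the default rr_output directory from the driver and that each result line has the form "region count", as printed by the main program; other lines, such as the "Starting positions ..." line that also goes to standard output, are ignored. The name tally_file is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical post-processing: count regions and sum occurrences in one
# output file. Only lines of the form "<region> <count>" are used.
sub tally_file {
    my ($file) = @_;
    my ($regions, $total) = (0, 0);
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /^(\S+)\s+(\d+)\s*$/;
        ++$regions;
        $total += $2;
    }
    close $fh;
    return ($regions, $total);
}

# Report per run; prints nothing if no output files exist yet
for my $file (glob 'rr_output/regions_r*.txt') {
    my ($regions, $total) = tally_file($file);
    print "$file: $regions regions, $total occurrences\n";
}
```

    From the per-run totals one can then compute whatever summary is needed, for example the mean number of occurrences per region across runs.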


      Some other options for running external programs, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run, used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See links (examples) assembled in this post, for example.