
Change the character length of single line strings using regex


I have extracted a sequence from a GenBank file. It consists of single-line strings of 60 bases each (with a \n at the end). How can I modify the sequence in Perl so that it prints 120 bases per line, using a regex and not BioPerl? Original format:

    1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
   61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
  121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
  181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
  241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
  301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
  361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
  421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
  481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat 

So far I have only managed to turn them into strings of 60 characters each. I am still trying to figure out how to make them 120 characters long.

my @lines= <$FH_IN>;
foreach my $line (@lines) {
    if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
            $line=~ s/$1//;
            $line=~ s/ //g;
            print $line;
    }

}

example of input:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

which has 60 bases on each line.

Update (this still does not give sequence lines that are 120 bases long):

my @seq_60;
foreach my $line (@lines) {
        if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
                $line=~ s/$1//;
                $line=~ s/ //g;
                push (@seq_60, $line);
        }
}

my @output;
for (my $pos= 0; $pos< @seq_60; $pos+= 2) {
        push (@output, $seq_60[$pos] . $seq_60[$pos+1]);
}

print @output;

Solution

  • How about:

    s/(^|\n)([^\n]{60})\n/$1$2/g
    

    In action:

    use strict;
    use warnings;
    use 5.014;
    
    my $str = q/agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
    agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
    gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
    cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
    ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
    gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
    aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
    atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat/;
    
    $str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
    say $str;
    

    Output:

    agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
    gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
    ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
    aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat
    

    Explanation:

    (^|\n)      : group 1, start of string or line break
    (           : start group 2
      [^\n]{60} : anything that is not a line break 60 times
    )           : end group 2
    \n          : line break
    

    Edit according to comment:

    Join the lines in pairs:

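    # @in holds the 60-base lines, each still ending in "\n"
    my @out;
    for (my $i = 0; $i < @in; $i += 2) {
        # drop the newline of the first line so the pair ends up on one line
        chomp($in[$i]);
        # with an odd number of lines the last one has no partner
        push @out, $in[$i] . ($in[$i+1] // "\n");
    }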
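
Both approaches above depend on every input line holding exactly 60 bases. A more general alternative, sketched here under some assumptions, is to strip the numbering and spaces, concatenate everything into one string, and then re-wrap it at the desired width with a single substitution. The helper name `rewrap_sequence` is hypothetical, and the pattern assumes the sequence lines contain only the characters a, c, g, t and n; widen the character class if your records use other IUPAC codes.

```perl
use strict;
use warnings;

# Hypothetical helper: re-wrap GenBank ORIGIN-style lines to $width bases per line.
sub rewrap_sequence {
    my ($text, $width) = @_;
    $width //= 120;

    my $seq = '';
    for my $line (split /\n/, $text) {
        # Keep only lines that look like "   61 agacgacgca tggggcctgc ..."
        # (assumption: bases are limited to a, c, g, t, n)
        next unless $line =~ /^\s*\d+\s+([acgtn ]+)$/i;
        (my $bases = $1) =~ s/\s+//g;    # drop the spaces between 10-base blocks
        $seq .= $bases;
    }

    # Put a newline after every $width bases; the final chunk may be shorter.
    $seq =~ s/(.{1,$width})/$1\n/g;
    return $seq;
}

my $genbank = <<'END';
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
END

print rewrap_sequence($genbank, 120);
```

With the three 60-base sample lines, this prints one line of 120 bases followed by one line of 60 bases. Because the width is a parameter, the same sketch works for any line length, and it does not break when the input has an odd number of lines or a short final line.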