I have extracted a sequence from a GenBank file. It consists of lines of 60 bases each (with a \n at the end). How can I modify the sequence using Perl so that it prints 120 bases per line, using a regex and not BioPerl? Original format:
1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
So far I have only managed to turn them into strings of 60 characters. I am still trying to figure out how to make them 120 characters long.
my @lines= <$FH_IN>;
foreach my $line (@lines) {
if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
$line=~ s/$1//;
$line=~ s/ //g;
print $line;
}
}
Example of input to the joining step:
agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat
Each line is a single string of 60 bases.
Update (this still does not give sequence lines that are 120 bases long):
my @seq_60;
foreach my $line (@lines) {
if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
$line=~ s/$1//;
$line=~ s/ //g;
push (@seq_60, $line);
}
}
my @output;
for (my $pos= 0; $pos< @seq_60; $pos+= 2) {
push (@output, $seq_60[$pos] . $seq_60[$pos+1]);
}
print @output;
How about:
s/(^|\n)([^\n]{60})\n/$1$2/g
In action:
use strict;
use warnings;
use 5.014;
my $str = q/agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat/;
$str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
say $str;
Output:
agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat
Explanation:
(^|\n) : group 1, start of string or line break
( : start group 2
[^\n]{60} : anything that is not a line break 60 times
) : end group 2
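\n : line break, matched but not written back, so it is removed
Because each match consumes the line break, the next line has no leading \n left to match, so only every second line break is removed and the lines are joined in pairs.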
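To see the effect of the pattern without counting 60 characters, it can be tried on short lines. The following is only a minimal sketch: the helper name join_line_pairs and the $width parameter are made up for illustration, and it assumes every line except possibly the last has exactly $width characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): join each pair of
# fixed-width lines into one line. $width is the length of every input
# line (60 in the question).
sub join_line_pairs {
    my ($text, $width) = @_;
    # Group 1 keeps the preceding line break (or the start of the string),
    # group 2 keeps the line itself; the trailing \n is not written back.
    $text =~ s/(^|\n)([^\n]{$width})\n/$1$2/g;
    return $text;
}

# Four lines of width 4 become two lines of width 8.
my $demo = join "\n", qw(aaaa cccc gggg tttt);
print join_line_pairs($demo, 4), "\n";
```

With an odd number of lines, the last line has no partner and is left on its own.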
Edit according to comment:
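Join lines in pairs (assuming the lines are in @in):
my @out;
for (my $i = 0; $i < @in; $i += 2) {
    chomp($in[$i]);    # drop the line break of the first line in the pair
    # the second line keeps its own "\n"; use '' if the line count is odd
    push @out, $in[$i] . ($in[$i+1] // '');
}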
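A different approach, not the one used above, is to concatenate all bases first and then cut the result into chunks of the wanted width, so the output does not depend on the input having 60 bases per line. This is a rough sketch under assumptions: the helper name rewrap_origin is hypothetical, and sequence lines are assumed to look like "<number> <groups of bases>" as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: strip GenBank numbering and blanks, then rewrap the
# sequence into chunks of at most $width bases.
sub rewrap_origin {
    my ($width, @lines) = @_;
    my $seq = '';
    for my $line (@lines) {
        # Only accept lines that look like "<number> <bases ...>".
        next unless $line =~ /^\s*\d+\s+([a-z\s]+)$/i;
        (my $bases = $1) =~ s/\s+//g;    # drop spaces and the line break
        $seq .= $bases;
    }
    # In list context, a /g match returns every chunk; the last chunk may
    # be shorter than $width.
    return $seq =~ /.{1,$width}/g;
}

# First two sequence lines from the question.
my @raw = (
    "1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg\n",
    "61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg\n",
);
print "$_\n" for rewrap_origin(120, @raw);
```

Call the helper in list context, as in the print loop above; in scalar context the match would only return true or false.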