Search code examples
regexalgorithmperlbioinformaticspalindrome

Finding Palindromes(perfect palindromes) in more than one protein sequences using perl


I am a newbie in Perl (Regular Expressions). I need aa example on how to write a program for finding out palindromes (perfect) in more than one protein sequences (let it be 4 sequence with 200 amino acids in number, in a file) I have to filter out, the palindromes and the position of palindromes present in the sequences.

>TRE|Q47404|Q47404 (409 AA) Glycosyl transferase [Escherichia coli]
MIFDASLKKLRKLFVNPIGFFRDSWFFNSKNKAEELLSPLKIKSKNIFIVAHLGQLKKAE
LFIQKFSRRSNFLIVLATKKNTEMPRLILEQMNKKLFSSYKLLFIPTEPNTFSLKKVIWF
YNVYKYIVLNSKAKDAYFMSYAQHYAIFIWLFKKNNIRCSLIEEGTGTYKTEKKKPLVNI
NFYSWIINSIILFHYPDLKFENVYGTFPNLLKEKFDAKKIFEFKTIPLVKSSTRMDNLIH

>TRE|O06435|O06435 (492 AA) SynE [Neisseria meningitidis]
MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
IFVSQRYPVSDEVYYKTIVETLNQMSLRIEGKIFIKLHPKEMENKNIMSLFLNMVTINPR

>TRE|Q8VRL9|Q8VRL9 (492 AA) SiaD [Neisseria meningitidis]
MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY

I need the output of perfect palindromes in this and also their positions. I have gone through many articles, but couldn't get any better idea. Please suggest me some techniques and programs for this.


Solution

  • There are three regex features that are required for this challenge:

    1. perlretut - Recursive Patterns — To find palindromes

    2. perlretut - Positive Lookahead Assertions — To find matches that overlap

    3. perlretut - Position Information — To determine where the matches are in the string.

    Putting these together gives this result:

    use strict;
    use warnings;
    
    my $pp = qr/(?: (\w) (?1) \g{-1} | \w? )/ix;
    
    local $/ = '';
    
    while (<DATA>) {
        chomp;
        my ($header, @lines) = split "\n";
        my $data = join '', @lines;
    
        print "$header\n$data\n";
    
        while ($data =~ /(?=($pp))/g) {
            print "$-[0] - $1\n" if length($1) > 2;
        }
    }
    
    __DATA__
    >TRE|Q47404|Q47404 (409 AA) Glycosyl transferase [Escherichia coli]
    MIFDASLKKLRKLFVNPIGFFRDSWFFNSKNKAEELLSPLKIKSKNIFIVAHLGQLKKAE
    LFIQKFSRRSNFLIVLATKKNTEMPRLILEQMNKKLFSSYKLLFIPTEPNTFSLKKVIWF
    YNVYKYIVLNSKAKDAYFMSYAQHYAIFIWLFKKNNIRCSLIEEGTGTYKTEKKKPLVNI
    NFYSWIINSIILFHYPDLKFENVYGTFPNLLKEKFDAKKIFEFKTIPLVKSSTRMDNLIH
    
    >TRE|O06435|O06435 (492 AA) SynE [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
    LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
    QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
    LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
    IFVSQRYPVSDEVYYKTIVETLNQMSLRIEGKIFIKLHPKEMENKNIMSLFLNMVTINPR
    
    >TRE|Q8VRL9|Q8VRL9 (492 AA) SiaD [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
    LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
    QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
    LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
    

    Outputs:

    >TRE|Q47404|Q47404 (409 AA) Glycosyl transferase [Escherichia coli]
    MIFDASLKKLRKLFVNPIGFFRDSWFFNSKNKAEELLSPLKIKSKNIFIVAHLGQLKKAELFIQKFSRRSNFLIVLATKKNTEMPRLILEQMNKKLFSSYKLLFIPTEPNTFSLKKVIWFYNVYKYIVLNSKAKDAYFMSYAQHYAIFIWLFKKNNIRCSLIEEGTGTYKTEKKKPLVNINFYSWIINSIILFHYPDLKFENVYGTFPNLLKEKFDAKKIFEFKTIPLVKSSTRMDNLIH
    6 - LKKL
    29 - KNK
    40 - KIK
    42 - KSK
    46 - IFI
    66 - SRRS
    86 - LIL
    123 - YKY
    131 - KAK
    146 - IFI
    164 - GTG
    165 - TGT
    172 - KKK
    178 - NIN
    211 - KEK
    220 - FEF
    >TRE|O06435|O06435 (492 AA) SynE [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNNLLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTIQPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNNLHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDYIFVSQRYPVSDEVYYKTIVETLNQMSLRIEGKIFIKLHPKEMENKNIMSLFLNMVTINPR
    26 - FSSF
    55 - KLK
    70 - MKM
    114 - KLLK
    135 - SLLS
    137 - LSL
    154 - TAT
    205 - NAN
    220 - STS
    222 - SQS
    271 - KIFIK
    272 - IFI
    280 - EME
    283 - NKN
    289 - LFL
    >TRE|Q8VRL9|Q8VRL9 (492 AA) SiaD [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNNLLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTIQPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNNLHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
    26 - FSSF
    55 - KLK
    70 - MKM
    114 - KLLK
    135 - SLLS
    137 - LSL
    154 - TAT
    205 - NAN
    220 - STS
    222 - SQS