I have a FASTA file of protein sequences, and I want to find the sequences on chromosome 2 or 5 that contain the three amino acids KSP. How do I write the pattern string?
Here is a brief overview of the fasta file:
>AT1G05230.1
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*
>AT1G05230.2
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*
>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT3G03660.1
MDQEQTPHSPTRHSRSPPSSASGSTSAEPVRSRWSPKPEQILILESIFHSGMVNPPKEETVRIRKMLEKFGAVGDANVFY
................................................................................
VPLPTDEFGFLMHSLQHGEAYFLVPRQT*
>AT3G11260.1
MSFSVKGRSLRGNNNGGTGTKCGRWNPTVEQLKILTDLFRAGLRTPTTDQIQKISTELSFYGKIESKNVFYWFQNHKARE
................................................................................
PYSSCGAEMEHPPPLDLRLSFL*
>AT3G61890.1
MEEGDFFNCCFSEISSGMTMNKKKMKKSNNQKRFSEEQIKSLELIFESETRLEPRKKVQVARELGLQPRQVAIWFQNKRA
...KSP..........................................................................
RLDQGSVLCNDGDYNNNIKTEYFGFEEETDHELMNIVEKADDSCLTSSENWGGFNSDSLLDQSSSNYPNWWEFWS*
................................................................................
(lots of sequences)
................................................................................
>AT5G11060.1
MAFHNNHFNHFTDQQQHQPPPPPQQQQQQHFQESAPPNWLLRSDNNFLNLHTAASAAATSSDSPSSAAANQWLSRSSSFL
................................................................................
SVLKSWWQSHSKWPYPTEEDKARLVQETGLQLKQINNWFINQRKRNWHSNPSSSTVSKNKRRSNAGENSGRDR*
>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*
First, I can write a pattern string that matches chromosome 2 or 5, such as >AT[25]G.
However, the pattern I wrote to match the sequences that meet the condition, >AT[25]G.*KSP.*, failed.
By the way, every sequence record starts with a greater-than sign (>) and ends with an asterisk (*), and all the amino acids are capitalized.
For example, the expected result would be all the sequences on chromosomes 2 and 5 that contain the three amino acids KSP:
>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*
How do I write a regular expression in Vim to match them? I hope you can help me. Thank you very much for reading my question.
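That is a multiline search. In Vim, . does not match a line break, so .*KSP.* only looks for KSP on the header line itself. Try something like the following and modify as required. The character classes match newlines, tabs, alphanumerics, and the literal dots from your sample; because they exclude > and *, a match cannot run past the end of one record into the next.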
^>AT[25]G[\t\n[:alnum:].]*KSP[\t\n[:alnum:].]*\*$
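If you ever need the same filter outside Vim, here is a minimal Python sketch of the same idea. It is only an illustration: the function name find_ksp_records and the short sample records are made up, and [^>*] is used as a looser stand-in for the explicit character class above (anything except the start of the next record or the terminating asterisk).

```python
import re

# Same idea as the Vim pattern: start at a chromosome 2 or 5 header,
# allow anything except '>' or '*' (so the match stays inside one record),
# require KSP somewhere, and end at the record's closing '*'.
RECORD = re.compile(r"^>AT[25]G[^>*]*KSP[^>*]*\*$", re.MULTILINE)


def find_ksp_records(fasta_text):
    """Return the full text of every matching record (hypothetical helper)."""
    return RECORD.findall(fasta_text)


# Made-up miniature records, only to exercise the pattern.
sample = """>AT1G05230.1
MFEKSPNML
A*
>AT2G35940.1
MAAYFHG
NPKSPEIS*
>AT3G61890.1
MEEKSPGD*
>AT5G11060.1
MAFHNNHF*
>AT5G15150.1
MYMKSPYEE
ERNN*
"""

for record in find_ksp_records(sample):
    # Print only the header line of each matching record.
    print(record.splitlines()[0])
```

Like the Vim pattern, this only finds KSP when the three letters sit on the same line; a KSP that is wrapped across a line break in the file would be missed by both.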