Search code examples
pythonregexnumpybioinformaticsbiopython

Trim sequences based on alignment


I'm trying to edit an MSA (Multiple Sequence Alignment) file generated by ClustalW, to trim sequences before the consensus one, using BioPython. xxx refers to other bases not relevant here

Here's the example I/O :

INPUT

ITS_primer_fw               --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxxxxx
PRL2010                     ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxx
BBFCG32                     CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxxxx

EXPECTED OUTPUT

ITS_primer_fw               CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxx
PRL2010                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxx
BBFCG32                     CGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxx

The documented code for AlignIO describes just a way to extract sequences by treating the alignment as an array. In this example

align = AlignIO.read(input_file, "clustal")
sub_alignment = align[:,20:]

I was able to extract a subalignment made by all the sequences (:) starting from the 20th nucleotide. I'm looking for a way to replace the 20 in the example with the position of the first nucleotide of the consensus sequence.


Solution

  • Found an answer thanks to a Biostars user.

    The twitch is looking into column to find the start point, which will be as expected after the last '-'. By default into my alignment first row is the shortest, and starts with '-' before the alignment get good.

    So here's the code:

    aln = AlignIO.read(input_file, "clustal")
    for col in range(aln.get_alignment_length()):  # search into column
        res = list(aln[:,col])
        if not '-' in res:
            position = col                         # find start point
            print('First full column is {}'.format(col))
            break
    print(aln[:,position::])                       # print the whole alignment starting from the position variable found