I'm trying to edit an MSA (Multiple Sequence Alignment) file generated by ClustalW, to trim sequences before the consensus one, using BioPython. xxx refers to other bases not relevant here
Here's the example I/O :
INPUT
ITS_primer_fw --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxxxxx
PRL2010 ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
BBF32_3 ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxx
BBFCG32 CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxxxx
EXPECTED OUTPUT
ITS_primer_fw CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence CGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxx
PRL2010 CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxx
BBF32_3 CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxx
BBFCG32 CGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxx
The documented code for AlignIO
describes just a way to extract sequences by treating the alignment as an array. In this example
align = AlignIO.read(input_file, "clustal")
sub_alignment = align[:,20:]
I was able to extract a subalignment made by all the sequences (:) starting from the 20th nucleotide. I'm looking for a way to replace the 20
in the example with the position of the first nucleotide of the consensus sequence.
Found an answer thanks to a Biostars user.
The twitch is looking into column to find the start point, which will be as expected after the last '-'. By default into my alignment first row is the shortest, and starts with '-' before the alignment get good.
So here's the code:
aln = AlignIO.read(input_file, "clustal")
for col in range(aln.get_alignment_length()): # search into column
res = list(aln[:,col])
if not '-' in res:
position = col # find start point
print('First full column is {}'.format(col))
break
print(aln[:,position::]) # print the whole alignment starting from the position variable found