I am having trouble passing a string in python (python variable) as input to a sequence alignment program (muscle
) on the command line (bash). muscle
can take stdin from the command line, e.g.;
~# echo -e ">1\nATTTCTCT\n>2\nATTTCTCC" | muscle
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
- 2 seqs, max length 8, avg length 8
00:00:00 22 MB(2%) Iter 1 100.00% K-mer dist pass 1
00:00:00 22 MB(2%) Iter 1 100.00% K-mer dist pass 2
00:00:00 23 MB(2%) Iter 1 100.00% Align node
00:00:00 23 MB(2%) Iter 1 100.00% Root alignment
>1
ATTTCTCT
>2
ATTTCTCC
It is this fasta alignment (the last 4 lines) that I am after - you can redirect the output of muscle (echo -e ">1\nATTTCTCT\n>2\nATTTCTCC" | muscle > out.file
to get just the fasta alignment, which I need for downstream processing. BUt to get there, I have to pass 'muscle' the FASTA sequence string, which I think is best done via echo
as above in bash.
So, the script takes two multiFASTA files and and pairs each FASTA sequence based on a list of ids for each file - this is working (though I realize it is perhaps not the most efficient way to go about it - I am new python user). Then I need to align each set in muscle
before calculating distances/difference.
Here is what I have so far :
#! /env/python
from pairwise_distances import K2Pdistance
import pairwise_distances
import subprocess
from subprocess import call
import os
fasta1=['']
for line in open('test1.fasta'):
if not line.startswith('>'):
fasta1.append(line.strip())
fasta2=['']
for line in open('test2.fasta'):
if not line.startswith('>'):
fasta2.append(line.strip())
for l1, l2 in zip(open('test1.list'), open ('test2.list')):
try:
a=fasta1[int(l1)]
except IndexError,e:
a="GGG"
try:
b=fasta2[int(l2)]
except (IndexError):
b="CCC"
temp_align=str(">1"+'\n'+a+'\n'+">2"+'\n'+b)
first=subprocess.check_output(['echo','-e',temp_align])
print first
subprocess.call(['bash','muscle'], stdin=first.stdout)
print second
#new=K2Pdistance(outfast1,outfast2)
#subprocess.Popen(['bash','muscle'], stdin=subprocess.check_output(['echo','-e',temp_align], stdout=subprocess.PIPE).std.out)`
The 'temp_align' variable is what I want to pass to muscle - it is the result of combining the appropriate fasta sequence from each multiFASTA file, for each loop over the ids/list, and is formatted like a FASTA file.
This issue is that I can echo
the FASTA string, but I cant seem to "pipe" that via stdin to muscle... The main error I get is: AttributeError: 'str' object has no attribute 'stdout'
.
~#python Beta3.py
>1
ATCGACTACT
>2
ATCGCGCTACT
Traceback (most recent call last):
File "Beta3.py", line 38, in <module>
subprocess.call(['bash','muscle'], stdin=first.stdout)
AttributeError: 'str' object has no attribute 'stdout'
Can subprocess or other commandline modules take strings as stdin? IF not, I am not sure how I can echo and then pipe into muscle... Any other ideas would be much appreciated. I tried a few options from this thread unix.stackexchange
but no luck yet.
EDIT: Here are some example files:
~# cat test1.fasta
>1
ATCGACTACT
>2
ACTCAGTCA
>3
TTCACAGGG
~# cat test2.fasta
>1
ATCGCGCTACT
>2
GGCGTCAGTCA
>3
TTCAACCCCAGGG
~# cat test1.list
1
2
3
~# cat test2.list
1
3
2
EDIT 2: That previous post is related, but my question is about linking two bash commands starting with a python variable that is a string...and then, ideally, capturing the stdout of the second command back into a python variable... I dont quite see how to translate the answers on that post into a solution for my specific question...I guess I don;t fully understand what the poster was trying to do.
It seems that you want to communicate with the muscle
process, then you need a PIPE, use this
(out, err) = subprocess.Popen(['muscle'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate(temp_align)
print out