I am trying to read a GTF file and then edit it (using subprocess, grep and awk) before loading it into pandas.
I have a file with header lines (marked by a leading #), so I need to grep those and remove them first. I can do it in Python, but I want to introduce grep into my pipeline to make processing more efficient.
I tried doing:
import subprocess
from io import StringIO
gtf_file = open('chr2_only.gtf', 'r').read()
gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
and
gtf_update = subprocess.Popen(["grep '^#' " + gtf_file], shell=True)
Both of these attempts throw an error. For the first attempt it was:
Traceback (most recent call last):
  File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 39, in <module>
    gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
TypeError: Can't convert '_io.StringIO' object to str implicitly
However, if I specify the filename directly it works:
gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)
and the output is:
<subprocess.Popen object at 0x7fc12e5ea588>
#!genome-build v.1.0
#!genome-version JGI8X
#!genome-date 2008-12
#!genome-build-accession GCA_000004255.1
#!genebuild-last-updated 2008-12
Could someone please provide different examples for a problem like this, and also explain why I am getting the error and whether (and how) it is possible to run subprocess directly on file contents that are already loaded into memory?
I also tried using subprocess with call, check_call, check_output, etc., but I got several different error messages, like these:
OSError: [Errno 7] Argument list too long
and
Subprocess in Python: File Name too long
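Your attempts fail because the command is built by string concatenation: a StringIO object cannot be concatenated with a str, and concatenating the file's contents puts the whole file on the command line, which is the likely cause of the "Argument list too long" and "File name too long" errors.
Here is a possible solution that allows you to send a string to grep. Essentially, you declare in the Popen constructor that you want to communicate with the called program via stdin and stdout. You then send the input via communicate and receive the output as the return value of communicate.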
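#!/usr/bin/env python3
import subprocess
# Read the whole file into memory as text
with open('chr2_only.gtf', 'r') as handle:
    gtf_file = handle.read()
# universal_newlines=True makes communicate() accept and return str instead of bytes
gtf_update = subprocess.Popen(["grep '^#' "], shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True)
# communicate() returns (stdout, stderr); stderr is None here because it is not piped
gtf_filtered, _ = gtf_update.communicate(gtf_file)
print(gtf_filtered)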
Note that it is wise not to use shell=True. Therefore, the Popen line should be written as
gtf_update = subprocess.Popen(["grep", "^#"], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True)
The rationale is that you don't need the shell to parse the arguments to a single executable, so you avoid unnecessary overhead. It is also better from a security point of view, at least if some argument is potentially unsafe because it comes from a user (think of a filename containing |). (This is obviously not the case here.)
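On Python 3.5 or later, the same idea can be sketched with subprocess.run, which wraps the Popen and communicate steps in one call. This is only a minimal sketch: the helper name grep_text, its invert flag and the sample string are made up for illustration, and it assumes a grep executable is on the PATH.

```python
import subprocess


def grep_text(pattern, text, invert=False):
    """Pipe `text` to grep on stdin and return the selected lines as one string.

    Hypothetical helper for illustration; assumes `grep` is on the PATH.
    """
    cmd = ["grep"]
    if invert:
        cmd.append("-v")  # keep the lines that do NOT match the pattern
    cmd += ["-e", pattern]
    # No shell involved: the argument list goes to grep as-is, and the text
    # travels over stdin, so its size never touches the command line.
    result = subprocess.run(
        cmd,
        input=text,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange str rather than bytes
    )
    # grep exits with 1 when no line is selected; only codes above 1 are errors.
    if result.returncode > 1:
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result.stdout


# Made-up two-line sample standing in for the contents of a GTF file.
sample = "#!genome-build v.1.0\nchr2\tsource\tgene\n"
headers = grep_text("^#", sample)            # only the header lines
body = grep_text("^#", sample, invert=True)  # everything except the headers
```

Since the goal in the question is to drop the header lines, the invert=True call (grep -v) is probably the one you want in the pipeline.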
Note that from a performance point of view, I expect that reading the file directly with grep is faster than first reading the file with Python and then sending it to grep.
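A minimal sketch of that direct variant, where grep opens the file itself and Python only collects the output. The helper name grep_file and its invert flag are hypothetical, grep is assumed to be on the PATH, and I have not measured the speed difference.

```python
import subprocess


def grep_file(pattern, path, invert=False):
    """Let grep read `path` itself and return the selected lines as one string.

    Hypothetical helper for illustration; assumes `grep` is on the PATH.
    """
    cmd = ["grep"]
    if invert:
        cmd.append("-v")  # keep the lines that do NOT match the pattern
    # "--" ends option parsing, so a path starting with "-" is not read as a flag.
    cmd += ["-e", pattern, "--", path]
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # return str rather than bytes
    )
    # grep exits with 1 when no line is selected; only codes above 1 are errors.
    if result.returncode > 1:
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result.stdout


# Example call with the file from the question (left commented out because
# the file may not exist where this sketch is run):
# body = grep_file("^#", "chr2_only.gtf", invert=True)
```

The returned string can be wrapped in io.StringIO and handed to pandas.read_csv, which accepts file-like objects, so the filtered text never has to be written back to disk.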