How can I run subprocess on file contents already loaded in memory (with or without StringIO)?

I am trying to read a GTF file and edit it (using subprocess, grep and awk) before loading it into pandas.

The file has header lines (marked by #), so I need to grep for them and remove them first. I can do this in Python, but I want to bring grep into my pipeline to make processing more efficient.

I tried doing:

import subprocess
from io import StringIO

gtf_file = open('chr2_only.gtf', 'r').read()
gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)

and

gtf_update = subprocess.Popen(["grep '^#' " + gtf_file], shell=True)

Both attempts throw an error. For the first attempt it was:

Traceback (most recent call last):
  File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 39, in <module>
    gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
TypeError: Can't convert '_io.StringIO' object to str implicitly

However, if I specify the filename directly it works:

gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)

and the output is:

<subprocess.Popen object at 0x7fc12e5ea588>
#!genome-build v.1.0
#!genome-version JGI8X
#!genome-date 2008-12
#!genome-build-accession GCA_000004255.1
#!genebuild-last-updated 2008-12

Could someone please provide a few examples for a problem like this, and also explain why I am getting the error and whether (and how) subprocess can be run directly on file contents loaded in memory?

I also tried subprocess with call, check_call, check_output, etc., but got several different error messages, such as:

OSError: [Errno 7] Argument list too long

and

Subprocess in Python: File Name too long

Solution

Here is a possible solution that lets you send a string to grep. In the Popen constructor you declare that you want to communicate with the called program via stdin and stdout. You then send the input with communicate and receive the output as the first element of the tuple it returns.

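#!/usr/bin/env python3

import subprocess

# Read the whole file into memory as text
with open('chr2_only.gtf', 'r') as handle:
    gtf_file = handle.read()

# universal_newlines=True makes the pipes accept and return str instead of bytes
gtf_update = subprocess.Popen(["grep '^#' "], shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True)

# communicate returns (stdout, stderr); stderr is None because it is not piped
gtf_filtered, _ = gtf_update.communicate(gtf_file)

print(gtf_filtered)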
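As for why the original attempts failed: with shell=True the command is a plain string, and everything appended to it becomes part of the command line. The short sketch below reproduces both situations without starting a process. The variable contents is a made-up stand-in for the real file contents.

```python
from io import StringIO

# Hypothetical stand-in for the text read from chr2_only.gtf
contents = "#!genome-build v.1.0\nchr2\tgene\n"

# Attempt 1: a StringIO is an in-memory file object, not a string,
# so it cannot be concatenated onto the command string.
try:
    command = "grep '^#' " + StringIO(contents)
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__

# Attempt 2: this concatenation succeeds, but the whole file is now
# part of the command line instead of being data sent to grep.
command = "grep '^#' " + contents
```

In the second case grep never receives the text as input. For a large file the command line exceeds the operating system's limit on argument size, which is where "Argument list too long" comes from. Data that is already in memory has to go to the child process through stdin instead.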

Note that it is wiser not to use shell=True here, so the Popen line is better written as

gtf_update = subprocess.Popen(["grep", "^#"], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True)

The rationale is that you don't need the shell to parse the arguments to a single executable, so you avoid unnecessary overhead. It is also safer, at least when an argument may be unsafe because it comes from a user (think of a filename containing |). That is obviously not the case here.
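On Python 3.5 or later, the same idea can be written more compactly with subprocess.run, which takes the text to send through its input parameter. This is only a sketch: the function name grep_header_lines is made up for illustration, and it assumes grep is available on the PATH.

```python
import subprocess


def grep_header_lines(text):
    """Return the lines of `text` that start with '#', as filtered by grep.

    Hypothetical helper; assumes a `grep` executable is on the PATH.
    """
    result = subprocess.run(
        ["grep", "^#"],           # argument list, so no shell is involved
        input=text,               # sent to grep on its stdin
        stdout=subprocess.PIPE,   # capture what grep prints
        universal_newlines=True,  # work with str instead of bytes
    )
    # grep exits with 0 on a match, 1 when nothing matched, and a
    # higher status when a real error occurred.
    if result.returncode > 1:
        raise RuntimeError("grep failed with exit status %d" % result.returncode)
    return result.stdout
```

Calling grep_header_lines(gtf_file) with the string read from the file returns the header lines as one string. Because the text travels over a pipe rather than on the command line, its size is not subject to the argument length limit.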

From a performance point of view, I expect that letting grep read the file directly is faster than first reading the file in Python and then sending it to grep.
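A minimal sketch of that variant passes the file name to grep as a separate list element, so Python never loads the file and no shell quoting is needed. The function name grep_header_lines_from_file is hypothetical, and grep is again assumed to be on the PATH.

```python
import subprocess


def grep_header_lines_from_file(path):
    """Let grep read `path` itself and return the lines starting with '#'.

    Hypothetical helper; assumes a `grep` executable is on the PATH.
    """
    result = subprocess.run(
        ["grep", "^#", path],     # pattern and file name as separate arguments
        stdout=subprocess.PIPE,   # capture what grep prints
        universal_newlines=True,  # return str instead of bytes
    )
    # Exit status 1 only means that no line matched; anything above
    # that signals a real error such as a missing file.
    if result.returncode > 1:
        raise RuntimeError("grep failed with exit status %d" % result.returncode)
    return result.stdout
```

To drop the header lines rather than keep them, grep's -v option inverts the match, so ["grep", "-v", "^#", path] returns everything except the header.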