I am merging several dataframes into one and sorting them using unix sort
. Before I write the final sorted data I would like to add a prefix/header to that output.
So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
sort_data.write(my_cols + '\n')
subprocess.run(my_cmd, stdout=sort_data)
But, this above doe puts my_cols
at the end of the final output file (i.e mergedAndSorted.txt)
I also tried substituting:
sort_data=io.StringIO(my_cols)
but this gives me an error as I had expected.
How can I add that header to the begining of the subprocess output. I believe this can be achieved by a simple code change.
The problem with your code is a matter of buffering; the tldr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream-encoding from Unicode to bytes going on, but that doesn't really add a new problem, it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes are to the buffered file object, that's fine—they get sequenced properly in the buffer, so they get sequenced properly on the disk.
But if you write to the underlying sort_data.buffer.raw
disk file, or to the sort_data.fileno()
OS file descriptor, those writes may get ahead of the ones that went to sort_data
.
And that's exactly what happens when you use the file as a pipe in subprocess
. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are
PIPE
,DEVNULL
, an existing file descriptor (a positive integer), an existing file object, andNone
.
This implies pretty strongly—if you know enough about the way piping works on *nix and Windows—that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno
or msvcrt.get_osfhandle
on the file objects.