I have a tab-delimited file like this:
this is a sentence.	abb
what is this foo bar.	bev
hello foo bar blah black sheep.	abb
I could use cut -f1 and cut -f2 in a Unix terminal to split it into two files:
this is a sentence.
what is this foo bar.
hello foo bar blah black sheep.
But is it possible to do the same in Python? Would it be faster?
I've been doing it as such:
[i.split('\t')[0] for i in open('in.txt', 'r')]
But is it possible to do the same in Python?
Yes, you can:
l1, l2 = [], []
with open('in.txt', 'r') as f:
    for i in f:
        # will loudly fail unless the line has exactly two columns
        left, right = i.rstrip('\n').split('\t')
        l1.append(left)
        l2.append(right)
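To get the same result as the two cut commands, the columns can also be written straight to two output files in a single pass. This is only a sketch: the function name split_columns and the output file names are made up for illustration, and it assumes every line has exactly two tab-separated fields.

```python
def split_columns(in_path, left_path, right_path):
    """Write column 1 and column 2 of a two-column TSV file to separate files."""
    with open(in_path, 'r') as src, \
         open(left_path, 'w') as left_out, \
         open(right_path, 'w') as right_out:
        for line in src:
            # Raises ValueError unless the line has exactly two columns.
            left, right = line.rstrip('\n').split('\t')
            left_out.write(left + '\n')
            right_out.write(right + '\n')
```

Calling split_columns('in.txt', 'col1.txt', 'col2.txt') would then read in.txt once and produce both files, instead of reading it twice as the two cut invocations do.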
Would it be faster?
It's not likely: cut is a C program optimized for exactly that kind of processing, whereas Python is a general-purpose language that offers great flexibility but is not necessarily fast.
However, one advantage of an algorithm like the one I wrote is that it reads the file only once, whereas with cut you read it twice. That could make the difference.
We'd need to run a benchmark to be 100% sure, though.
Here's a small benchmark, on my laptop, for what it's worth:
>>> timeit.timeit(stmt=lambda: t("file_of_606251_lines"), number=1)
% time cut -d' ' -f1 file_of_606251_lines > /dev/null
cut -d' ' -f1 file_of_606251_lines > /dev/null 0.74s user 0.02s system 98% cpu 0.775 total
% time cut -d' ' -f2 file_of_606251_lines > /dev/null
cut -d' ' -f2 file_of_606251_lines > /dev/null 1.18s user 0.02s system 99% cpu 1.215 total
Together, the two cut runs take 0.775 + 1.215 = 1.990 seconds.
So the Python version is indeed faster, as expected ;-)
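The function t timed above is not shown. As a rough sketch only, assuming t simply wraps the single-pass loop from earlier in this answer, a comparable measurement could be set up as follows. The names read_columns and time_split are hypothetical, the file is assumed to be tab-delimited with two columns per line, and the numbers will differ from machine to machine.

```python
import timeit


def read_columns(path):
    """Read a two-column, tab-separated file in a single pass."""
    l1, l2 = [], []
    with open(path, 'r') as f:
        for line in f:
            # Raises ValueError unless the line has exactly two columns.
            left, right = line.rstrip('\n').split('\t')
            l1.append(left)
            l2.append(right)
    return l1, l2


def time_split(path, number=1):
    """Return the total seconds taken by `number` calls of read_columns."""
    return timeit.timeit(stmt=lambda: read_columns(path), number=number)
```

time_split('in.txt') returns the elapsed time in seconds as a float, which can then be compared with the sum of the two cut timings.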