Tags: python, unix, csv, cut

How to split CSV files into multiple files using the delimiter? (Python)


I have a tab delimited file as such:

this is a sentence.	abb
what is this foo bar.	bev
hello foo bar blah black sheep.	abb

I could use cut -f1 and cut -f2 in a Unix terminal to split it into two files:

this is a sentence.
what is this foo bar.
hello foo bar blah black sheep.

and:

abb
bev
abb

But is it possible to do the same in Python? Would it be faster?

I've been doing it as such:

[i.split('\t')[0] for i in open('in.txt', 'r')]
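
Presumably a matching comprehension with index 1 would give the second column (stripping the trailing newline), though each comprehension re-reads the whole file:

# hypothetical counterpart for the second column; still one full read per column
[i.split('\t')[1].rstrip('\n') for i in open('in.txt', 'r')]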

Solution

  • But is it possible to do the same in Python?

    Yes, you can:

    l1, l2 = [], []
    
    with open('in.txt', 'r') as f:
        for i in f:
            # will loudly fail unless the line has exactly two tab-separated columns
            left, right = i.rstrip('\n').split('\t')
            l1.append(left)
            l2.append(right)
    
    print("\n".join(l1))
    print("\n".join(l2))
    

  • Would it be faster?

    Not likely: cut is a C program optimized for exactly this kind of processing, whereas Python is a general-purpose language that offers great flexibility but is not necessarily fast.

    The one advantage of an approach like the one above, though, is that it reads the file only once, whereas with cut you read it twice. That could make the difference.

    Though we'd need to run some benchmarking to be 100% sure.
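
    For reference, the t in the timing call below isn't shown; it's presumably a small wrapper around the single-pass loop above. A minimal sketch under that assumption (the benchmark file is fed to cut with -d' ', so the real wrapper may well split on a space rather than a tab):

    def t(path):
        # assumed wrapper: one pass over the file, collecting both columns
        l1, l2 = [], []
        with open(path, 'r') as f:
            for line in f:
                left, right = line.rstrip('\n').split('\t')
                l1.append(left)
                l2.append(right)
        return l1, l2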

    Here's a small benchmark, on my laptop, for what it's worth:

    >>> timeit.timeit(stmt=lambda: t("file_of_606251_lines"), number=1)
    1.393364901014138
    

    vs

    % time cut -d' ' -f1 file_of_606251_lines > /dev/null
    cut -d' ' -f1 file_of_606251_lines > /dev/null  0.74s user 0.02s system 98% cpu 0.775 total
    % time cut -d' ' -f2 file_of_606251_lines > /dev/null
    cut -d' ' -f2 file_of_606251_lines > /dev/null  1.18s user 0.02s system 99% cpu 1.215 total
    

    which adds up to 1.990 seconds in total.

    So the Python version is indeed faster, as expected ;-)