I have a tab-delimited file with lines like this:
foo bar bar <tab>x y z<tab>a foo foo
...
Imagine 1,000,000 lines with up to 200 words per line, each word averaging 5-6 characters.
To get the 2nd and 3rd columns, I can do this:
with open('test.txt', 'r') as infile:
    column23 = [i.split('\t')[1:3] for i in infile]
Or I could use Unix tools (see "How can I get 2nd and third column in tab delim file in bash?"):
import os
column23 = [i.split('\t') for i in os.popen('cut -f 2-3 test.txt').readlines()]
Which is faster? Is there any other way to extract the 2nd and 3rd columns?
Use neither. Unless it proves to be too slow, use the csv module, which is far more readable.
import csv
with open('test.txt', 'r') as infile:
    column23 = [cols[1:3] for cols in csv.reader(infile, delimiter="\t")]
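If you want to check whether the csv version really is too slow for your data, you can time it against the plain split version. The following is only a rough sketch: the helper names (columns_split, columns_csv, make_sample) and the generated sample file are my own assumptions for illustration, not anything from your setup. The numbers will depend on your machine and your actual file. Note that csv.reader also interprets double quotes by default, so the two versions can differ on data containing quote characters.

```python
import csv
import os
import tempfile
import timeit


def columns_split(path):
    # Plain str.split; strip the newline so the last column stays clean.
    with open(path, 'r') as infile:
        return [line.rstrip('\n').split('\t')[1:3] for line in infile]


def columns_csv(path):
    # csv.reader handles line endings itself; the csv docs recommend newline=''.
    with open(path, 'r', newline='') as infile:
        return [cols[1:3] for cols in csv.reader(infile, delimiter='\t')]


def make_sample(n_lines=10000):
    # Hypothetical test data shaped like the example line in the question.
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w') as out:
        for _ in range(n_lines):
            out.write('foo bar bar\tx y z\ta foo foo\n')
    return path


if __name__ == '__main__':
    path = make_sample()
    try:
        for func in (columns_split, columns_csv):
            # Time five full passes over the sample file.
            seconds = timeit.timeit(lambda: func(path), number=5)
            print(func.__name__, round(seconds, 3))
    finally:
        os.remove(path)
```

To measure your real file, replace the make_sample() call with the path to test.txt and drop the os.remove cleanup.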