I am using Python to process data from very large text files (~52GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to locate specific lines. Luckily, the string I am searching for is always in the first column.
The whole thing works, memory is not a problem (I'm not loading the file into memory, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run!
The data looks something like this:
scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold126 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
and I am searching for all the lines that start with a particular string in the first column. I want to process those lines and send a summary to an output file. Then I search all the lines again for another string, and so on...
I am using something like this:
for thisScaff in AllScaffs:
    InFile = open(sys.argv[2], 'r')  # the big file is reopened for every scaffold
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            ...  # Then do this stuff...
The main problem seems to be that all 800 million lines have to be scanned to find those that match the current string, and once I move on to another string, all 800 million have to be scanned again. I have been exploring grep options, but is there another way?
Many thanks in advance!
My first instinct would be to load your data into a database, making sure to create an index on column 0, and then query as needed.
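If you go that route, here's a minimal sketch using Python's built-in sqlite3 module. The database file name, table name, and column names are placeholders I made up, and I've reused sys.argv[2] for the input path only because your snippet does; adjust to your setup. You pay the load cost once, and after that each scaffold lookup is an indexed query instead of a full scan of 800 million lines:

import sqlite3
import sys

# One-time load into a local SQLite file (name is a placeholder).
conn = sqlite3.connect('scaffolds.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (scaff TEXT, rest TEXT)')

with open(sys.argv[2]) as infile:
    # Split each line into (first column, everything else) and insert.
    conn.executemany(
        'INSERT INTO rows VALUES (?, ?)',
        (line.rstrip('\n').split(None, 1) for line in infile),
    )

# Build the index after loading, then each lookup avoids a full scan.
conn.execute('CREATE INDEX IF NOT EXISTS idx_scaff ON rows (scaff)')
conn.commit()

for (rest,) in conn.execute('SELECT rest FROM rows WHERE scaff = ?', ('scaffold126',)):
    ...  # summarize your data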
For a Python approach, try this:
wanted_scaffs = {'scaffold126', 'scaffold5112'}
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
for line in big_file:  # big_file is your already-opened input file, e.g. open(sys.argv[2])
    curr_scaff = line.split(None, 1)[0]  # minimal splitting: only the first column is needed
    if curr_scaff in wanted_scaffs:
        files[curr_scaff].write(line)
for f in files.values():
    f.close()
Then do your summary reports:
for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ...  # summarize your data