Tags: python, large-data

Python large files: how to find specific lines with a particular string


I am using Python to process data from very large text files (~52GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to pull out specific lines. Luckily, the string I am searching for is always in the first column.

The whole thing works, memory is not a problem (I'm not loading the file into memory, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run!

The data looks something like this:

scaffold126     1       C       0:0:20:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0     
scaffold126     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0
scaffold5112     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0
scaffold5112     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0

and I am searching for all the lines whose first column matches a particular string. I want to process those lines and send a summary to an output file. Then I search all the lines again for another string, and so on...

I am using something like this:

import sys

for thisScaff in AllScaffs:
    with open(sys.argv[2], 'r') as InFile:  # reopen the 52GB file for every scaffold
        for line in InFile:
            LineList = line.split()
            currentScaff = LineList[0]
            if thisScaff == currentScaff:
                ...  # Then do this stuff...

The main problem is that all 800 million lines have to be scanned to find the ones matching the current string. Then, once I move on to another string, all 800 million have to be scanned again. I have been exploring grep options, but is there another way?

Many thanks in advance!


Solution

  • My first instinct would be to load your data into a database, making sure to create an index on column 0, and then query as needed.
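
    If you do go the database route, here is a minimal sqlite3 sketch. The table name, column names, and database file are hypothetical (not from the original post); the input file is taken from sys.argv[2] as in your own code, and every line is assumed to be well-formed:

    import sqlite3
    import sys

    conn = sqlite3.connect('scaffolds.db')  # hypothetical database file
    conn.execute('CREATE TABLE data (scaff TEXT, pos INTEGER, rest TEXT)')

    # One pass to load: split off the first two columns, keep the other 28 as one string.
    with open(sys.argv[2]) as big_file:
        rows = (line.rstrip('\n').split(None, 2) for line in big_file)
        conn.executemany('INSERT INTO data VALUES (?, ?, ?)', rows)

    conn.execute('CREATE INDEX idx_scaff ON data (scaff)')  # the index is what makes lookups fast
    conn.commit()

    # Each query now touches only the matching rows, not all 800 million lines.
    for row in conn.execute('SELECT * FROM data WHERE scaff = ?', ('scaffold126',)):
        ...  # summarize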

    For a Python approach, try this:

    wanted_scaffs = {'scaffold126', 'scaffold5112'}
    files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
    with open(sys.argv[2], 'r') as big_file:
        for line in big_file:
            curr_scaff = line.split(None, 1)[0]  # minimal splitting; None splits on any whitespace run
            if curr_scaff in wanted_scaffs:
                files[curr_scaff].write(line)
    for f in files.values():
        f.close()
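
    This way the 52GB file is read only once: each wanted scaffold's lines land in their own much smaller file, so the summaries below never have to rescan all 800 million lines for every search string.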
    

    Then do your summary reports:

    for scaff in wanted_scaffs:
        with open(scaff + '.txt', 'r') as f:
            ... # summarize your data
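
    For example, if the summary were simply a count of matching lines (a hypothetical stand-in, since the post doesn't say what the summary computes), the body could be:

    for scaff in wanted_scaffs:
        with open(scaff + '.txt', 'r') as f:
            n_lines = sum(1 for _ in f)  # hypothetical summary: lines seen for this scaffold
        print(scaff, n_lines)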