Tags: python, split, large-files, csv

Splitting a large tab-delimited file in Python


I have a large tab-delimited file containing about 1.4 million lines and 50 columns. Before I do anything with the data in the file, I want to split it into a few thousand smaller files. The first column of my file contains position information, and I want each smaller file to cover a specific interval of those positions. In separate lists, called start_L and stop_L, I have the start and stop of each interval that I want to split the larger file by. Here is the part of my code that does this operation:

import csv

for i in range(len(id)):
    out1 = ("file%s.txt") % (id[i])
    table = open('largefile.tsv', "r")
    start = int(start_L[i])
    stop = int(stop_L[i])
    table.next()  # skip the header line
    temp_out = open(out1, "w")
    reader = csv.reader(table, delimiter="\t")
    for line in reader:
        if int(line[0]) in range(start, stop):
            for y in line:
                temp_out.write(("%s\t") % (y))
            temp_out.write("\n")
        else:
            if int(line[0]) > stop:
                break
    temp_out.close()
    print "temporary file...", id[i]

The above code achieves what I want, but it is extremely slow. It can process the first hundred or so intervals in a matter of minutes, but it gets progressively slower with each interval, so it takes days to run. Is there a faster or more efficient way of doing this? I believe the problem is that it has to re-read the entire file from the beginning to find the positions within the specified interval on every pass through the loop.
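A side note for anyone profiling this: in Python 2, the test int(line[0]) in range(start, stop) builds a new list and scans it linearly on every single row, so a chained comparison does the same check in constant time (it does not fix the repeated rescans, which remain the dominant cost). A minimal sketch of the filtering loop for one interval, with illustrative bounds and file names:

import csv

start, stop = 1, 16                # illustrative bounds for one interval
temp_out = open("filea.txt", "w")  # illustrative output name

table = open('largefile.tsv', "r")
table.next()  # skip the header line (Python 2 file iterator)
reader = csv.reader(table, delimiter="\t")
for line in reader:
    pos = int(line[0])
    if start <= pos < stop:        # constant-time check, no list built
        temp_out.write("\t".join(line) + "\n")
    elif pos > stop:               # positions are sorted, so stop early
        break
table.close()
temp_out.close()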


Solution

  • OK, I tried to keep this in the spirit of your code. It only iterates through the big file once, and it does not bother to parse the lines via the csv module, since you were just rejoining them during the write.

    id = ("a", "b")
    start_L = (1, 15)
    stop_L = (16, 40)

    i = 0
    table = open('largefile.tsv', "r")
    out1 = ("file%s.txt") % (id[i])
    temp_out = open(out1, "w")

    # start iterating through the file
    for line in table:
        stop = int(stop_L[i])

        # Split the line into a position piece and a
        # throwaway variable based upon the 1st tab char
        position, the_rest = line.split("\t", 1)

        # I'm ignoring start, as you mentioned the file is sorted
        if int(position) >= stop:
            # Close the current file
            temp_out.close()

            # Increment the index so the file name is pulled from id
            # properly. If the index is past the length of the id list
            # then break, otherwise open the new file for writing.
            i += 1
            if i < len(id):
                out1 = ("file%s.txt") % (id[i])
                temp_out = open(out1, "w")
            else:
                break

        temp_out.write(line)

    # Make sure the last file is flushed if the data runs out
    # before the final interval's stop position.
    if not temp_out.closed:
        temp_out.close()
    table.close()
    

    My test file lines looked like

    1       1a      b       c       d       e
    2       2a      b       c       d       e
    3       3a      b       c       d       e
    
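    With those interval lists, positions 1 through 15 end up in filea.txt and positions 16 through 39 in fileb.txt: a line is routed to the next file as soon as its position reaches the current stop value, so each interval is effectively half-open.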

    This could be simplified quite a bit depending on your specific data, but I hope it at least gives you a start.
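    If you're on a newer Python, the same single-pass idea can be wrapped in a small helper using a with block. This is only a sketch under the same assumptions (integer positions in column 1, sorted ascending, no header row), and the name split_by_intervals is mine, not from any library:

    def split_by_intervals(src_path, ids, stops):
        # Single pass over the sorted file: route each line to the
        # output file for the interval it currently falls in.
        i = 0
        out = open("file%s.txt" % ids[i], "w")
        with open(src_path) as table:
            for line in table:
                position = int(line.split("\t", 1)[0])
                # Advance past any finished intervals; the while loop
                # also handles a position that jumps over several stops.
                while position >= stops[i]:
                    out.close()
                    i += 1
                    if i >= len(ids):
                        return  # no more intervals; drop remaining lines
                    out = open("file%s.txt" % ids[i], "w")
                out.write(line)
        out.close()

    split_by_intervals('largefile.tsv', ("a", "b"), (16, 40))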