Tags: python, merge, concatenation, memory-efficient

How to perform efficient concatenation (merge) in Python


I have about 50 big datasets, each with roughly 200K-500K columns, and I am trying to think of a way to merge/concatenate these datasets efficiently. What is the fastest way to do conditional column concatenation (merging) of these files?

Currently, I have code that works, listed below, but it takes several hours (at least 12) to do the job for my dataset. Keeping in mind that these input files (datasets) will be extremely large, is there any way to tweak this code to use as little memory as possible? One idea I came up with (from looking at my code below) is to close files after opening them, but I am not sure how to do that.
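For reference, the usual way in Python to guarantee a file gets closed is a `with` block, which closes the handle when the block ends, even if an error occurs. A minimal sketch on one of the sample inputs below:

import gzip

# The with-block closes the file automatically when the block ends,
# even on errors; file name taken from the sample inputs below.
with gzip.open('test_c1_k2_txt.gz', 'rt') as f:
    header = f.readline().rstrip().split('\t')
# f is closed at this point
print(header)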

Note that:
a.  All files have the same number of rows
b.  The first two columns are the same throughout the files
c.  All files are tab delimited
d.  This code works but it is ridiculously slow!

My code, stated below, works for the sample datasets. Like my large datasets, the sample datasets below share the same first two columns. I appreciate any feedback or suggestions on how to make the code run efficiently, or alternative methods to do the job efficiently (one library-based idea is sketched after the sample output below).

Input 1: test_c1_k2_txt.gz :-
c1  c2  1.8 1.9 1.7
L1  P   0.5 1.4 1.1
L2  P   0.4 1.8 1.2
L3  P   0.1 1.9 1.3

Input 2: test_c1_k4_txt.gz :-
c1  c2  0.1 0.9 1.1 1.2
L1  P   1.8 1.7 1.8 2.8
L2  P   1.3 1.4 1.2 1.1
L3  P   1.7 1.6 1.5 1.4

Input 3: test_c3_k1_txt.gz :-
c1  c2  1.3 1.4
L1  P   1.1 2.9
L2  P   2.2 1.4
L3  P   1.7 1.6

Output: test_all_c_all_k_concatenated.txt.gz :-
c1  c2  1.8 1.9 1.7 0.1 0.9 1.1 1.2 1.3 1.4
L1  P   0.5 1.4 1.1 1.8 1.7 1.8 2.8 1.1 2.9
L2  P   0.4 1.8 1.2 1.3 1.4 1.2 1.1 2.2 1.4
L3  P   0.1 1.9 1.3 1.7 1.6 1.5 1.4 1.7 1.6
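For comparison, here is a library-based alternative (not in my code below): letting pandas do the column concatenation. This is only a sketch; with 200K-500K columns per file it assumes everything fits in memory, which may not hold for the real datasets.

import glob
import pandas as pd

frames = []
for i, name in enumerate(sorted(glob.glob('*_txt.gz'))):
    # Treat every row as data, since the first line holds values too
    df = pd.read_csv(name, sep='\t', header=None)
    if i > 0:
        # Drop the shared first two key columns from all but the first file
        df = df.iloc[:, 2:]
    frames.append(df)

# Concatenate the value columns side by side (rows are aligned by position)
merged = pd.concat(frames, axis=1)
merged.to_csv('all_c_all_k_concatenated.txt.gz', sep='\t',
              header=False, index=False, compression='gzip')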

Python Code for Merging/Concatenation

import os, glob, gzip, time


start_time = time.time()

max_c = 3
max_k = 4

filearr = []

# Loop through the files, in the order of "c" first and then in the order of "k", and build a list of open file handles
for c in range(1, max_c + 1):
    for k in range(1, max_k + 1):
        # Set my string of file name
        fname = "test_c" + str(c) + "_k" + str(k) + "_txt.gz"
        # If the file name specified exists, ..
        if os.path.exists(fname):
            print("Input file " + fname + " exists ... ")
            # Open the file in text mode and add it to the list
            files = [gzip.open(f, 'rt') for f in glob.glob(fname)]
            filearr = filearr + files

# Initialize a list to append columns to
d = []
for fh in filearr:
    # Strip and split each line of the file into fields
    row_list = [line.rstrip().split('\t') for line in fh]
    # Close the file as soon as it has been read
    fh.close()
    # Transpose the rows into columns for this file
    row_list_t = [[r[col] for r in row_list] for col in range(len(row_list[0]))]
    # Add the transposed columns from this file to the combined list
    d = d + row_list_t

# Keep each distinct column once (the shared first two columns appear in every file)
temp = []
for i in d:
    if i not in temp:
        temp.append(i)
# Transpose back from columns to rows
appended = [[r[col] for r in temp] for col in range(len(temp[0]))]

# Write the output dataset into a tab-delimited gzip file
outfile = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')
for i in appended:
    outfile.write('\t'.join(i) + '\n')
outfile.close()
print('executed prob file concatenation successfully.')

total_time = time.time() - start_time
print("Total time it took to finish:", total_time)

Solution

  • The following code is an efficient way to handle the merge: it streams through the files one record at a time instead of loading everything into memory. It opens all the files, then, for each line of the first data file, writes the line as-is (the two key columns plus that file's values). Next, for each remaining input file, it reads the corresponding line, strips the two key columns, and appends the values, separated from the previous file's values by a tab.

    Have fun!

    #!/usr/bin/env python
    
    import glob, gzip, re
    
    data_files = [ gzip.open(name, 'rt') for name in sorted(
        glob.glob('*_txt.gz')
    ) ]
    
    # we'll use the two header columns from the first file
    firstf = data_files.pop(0)
    
    outf = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')
    for recnum, fline in enumerate(firstf):
    
        print('record', recnum + 1)
    
        # output header columns plus first batch of data
        outf.write(fline.rstrip())
    
        # for each input, read one line of data, write values
        for dataf in data_files:
            # read the matching line, with headers and values
            line = next(dataf)
    
            # zap the two header columns
            line = re.sub(r'^\S+\s+\S+\s+', '', line)
    
            # keep the output tab-delimited, like the inputs
            outf.write('\t')
            outf.write(line.rstrip())
    
        # finish the line of data
        outf.write('\n')
    
    outf.close()
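    As a quick sanity check (not part of the answer itself), you can re-read the merged file and confirm every row has the expected field count; for the three sample inputs that is 2 key columns + 3 + 4 + 2 value columns = 11 fields:
    
    import gzip
    
    # Assumes the three sample inputs above were merged
    with gzip.open('all_c_all_k_concatenated.txt.gz', 'rt') as f:
        for line in f:
            fields = line.rstrip().split('\t')
            assert len(fields) == 11, len(fields)
    print('merged file looks consistent')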