I have about 50 large datasets, each with roughly 200K-500K columns, and I am trying to find a way to merge/concatenate them efficiently. What is the fastest way to do conditional column concatenation (merging) of these files?
Currently I have code that works, listed below, but it takes several hours (at least 12) to do the job on my datasets. Keeping in mind that these input files will be extremely large, is there any way to tweak this code to use as little memory as possible? One clue I came up with (from looking at my code below) is to close files after opening them, but I am not sure how to do that.
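(From what I can tell, the closing pattern would look something like this minimal sketch with contextlib.ExitStack, shown here on two of the sample files below, but I have not managed to work it into my loop:)
import gzip
from contextlib import ExitStack

# Sketch only: ExitStack closes every registered file when the block exits,
# even if an error is raised part-way through
with ExitStack() as stack:
    files = [stack.enter_context(gzip.open(f, 'rt'))
             for f in ('test_c1_k2_txt.gz', 'test_c1_k4_txt.gz')]
    headers = [f.readline() for f in files]
# both files are already closed here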
Note that:
a. All files have the same number of rows
b. The first two columns are the same throughout the files
c. All files are tab delimited
d. This code works but it is ridiculously slow!
My code, listed below, works for the sample datasets. Like my large datasets, the sample datasets below share the same first two columns. I would appreciate any feedback or suggestions on how to make the code run efficiently, or alternative methods that do the job efficiently.
Input 1: test_c1_k2_txt.gz :-
c1 c2 1.8 1.9 1.7
L1 P 0.5 1.4 1.1
L2 P 0.4 1.8 1.2
L3 P 0.1 1.9 1.3
Input 2: test_c1_k4_txt.gz :-
c1 c2 0.1 0.9 1.1 1.2
L1 P 1.8 1.7 1.8 2.8
L2 P 1.3 1.4 1.2 1.1
L3 P 1.7 1.6 1.5 1.4
Input 3: test_c3_k1_txt.gz :-
c1 c2 1.3 1.4
L1 P 1.1 2.9
L2 P 2.2 1.4
L3 P 1.7 1.6
Output : - test_all_c_all_k_concatenated.txt.gz :-
c1 c2 1.8 1.9 1.7 0.1 0.9 1.1 1.2 1.3 1.4
L1 P 0.5 1.4 1.1 1.8 1.7 1.8 2.8 1.1 2.9
L2 P 0.4 1.8 1.2 1.3 1.4 1.2 1.1 2.2 1.4
L3 P 0.1 1.9 1.3 1.7 1.6 1.5 1.4 1.7 1.6
Python Code for Merging/Concatenation
import os, glob, gzip, time

start_time = time.time()

max_c = 3
max_k = 4
filearr = []
# Loop through the files in the order of "c" first and then "k", and build a file list
for c in range(1, max_c + 1):
    for k in range(1, max_k + 1):
        # Build the file name
        fname = "test_c" + str(c) + "_k" + str(k) + "_txt.gz"
        # If the file exists, open it and add it to the list
        if os.path.exists(fname):
            print("Input file " + fname + " exists ... ")
            files = [gzip.open(f, 'rt') for f in glob.glob(fname)]
            filearr = filearr + files

# List to which each file's columns are appended
d = []
for file in filearr:
    # Strip and split each line of the file
    row_list = [line.rstrip().split('\t') for line in file.readlines()]
    # Transpose the rows so each entry of row_list_t is one column
    row_list_t = [[r[col] for r in row_list] for col in range(len(row_list[0]))]
    # Collect the columns from every file
    d = d + row_list_t
    # Close each file once it has been read
    file.close()

# Drop duplicate columns (this removes the repeated first two columns, but
# would also drop any value columns that happen to be identical)
temp = []
for i in d:
    if i not in temp:
        temp.append(i)

# Transpose back so each entry of appended is one output row
appended = [[r[col] for r in temp] for col in range(len(temp[0]))]

# Write the output dataset as a tab-delimited gzip file
outfile = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')
for i in appended:
    outfile.write('\t'.join(i) + '\n')
outfile.close()

print('executed prob file concatenation successfully.')
total_time = time.time() - start_time
print("Total time it took to finish:", total_time)
The following code is an efficient way to handle the data merge problem. It opens all the files. Then it copies the first line of the first data file -- the two header columns plus all of that file's values -- to the output. Next, for each input file except the first, it reads a line, zaps the two header columns, and appends the remaining values to the same output line. Each input file's values are separated from the previous ones by a tab.
Have fun!
#!/usr/bin/env python
import glob, gzip, re

data_files = [gzip.open(name, 'rt') for name in sorted(
    glob.glob('*_txt.gz')
)]

# we'll use the two header columns from the first file
firstf = data_files.pop(0)

outf = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')

for recnum, fline in enumerate(firstf):
    print('record', recnum + 1)
    # output header columns plus first batch of data
    outf.write(fline.rstrip())
    # for each remaining input, read one line of data, append its values
    for dataf in data_files:
        # read the line with headers and values
        line = next(dataf)
        # zap the two header columns
        line = re.sub(r'^\S+\s+\S+\s+', '', line)
        # separate this file's values from the previous ones
        outf.write('\t')
        outf.write(line.rstrip())
    # finish the line of data
    outf.write('\n')

outf.close()
for dataf in [firstf] + data_files:
    dataf.close()
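Memory use stays flat because only one line per file is in memory at a time, no matter how many rows the files have. If you prefer to avoid the regex, an equivalent streaming sketch (untested against your full data) iterates all the files in lockstep with zip() and splits on tabs instead:
import glob, gzip

files = [gzip.open(name, 'rt') for name in sorted(glob.glob('*_txt.gz'))]
with gzip.open('all_c_all_k_concatenated.txt.gz', 'wt') as outf:
    for rows in zip(*files):
        # the first file keeps its two header columns; the rest drop theirs
        merged = rows[0].rstrip('\n').split('\t')
        for row in rows[1:]:
            merged.extend(row.rstrip('\n').split('\t')[2:])
        outf.write('\t'.join(merged) + '\n')
for f in files:
    f.close()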