I have two huge flat files on Unix (Solaris), each about 500-600 GB. I need to join and merge the two files into a single flat file using the first column, which is a key index column. How can I do this in an optimized way?
Basically, it should be an inner join between the two flat files. The reason for using flat files is that a huge table was split into two separate tables, each of which was extracted into a flat file, and I am trying to join them at the Unix level instead of at the database level.
I used the commands below:
sort -n file1 > file_temp1;
sort -n file2 > file_temp2;
join -j 1 -t';' file_temp1 file_temp2 > Final
The sort works fine, as the 1st field is the index column. However, when the join happens, I get barely 2% of the data in the Final file. So I am trying to understand what mistake I am making in the join command. Both files contain about 0.2 million records, and all of the records match between the two files. I want to check whether a join performed at the Unix level would be faster than one performed at the database level.

Sorry for the incomplete question! The first field is a numeric index field. Is there something like a "-n" switch to tell join that the first field is a numeric index?
You should not sort -n, since join has no corresponding -n flag: join always compares keys as strings, in lexicographic order, so its inputs must be sorted the same way. A numeric sort puts 2 before 10, but as strings "10" sorts before "2", so join walks past keys it believes it has already handled and silently drops the matches (see the small demonstration after the script). Sort on the first field with the same separator instead, and keep all the leading/trailing whitespace as it is:
#!/bin/sh
# Sort both files lexicographically on field 1 only (-k1,1);
# a bare -k1 would compare from field 1 through the end of the line.
sort -t';' -k1,1 file1 > file1.srt
sort -t';' -k1,1 file2 > file2.srt
# Inner join on field 1 of each file, with ';' as the separator.
join -t';' -1 1 -2 1 file1.srt file2.srt > both
#cat both
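
To see the failure concretely, here is a minimal sketch with two hypothetical throwaway files, a.txt and b.txt. Numerically 2 < 10, but as strings "10" < "2", so a numeric sort hands join its keys in the wrong order and a match silently disappears:

# a.txt and b.txt are throwaway demo files, not your real data
printf '1;x\n10;x\n'      > a.txt
printf '1;y\n2;y\n10;y\n' > b.txt
sort -n a.txt > a.num; sort -n b.txt > b.num
join -t';' a.num b.num          # prints only "1;x;y" -- the 10 match is lost
sort -t';' -k1,1 a.txt > a.lex; sort -t';' -k1,1 b.txt > b.lex
join -t';' a.lex b.lex          # prints "1;x;y" and "10;x;y"

(Some join implementations, GNU join in particular, will also warn that an input "is not in sorted order".)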
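
For inputs of 500-600 GB, two more things help. First, make sure sort and join agree on collation by running both in the C locale; a locale mismatch between the two commands is another common way to lose records. Second, point sort's temporary files at a filesystem with enough free space. A sketch, assuming your sort supports -T (GNU sort does; check your sort(1) man page) and that /bigtmp is a hypothetical scratch filesystem with several hundred GB free:

# C locale: plain byte-order collation for both sort and join
LC_ALL=C; export LC_ALL
sort -t';' -k1,1 -T /bigtmp file1 > file1.srt
sort -t';' -k1,1 -T /bigtmp file2 > file2.srt
join -t';' -1 1 -2 1 file1.srt file2.srt > both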