
How to join 'n' files column-wise, in order and efficiently, using paste/join, other Linux tools, or Perl?


I have thousands of files ending in .tab. The first column in each file is a header. Every file has its own headers (so they differ between files). I don't mind keeping the header from just one file.

The number of rows is the same in all files, and the rows are in a fixed order. My desired output must keep that order.

Example files in a directory

test_1.tab
test_2.tab
.
.
.
.
test_1990.tab
test_2000.tab

test_1.tab

Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 .....0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 .....1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 .....1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0  .....0

test_2000.tab

Pro_1901 1 1 1 1 0 1 1 0 0 0 0 1 .....0
Pro_1902 1 1 1 0 0 0 1 0 0 0 0 0 .....1
Pro_1903 1 1 0 1 0 1 0 0 0 0 0 1 .....1
.
.
.
Pro_2000 1 0 0 0 0 1 1 1 1 1 0  .....0

desired output

Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 0 ..... 1 1 1 1 0 1 1 0 0 0 0 1 0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 1 ..... 1 1 1 0 0 0 1 0 0 0 0 0 1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 1 ..... 1 1 0 1 0 1 0 0 0 0 0 1 1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 0  ..... 1 0 0 0 0 1 1 1 1 1 0 0

My code

for i in *.tab; do paste allCol.tab <(cut -f 2- "$i") > intermediate.csv; mv intermediate.csv allCol.tab ; done

paste <(cut -f1 test_1.tab) allCol.tab > final.tab
rm allCol.tab

It takes quite a long time, around 3 hours. Is there a better way? Also, is there a command to cross-check the output file against all the input files, such as diff or wc?


Solution

  • Try this.

    #!/bin/bash

    TMP=tmp
    mkdir "$TMP"
    RESULT=result

    # Read each file and append the data columns of each row
    # to a per-row file in the tmp directory (tmp/1, tmp/2, ...).
    for f in *.tab; do
        i=1
        while IFS= read -r l; do
            printf '%s\n' "$l" >> "$TMP/$i"
            ((i++))
        done < <(cut -f2- "$f")
    done

    # Join each per-row file into a single tab-separated line of $RESULT.
    # Walk the row numbers explicitly: a plain "$TMP"/* glob sorts
    # 10 before 2 and would scramble the row order.
    i=1
    while [ -f "$TMP/$i" ]; do
        paste -s "$TMP/$i"
        ((i++))
    done > "$RESULT"

    rm -r "$TMP"
    

    This algorithm can be split across several processors so that the task finishes faster.
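
    One possible way to split the work is sketched below. It is a different technique from the script above: it runs paste on batches of input files in the background and then pastes the partial results together. The function names merge_in_batches and paste_data_columns, the part_* file names, and the batch size are hypothetical, and the sketch assumes bash, tab-separated files, and the same number of rows in every file. It starts one background job per batch with no cap on the number of jobs, and it does not check whether a job failed.

```shell
#!/bin/bash

# Paste the data columns (field 2 onward) of the given files into "$1".
paste_data_columns() {
    local out=$1 dir f col n=0
    shift
    dir=$(mktemp -d) || return 1
    local cols=()
    for f in "$@"; do
        # Zero-padded names keep the columns in argument order.
        col=$(printf '%s/%06d' "$dir" "$n")
        cut -f2- "$f" > "$col"
        cols+=("$col")
        n=$((n + 1))
    done
    paste "${cols[@]}" > "$out"
    rm -r "$dir"
}

# Usage: merge_in_batches OUTPUT BATCH_SIZE FILE...
merge_in_batches() {
    local out=$1 batch=$2 work start part=0
    shift 2
    work=$(mktemp -d) || return 1

    # Each batch is independent of the others, so run it in the background.
    for ((start = 1; start <= $#; start += batch)); do
        paste_data_columns "$(printf '%s/part_%06d' "$work" "$part")" \
            "${@:start:batch}" &
        part=$((part + 1))
    done
    wait

    # First column of the first file, then the partial results in order.
    cut -f1 "$1" | paste - "$work"/part_* > "$out"
    rm -r "$work"
}
```

    It could be called as merge_in_batches final.tab 100 $(ls test_*.tab | sort -t_ -k2n). The numeric sort matters because a plain *.tab glob lists test_10.tab before test_2.tab. Keeping each batch small also keeps the number of files that paste has open at once well below typical per-process limits.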

    You can also extend it, for example by checking that $TMP was created successfully.
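
    A minimal sketch of such a check, assuming bash and the mktemp utility. The function name make_workdir and the tabjoin prefix are hypothetical. It replaces the fixed directory name tmp with a uniquely named directory, so it is a variation on the script above rather than a drop-in part of it.

```shell
#!/bin/bash

# Create a uniquely named scratch directory and print its path.
# Returns non-zero, with a message on stderr, if creation fails.
make_workdir() {
    local dir
    dir=$(mktemp -d "${TMPDIR:-/tmp}/tabjoin.XXXXXX") || {
        echo "cannot create temporary directory" >&2
        return 1
    }
    printf '%s\n' "$dir"
}

# Stop early if the directory is missing, and clean it up on exit.
TMP=$(make_workdir) || exit 1
trap 'rm -rf "$TMP"' EXIT
```

    The trap removes the directory even if the script stops partway through, so the final rm -r "$TMP" is no longer needed in that variant.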