Thousands of files end with *.tab. The first column in each file is a header (row label). Every file has its own headers, so they differ between files; I don't mind keeping the header column from just one of the files.
The number of rows is the same in all files, and the rows are in the same order. My desired output should keep that order. The input files are:
test_1.tab
test_2.tab
.
.
.
.
test_1990.tab
test_2000.tab
One input file, test_1.tab for example, looks like:
Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 .....0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 .....1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 .....1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 .....0
and another, such as test_2000.tab:
Pro_1901 1 1 1 1 0 1 1 0 0 0 0 1 .....0
Pro_1902 1 1 1 0 0 0 1 0 0 0 0 0 .....1
Pro_1903 1 1 0 1 0 1 0 0 0 0 0 1 .....1
.
.
.
Pro_2000 1 0 0 0 0 1 1 1 1 1 0 .....0
The desired output (final.tab):
Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 0 ..... 1 1 1 1 0 1 1 0 0 0 0 1 0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 1 ..... 1 1 1 0 0 0 1 0 0 0 0 0 1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 1 ..... 1 1 0 1 0 1 0 0 0 0 0 1 1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 0 ..... 1 0 0 0 0 1 1 1 1 1 0 0
This is what I have been doing:
for i in *.tab; do paste allCol.tab <(cut -f 2- "$i") > intermediate.csv; mv intermediate.csv allCol.tab; done
paste <(cut -f1 test_1.tab) allCol.tab > final.tab
rm allCol.tab
It takes quite a long time, around 3 hours. Is there a better way? Also, is there another command to cross-check the output file against all the input files, such as diff or wc?
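Something along these lines is what I have in mind for cross-checking (just a rough idea, assuming tab-separated fields and that final.tab is the merged output):

wc -l test_1.tab final.tab                       # both should report the same number of lines
awk -F'\t' '{print NF}' final.tab | sort -u      # a single value means every row has the same number of fields
diff <(cut -f1 test_1.tab) <(cut -f1 final.tab)  # no output means the row labels and order match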
Try this.
#!/bin/bash
TMP=tmp
mkdir "$TMP"
RESULT=result
#read each input file and append the data fields of its i-th line
#to the file $TMP/i, so every row collects one line per input file
for f in *.tab; do
    i=1
    while read -r l; do
        #zero-pad the per-row file names so the glob below keeps numeric order
        printf -v n '%06d' "$i"
        echo "$l" >> "$TMP/$n"
        ((i++))
    done < <(cut -f2- "$f")
done
#integrate each file in the tmp dir into a single line of the $RESULT file
exec 1>>"$RESULT"
for f in "$TMP"/*; do
    while read -r l; do
        printf '%s\t' "$l"
    done < "$f"
    echo
done
rm -r "$TMP"
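Note that $RESULT contains only the data columns, since cut -f2- drops the header column of every file. If you want the row labels back in front, as in your desired output, you can paste them on from any one input file afterwards, for example:

paste <(cut -f1 test_1.tab) result > final.tab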
This algorithm can be split across a number of processors so the task gets done faster; a rough sketch of that is below.
You can also add safety checks, for example that $TMP
was created successfully.
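The splitting could look roughly like this (a sketch only, not a finished script: it assumes bash, file names without whitespace, and uses chunk_* and partial_* as scratch names; within each chunk it merges with repeated paste, like your original loop, rather than the tmp-directory approach above):

ls *.tab | split -l 500 - chunk_              #one list of 500 input files per worker
for list in chunk_*; do
    (
        out="partial_${list#chunk_}"
        first=1
        while read -r f; do
            if [ "$first" -eq 1 ]; then
                cut -f2- "$f" > "$out"        #first file of the chunk starts the partial result
                first=0
            else
                paste "$out" <(cut -f2- "$f") > "$out.tmp" && mv "$out.tmp" "$out"
            fi
        done < "$list"
    ) &                                       #run this chunk in the background
done
wait                                          #wait for all workers to finish
paste <(cut -f1 test_1.tab) partial_* > final.tab
rm chunk_* partial_*

Each worker still pastes its files one at a time, so this only spreads the existing work over several CPUs; it does not change the algorithm itself.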