Search code examples
bashcsvawksedcut

joining selected columns from multiple tsv files in bash


I have a bunch of tab separated text files like the following one

"gene_id"   "Pattern1"  "Pattern2"  "Pattern3"  "Pattern4"  "Pattern5"  "MAP"   "PPDE"
"ENSG00000119771.13"    3.11528786599051e-18    2.52650109640992e-13    6.25109524320237e-09    0.345846257420197   0.654153736328455   "Pattern5"  1
"ENSG00000123700.4" 1.75016991626305e-36    3.98804090894939e-19    0.63423772228367    3.8159144080782e-21 0.36576227771633    "Pattern3"  1
"ENSG00000128567.15"    1.10722918612618e-23    7.62691311068806e-07    5.77031364194955e-06    5.13675840911147e-21    0.999993466995047   "Pattern5"  1
"ENSG00000130182.6" 9.75717082221716e-22    1.27675651077242e-12    0.469972541094369   1.13677117238758e-12    0.530027458903217   "Pattern5"  1
"ENSG00000131914.9" 3.1627489688037e-41 1.00274706758683e-22    0.0578584524816503  6.98718794692175e-22    0.94214154751835    "Pattern5"  1

Now I want to join them into one file in the so that I get

"gene_id"   "Pattern5"  "Pattern5"  "Pattern5"  "Pattern5"  "Pattern5"  

where each Pattern5 column comes from one file.

I tried some stuff with

cut -f 6 <file>

and

paste <file1> <file2> ...

but I couldn't combine it correctly.

Thanks for you help!

UPDATE: I try to give you a testable example as input here:

<file1>
gene_id Pattern1    Pattern2    Pattern3    Pattern4    Pattern5
ENSG00000119771 1   2   3   4   5
ENSG00000123700 1   2   3   4   5
ENSG00000128567 1   2   3   4   5
ENSG00000130182 1   2   3   4   5
ENSG00000131914 1   2   3   4   5

<file2>         
gene_id Pattern1    Pattern2    Pattern3    Pattern4    Pattern5
ENSG00000119771 6   7   8   9   10
ENSG00000123700 6   7   8   9   10
ENSG00000128567 6   7   8   9   10
ENSG00000130182 6   7   8   9   10
ENSG00000131914 6   7   8   9   10

<file3>             
gene_id Pattern1    Pattern2    Pattern3    Pattern4    Pattern5
ENSG00000119771 11  12  13  14  15
ENSG00000123700 11  12  13  14  15
ENSG00000128567 11  12  13  14  15
ENSG00000130182 11  12  13  14  15
ENSG00000131914 11  12  13  14  15

and the desired output would be

gene_id Pattern5_file1  Pattern5_file2  Pattern5_file3
ENSG00000119771 5   10  15
ENSG00000123700 5   10  15
ENSG00000128567 5   10  15
ENSG00000130182 5   10  15
ENSG00000131914 5   10  15

UPDATE2: I tried the approach of Ed Morton:

awk '
BEGIN { FS=OFS="\t" } FNR==1{ARGIND++}
{ genes[$1]; val[$1,ARGIND] = $5 }
END {
    for (gene in genes) {
        printf "%s%s", gene, OFS
        for (file=1; file<=ARGIND; file++) {
            printf "%s%s", val[gene,file], (file<ARGIND?OFS:ORS)
        }
    }
} ' $files

but the output is not in the right format:

ENSG00000128567 4   9   14
ENSG00000130182 4   9   14
ENSG00000119771 4   9   14
gene_id Pattern4    Pattern4    Pattern4
ENSG00000131914 4   9   14
ENSG00000123700 4   9   14

Solution

  • try this one

    #!/bin/bash
    
    paste file1 file2 file3 | awk -v patternIdx=6 '
    
    function printPattern(idx, isFirstLine) {
        for (i = 1; i <= NF; ++i) { 
            if (i == 1) 
                printf "%s ", $i;
            else if (isFirstLine && i % patternIdx == 0)
                printf "%s_file%d ", $i, i / patternIdx;
            else if (i % patternIdx == 0)
                printf "%d ", $i;
        }
        printf "\n"
    } 
    { 
        if (NR == 1)
            printPattern(patternIdx, 1);
        else
            printPattern(patternIdx, 0); 
    }'
    

    patternIdx is the column index of Pattern5