Search code examples
bashshellmultiple-columns

Join on two columns from seperate files


I have two files:

File1:

   chr1 100736194  46   0.731   +   100735713   100736636
   chr1 100736194  49   0.879   +   100735723   100736646
   chr1 100736196  54   0.952   +   100735753   100736666

File2:

   chr1 100735713  100736636  +  BMCHAS
   chr1 100735723  100736646  +  ATCGSG
   chr1 100735753  100736666  +  BDUSUS

I want to join file 1 and 2 based on

  • file1's column 6 and 7, and
  • file2's column 2 and 3

to output something like:

   chr1 100736194  46   0.731   +   100735713   100736636  BMCHAS
   chr1 100736194  49   0.879   +   100735723   100736646  ATCGSG
   chr1 100736196  54   0.952   +   100735753   100736666  BDUSUS

I don't know how to use join for more than a single column and I looked up an awk solution and have been trying to implement that but I think I am not right somewhere, the command that I am using is: awk 'NR==FNR{a[$1,$2]=$5;next}{$3=a[$7,$8];print}' OFS='\t' file2 file1,

Also, it is important to match not one, but two columns from each file, because the single columns are not unique enough. This means I cannot use single-column matching solutions such as only matching file1's column 6 with file2's column 2 only, it must use the multiple specified columns above.

Thanks in advance.


Solution

  • You can still use join. The trick is to attach the two columns together so they get treated as one.

    I've done it here with sed but you could use awk or whatever you like.

    The <(..) syntax is bash-specific and means approximately "create a temporary named pipe and pass that as a file".

    join \
       -1 6  \
       -2 2  \
       -o "1.1 1.2 1.3 1.4 1.5 0 2.4"  \
       <(sed 's/\([0-9]\{9\}\) *\([0-9]\{9\}\)/\1-\2/' file1) \
       <(sed 's/\([0-9]\{9\}\) *\([0-9]\{9\}\)/\1-\2/' file2) \
     | sed 's/-/  /'
    

    This works with GNU sed (I have 4.2.2); you may need to make adjustments for other sed implementations.

    The input sed commands convert your input files to look like this:

    chr1 100736194  46   0.731   +   100735713-100736636
    

    The output sed command undoes the change.

    The output format (-o "1.1 1.2 1.3 1.4 1.5 0 2.4") gives the output you asked for:

    chr1 100736194 46 0.731 + 100735713  100736636 BMCHAS
    chr1 100736194 49 0.879 + 100735723  100736646 ATCGSG
    chr1 100736196 54 0.952 + 100735753  100736666 BDUSUS