Search code examples
linuxjoinawk

Full join - bash


I have two tsv files. The first contains a list of airport codes combined with a year. The second the average daily landings of different types of aircraft.

I would like to create a third file that joins the data from the two files even if there is no match between the data in file1 with data in file2. The equivalent of an sql full outer join as I understand it.

The unpopulated cells of the combined output should be blank.

file1 - abbrvtd.tsv

Abbrvtd
LAM1999
LAM1998
LAM1997
LAM1996
LAM1995
LAM1994
TAW1999
TAW1998
TAW1997
TAW1996
TAW1995
TAW1994
DRN1999
DRN1998
DRN1997
DRN1996
DRN1995
DRN1994

file2 - landings.tsv

Abbrvtd Airport Codes   Year    Commercial  Passenger   Private Helicopter
LAM1999 LAM 1999    71.8    59.4    65.2    40.7
LAM1998 LAM 1998    71.8    58.8    64.7    40.5
LAM1997 LAM 1997    71.6    58.3    64.1    40.5
LAM1994 LAM 1994    70.8    55.8    63.1    70.8
TAW1999 TAW 1999    40.7    62.4    71.4    65.2
TAW1997 TAW 1997    40.5    60.9    70.8    64.1
TAW1996 TAW 1996    40.2    60.4    70.4    63.7
TAW1995 TAW 1995    40.2    59.9    70.3    60.9
TAW1994 TAW 1994    39.9    59.1    70.2    60.4
DRN1999 DRN 1999    39.7    58.7    40.5    59.9
DRN1994 DRN 1994    30.2    55.8    59.9    39.6

output desired:

Abbrvtd Airport Codes   Year    Commercial  Passenger   Private Helicopter
LAM1999 LAM 1999    71.8    59.4    65.2    40.7
LAM1998 LAM 1998    71.8    58.8    64.7    40.5
LAM1997 LAM 1997    71.6    58.3    64.1    40.5
LAM1996                     
LAM1995                     
LAM1994 LAM 1994    70.8    55.8    63.1    70.8
TAW1999 TAW 1999    40.7    62.4    71.4    65.2
TAW1998                     
TAW1997 TAW 1997    40.5    60.9    70.8    64.1
TAW1996 TAW 1996    40.2    60.4    70.4    63.7
TAW1995 TAW 1995    40.2    59.9    70.3    60.9
TAW1994 TAW 1994    39.9    59.1    70.2    60.4
DRN1999 DRN 1999    39.7    58.7    40.5    59.9
DRN1998                     
DRN1997                     
DRN1996                     
DRN1995                     
DRN1994 DRN 1994    30.2    55.8    59.9    39.6

The closest I've come is with the code below. but it returns an output that looks like file2-landings.tsv

I have tried using various iterations of awk and now trying with join:

join -a 1 -a 2 -e NULL -t$"\t" -1 1 -2 1 -o 2.1,2.2,2.3,2.4,2.5,2.6,2.7 abbrvtd.tsv landings.tsv

Thank you.

RESOLVED - @EdMorton solution generates the required output


Solution

  • join requires sorted input but your input isn't sorted so at a minimum you'd need to add that step before using join on it. You could do something like this:

    $ { head -n 1 landings.tsv;
        join -t $'\t' -a 1 -a 2 -j 1 \
            <(tail -n +2 abbrvtd.tsv | sort) \
            <(tail -n +2 landings.tsv | sort);
      }
    

    but then you'd still need to pad the lines from abbrvtd.tsv with tabs to create empty fields so you end up with:

    $ { head -n 1 landings.tsv;
        join -t $'\t' -a 1 -a 2 -j 1 \
            <(tail -n +2 abbrvtd.tsv | sort) \
            <(tail -n +2 landings.tsv | sort);
      } | awk 'BEGIN{FS=OFS="\t"} NR==1{nf=NF} {$nf=$nf} 1'
    Abbrvted        Airport Codes   Year    Commercial      Passenger       Private Helicopter
    DRN1994 DRN     1994    30.2    55.8    59.9    39.6
    DRN1995
    DRN1996
    DRN1997
    DRN1998
    DRN1999 DRN     1999    39.7    58.7    40.5    59.9
    LAM1994 LAM     1994    70.8    55.8    63.1    70.8
    LAM1995
    LAM1996
    LAM1997 LAM     1997    71.6    58.3    64.1    40.5
    LAM1998 LAM     1998    71.8    58.8    64.7    40.5
    LAM1999 LAM     1999    71.8    59.4    65.2    40.7
    TAW1994 TAW     1994    39.9    59.1    70.2    60.4
    TAW1995 TAW     1995    40.2    59.9    70.3    60.9
    TAW1996 TAW     1996    40.2    60.4    70.4    63.7
    TAW1997 TAW     1997    40.5    60.9    70.8    64.1
    TAW1998
    TAW1999 TAW     1999    40.7    62.4    71.4    65.2
    

    Alternatively, using any awk, sort, and cut and the Decorate-Sort-Undecorate idiom:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk '
        BEGIN { FS=OFS="\t" }
        NR == FNR {
            if ( FNR == 1 ) {
                print 1, $0
                nf = NF
            }
            else {
                vals[$1] = $0
            }
            next
        }
        FNR > 1 {
            if ( $1 in vals ) {
                $0 = vals[$1]
                delete vals[$1]
            }
            $nf = $nf
            print 2, $0
        }
        END {
            for ( key in vals ) {
                print 2, vals[key]
            }
        }
    ' landings.tsv abbrvtd.tsv |
    sort -k1,1n -k2 |
    cut -f 2-
    

    $ ./tst.sh
    Abbrvted        Airport Codes   Year    Commercial      Passenger       Private Helicopter
    DRN1994 DRN     1994    30.2    55.8    59.9    39.6
    DRN1995
    DRN1996
    DRN1997
    DRN1998
    DRN1999 DRN     1999    39.7    58.7    40.5    59.9
    LAM1994 LAM     1994    70.8    55.8    63.1    70.8
    LAM1995
    LAM1996
    LAM1997 LAM     1997    71.6    58.3    64.1    40.5
    LAM1998 LAM     1998    71.8    58.8    64.7    40.5
    LAM1999 LAM     1999    71.8    59.4    65.2    40.7
    TAW1994 TAW     1994    39.9    59.1    70.2    60.4
    TAW1995 TAW     1995    40.2    59.9    70.3    60.9
    TAW1996 TAW     1996    40.2    60.4    70.4    63.7
    TAW1997 TAW     1997    40.5    60.9    70.8    64.1
    TAW1998
    TAW1999 TAW     1999    40.7    62.4    71.4    65.2
    

    So you can see that the added lines from abbrvtd.tsv now have the same number of tab-separated fields as the landings.tsv file did:

    $ ./tst.sh | cat -t
    Abbrvted^IAirport Codes^IYear^ICommercial^IPassenger^IPrivate^IHelicopter
    DRN1994^IDRN^I1994^I30.2^I55.8^I59.9^I39.6
    DRN1995^I^I^I^I^I^I
    DRN1996^I^I^I^I^I^I
    DRN1997^I^I^I^I^I^I
    DRN1998^I^I^I^I^I^I
    DRN1999^IDRN^I1999^I39.7^I58.7^I40.5^I59.9
    LAM1994^ILAM^I1994^I70.8^I55.8^I63.1^I70.8
    LAM1995^I^I^I^I^I^I
    LAM1996^I^I^I^I^I^I
    LAM1997^ILAM^I1997^I71.6^I58.3^I64.1^I40.5
    LAM1998^ILAM^I1998^I71.8^I58.8^I64.7^I40.5
    LAM1999^ILAM^I1999^I71.8^I59.4^I65.2^I40.7
    TAW1994^ITAW^I1994^I39.9^I59.1^I70.2^I60.4
    TAW1995^ITAW^I1995^I40.2^I59.9^I70.3^I60.9
    TAW1996^ITAW^I1996^I40.2^I60.4^I70.4^I63.7
    TAW1997^ITAW^I1997^I40.5^I60.9^I70.8^I64.1
    TAW1998^I^I^I^I^I^I
    TAW1999^ITAW^I1999^I40.7^I62.4^I71.4^I65.2
    

    Aside: If you really did have a different first column name in each file and needed the name from abbrvtd.tsv in the output as shown in the original version of the example in your question then you could do this:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk '
        BEGIN { FS=OFS="\t" }
        NR == FNR {
            if ( FNR == 1 ) {
                hdr = $0
                nf = NF
            }
            else {
                vals[$1] = $0
            }
            next
        }
        {
            if ( FNR == 1 ) {
                key = $1
                $0 = hdr
                $1 = key
            }
            else if ( $1 in vals ) {
                $0 = vals[$1]
                delete vals[$1]
            }
            $nf = $nf
            print (FNR > 1), $0
        }
        END {
            for ( key in vals ) {
                print 2, vals[key]
            }
        }
    ' landings.tsv abbrvtd.tsv |
    sort -k1,1n -k2 |
    cut -f 2-
    

    $ ./tst.sh
    Abbrvtd Airport Codes   Year    Commercial      Passenger       Private Helicopter
    DRN1994 DRN     1994    30.2    55.8    59.9    39.6
    DRN1995
    DRN1996
    DRN1997
    DRN1998
    DRN1999 DRN     1999    39.7    58.7    40.5    59.9
    LAM1994 LAM     1994    70.8    55.8    63.1    70.8
    LAM1995
    LAM1996
    LAM1997 LAM     1997    71.6    58.3    64.1    40.5
    LAM1998 LAM     1998    71.8    58.8    64.7    40.5
    LAM1999 LAM     1999    71.8    59.4    65.2    40.7
    TAW1994 TAW     1994    39.9    59.1    70.2    60.4
    TAW1995 TAW     1995    40.2    59.9    70.3    60.9
    TAW1996 TAW     1996    40.2    60.4    70.4    63.7
    TAW1997 TAW     1997    40.5    60.9    70.8    64.1
    TAW1998
    TAW1999 TAW     1999    40.7    62.4    71.4    65.2