I have two tsv files. The first contains a list of airport codes combined with a year. The second the average daily landings of different types of aircraft.
I would like to create a third file that joins the data from the two files even if there is no match between the data in file1 with data in file2. The equivalent of an sql full outer join as I understand it.
The unpopulated cells of the combined output should be blank.
file1 - abbrvtd.tsv
Abbrvtd
LAM1999
LAM1998
LAM1997
LAM1996
LAM1995
LAM1994
TAW1999
TAW1998
TAW1997
TAW1996
TAW1995
TAW1994
DRN1999
DRN1998
DRN1997
DRN1996
DRN1995
DRN1994
file2 - landings.tsv
Abbrvtd Airport Codes Year Commercial Passenger Private Helicopter
LAM1999 LAM 1999 71.8 59.4 65.2 40.7
LAM1998 LAM 1998 71.8 58.8 64.7 40.5
LAM1997 LAM 1997 71.6 58.3 64.1 40.5
LAM1994 LAM 1994 70.8 55.8 63.1 70.8
TAW1999 TAW 1999 40.7 62.4 71.4 65.2
TAW1997 TAW 1997 40.5 60.9 70.8 64.1
TAW1996 TAW 1996 40.2 60.4 70.4 63.7
TAW1995 TAW 1995 40.2 59.9 70.3 60.9
TAW1994 TAW 1994 39.9 59.1 70.2 60.4
DRN1999 DRN 1999 39.7 58.7 40.5 59.9
DRN1994 DRN 1994 30.2 55.8 59.9 39.6
output desired:
Abbrvtd Airport Codes Year Commercial Passenger Private Helicopter
LAM1999 LAM 1999 71.8 59.4 65.2 40.7
LAM1998 LAM 1998 71.8 58.8 64.7 40.5
LAM1997 LAM 1997 71.6 58.3 64.1 40.5
LAM1996
LAM1995
LAM1994 LAM 1994 70.8 55.8 63.1 70.8
TAW1999 TAW 1999 40.7 62.4 71.4 65.2
TAW1998
TAW1997 TAW 1997 40.5 60.9 70.8 64.1
TAW1996 TAW 1996 40.2 60.4 70.4 63.7
TAW1995 TAW 1995 40.2 59.9 70.3 60.9
TAW1994 TAW 1994 39.9 59.1 70.2 60.4
DRN1999 DRN 1999 39.7 58.7 40.5 59.9
DRN1998
DRN1997
DRN1996
DRN1995
DRN1994 DRN 1994 30.2 55.8 59.9 39.6
The closest I've come is with the code below. but it returns an output that looks like file2-landings.tsv
I have tried using various iterations of awk and now trying with join:
join -a 1 -a 2 -e NULL -t$"\t" -1 1 -2 1 -o 2.1,2.2,2.3,2.4,2.5,2.6,2.7 abbrvtd.tsv landings.tsv
Thank you.
RESOLVED - @EdMorton solution generates the required output
join
requires sorted input but your input isn't sorted so at a minimum you'd need to add that step before using join
on it. You could do something like this:
$ { head -n 1 landings.tsv;
join -t $'\t' -a 1 -a 2 -j 1 \
<(tail -n +2 abbrvtd.tsv | sort) \
<(tail -n +2 landings.tsv | sort);
}
but then you'd still need to pad the lines from abbrvtd.tsv with tabs to create empty fields so you end up with:
$ { head -n 1 landings.tsv;
join -t $'\t' -a 1 -a 2 -j 1 \
<(tail -n +2 abbrvtd.tsv | sort) \
<(tail -n +2 landings.tsv | sort);
} | awk 'BEGIN{FS=OFS="\t"} NR==1{nf=NF} {$nf=$nf} 1'
Abbrvted Airport Codes Year Commercial Passenger Private Helicopter
DRN1994 DRN 1994 30.2 55.8 59.9 39.6
DRN1995
DRN1996
DRN1997
DRN1998
DRN1999 DRN 1999 39.7 58.7 40.5 59.9
LAM1994 LAM 1994 70.8 55.8 63.1 70.8
LAM1995
LAM1996
LAM1997 LAM 1997 71.6 58.3 64.1 40.5
LAM1998 LAM 1998 71.8 58.8 64.7 40.5
LAM1999 LAM 1999 71.8 59.4 65.2 40.7
TAW1994 TAW 1994 39.9 59.1 70.2 60.4
TAW1995 TAW 1995 40.2 59.9 70.3 60.9
TAW1996 TAW 1996 40.2 60.4 70.4 63.7
TAW1997 TAW 1997 40.5 60.9 70.8 64.1
TAW1998
TAW1999 TAW 1999 40.7 62.4 71.4 65.2
Alternatively, using any awk, sort, and cut and the Decorate-Sort-Undecorate idiom:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { FS=OFS="\t" }
NR == FNR {
if ( FNR == 1 ) {
print 1, $0
nf = NF
}
else {
vals[$1] = $0
}
next
}
FNR > 1 {
if ( $1 in vals ) {
$0 = vals[$1]
delete vals[$1]
}
$nf = $nf
print 2, $0
}
END {
for ( key in vals ) {
print 2, vals[key]
}
}
' landings.tsv abbrvtd.tsv |
sort -k1,1n -k2 |
cut -f 2-
$ ./tst.sh
Abbrvted Airport Codes Year Commercial Passenger Private Helicopter
DRN1994 DRN 1994 30.2 55.8 59.9 39.6
DRN1995
DRN1996
DRN1997
DRN1998
DRN1999 DRN 1999 39.7 58.7 40.5 59.9
LAM1994 LAM 1994 70.8 55.8 63.1 70.8
LAM1995
LAM1996
LAM1997 LAM 1997 71.6 58.3 64.1 40.5
LAM1998 LAM 1998 71.8 58.8 64.7 40.5
LAM1999 LAM 1999 71.8 59.4 65.2 40.7
TAW1994 TAW 1994 39.9 59.1 70.2 60.4
TAW1995 TAW 1995 40.2 59.9 70.3 60.9
TAW1996 TAW 1996 40.2 60.4 70.4 63.7
TAW1997 TAW 1997 40.5 60.9 70.8 64.1
TAW1998
TAW1999 TAW 1999 40.7 62.4 71.4 65.2
So you can see that the added lines from abbrvtd.tsv now have the same number of tab-separated fields as the landings.tsv file did:
$ ./tst.sh | cat -t
Abbrvted^IAirport Codes^IYear^ICommercial^IPassenger^IPrivate^IHelicopter
DRN1994^IDRN^I1994^I30.2^I55.8^I59.9^I39.6
DRN1995^I^I^I^I^I^I
DRN1996^I^I^I^I^I^I
DRN1997^I^I^I^I^I^I
DRN1998^I^I^I^I^I^I
DRN1999^IDRN^I1999^I39.7^I58.7^I40.5^I59.9
LAM1994^ILAM^I1994^I70.8^I55.8^I63.1^I70.8
LAM1995^I^I^I^I^I^I
LAM1996^I^I^I^I^I^I
LAM1997^ILAM^I1997^I71.6^I58.3^I64.1^I40.5
LAM1998^ILAM^I1998^I71.8^I58.8^I64.7^I40.5
LAM1999^ILAM^I1999^I71.8^I59.4^I65.2^I40.7
TAW1994^ITAW^I1994^I39.9^I59.1^I70.2^I60.4
TAW1995^ITAW^I1995^I40.2^I59.9^I70.3^I60.9
TAW1996^ITAW^I1996^I40.2^I60.4^I70.4^I63.7
TAW1997^ITAW^I1997^I40.5^I60.9^I70.8^I64.1
TAW1998^I^I^I^I^I^I
TAW1999^ITAW^I1999^I40.7^I62.4^I71.4^I65.2
Aside: If you really did have a different first column name in each file and needed the name from abbrvtd.tsv in the output as shown in the original version of the example in your question then you could do this:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { FS=OFS="\t" }
NR == FNR {
if ( FNR == 1 ) {
hdr = $0
nf = NF
}
else {
vals[$1] = $0
}
next
}
{
if ( FNR == 1 ) {
key = $1
$0 = hdr
$1 = key
}
else if ( $1 in vals ) {
$0 = vals[$1]
delete vals[$1]
}
$nf = $nf
print (FNR > 1), $0
}
END {
for ( key in vals ) {
print 2, vals[key]
}
}
' landings.tsv abbrvtd.tsv |
sort -k1,1n -k2 |
cut -f 2-
$ ./tst.sh
Abbrvtd Airport Codes Year Commercial Passenger Private Helicopter
DRN1994 DRN 1994 30.2 55.8 59.9 39.6
DRN1995
DRN1996
DRN1997
DRN1998
DRN1999 DRN 1999 39.7 58.7 40.5 59.9
LAM1994 LAM 1994 70.8 55.8 63.1 70.8
LAM1995
LAM1996
LAM1997 LAM 1997 71.6 58.3 64.1 40.5
LAM1998 LAM 1998 71.8 58.8 64.7 40.5
LAM1999 LAM 1999 71.8 59.4 65.2 40.7
TAW1994 TAW 1994 39.9 59.1 70.2 60.4
TAW1995 TAW 1995 40.2 59.9 70.3 60.9
TAW1996 TAW 1996 40.2 60.4 70.4 63.7
TAW1997 TAW 1997 40.5 60.9 70.8 64.1
TAW1998
TAW1999 TAW 1999 40.7 62.4 71.4 65.2