Search code examples
bashjoinawk

Joining two files using join and awk


I have the following file called file1:

j_2,2019-04-01,100.00,,,,11
j_7,2019-04-05,-9.78,,,,11
j_1,2019-04-01,20.00,,,,22
j_4,2019-04-04,130.00,,,,22
j_8,2019-04-05,-250.00,,,,22
j_1,2019-04-01,100.00,,,,33
j_3,2019-04-01,-264.56,,,,33
j_5,2019-04-04,130.00,,,,44
j_6,2019-04-04,-12.90,,,,55
j_9,2019-04-08,100.00,,,,66

and file2

11,Accounting,VE
22,Paint,VE
33,Legal,VE
44,Office,VE
55,Maintenance,VE
66,IT,VE

I want to join file1 (on column 7) and file2 (on column 1) to create file3.

j_2,2019-04-01,100.00,,,,Accounting,VE
j_7,2019-04-05,-9.78,,,,Accounting,VE
j_1,2019-04-01,20.00,,,,Paint,VE
j_4,2019-04-04,130.00,,,,Paint,VE
j_8,2019-04-05,-250.00,,,,Paint,VE
j_1,2019-04-01,100.00,,,,Legal,VE
j_3,2019-04-01,-264.56,,,,Legal,VE
j_5,2019-04-04,130.00,,,,Office,VE
j_6,2019-04-04,-12.90,,,,Maintenance,VE
j_9,2019-04-08,100.00,,,,IT,VE

I have tried

join -t ',' -1 7 -2 1 file1 file2 | head -n 2

which results in

11,j_2,2019-04-01,100.00,,,,Accounting,VE
11,j_7,2019-04-05,-9.78,,,,Accounting,VE

Why does a 11, appear in the front and how can I run the join command to get the required result. Furthermore, I also want to be able to do this in awk. I have looked on the internet for solutions and found this but don't understand it fully to apply it to my context. Also would I be able to pass file2 and file1 as variables as I really want to be using this as part of a bash script.

awk 'BEGIN {FS=OFS=","} NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' file2 file1

Help is always much appreciated!


Solution

  • Answering OP's question about why the join column is showing up at the beginning of the line ...

    If no explicit format (-o FORMAT) is provided then join uses a default ouput format (from man join):

    Default FORMAT outputs the join field, (then) the remaining fields from FILE1, (then) the remaining fields from FILE2

    OP will need to provide an explicit format which can get tedious if there are several columns (join does not provide any sort of wildcard/range/regex designators). One example

    join -t',' -1 7 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2,2.3 file1 file2
    

    Another issue which doesn't show up in OP's sample head -2 output, but is mentioned in the ccomments, is that join requires both files to be sorted by the join column.

    In my system OP's join comand generates the following error strewn output:

    $ join -t ',' -1 7 -2 1 file1 file2
    22,j_1,2019-04-01,20.00,,,,Paint,VE
    join: file1:3: is not sorted: j_2,2019-04-01,100.00,,,,11
    33,j_1,2019-04-01,100.00,,,,Legal,VE
    44,j_5,2019-04-04,130.00,,,,Office,VE
    55,j_6,2019-04-04,-12.90,,,,Maintenance,VE
    66,j_9,2019-04-08,100.00,,,,IT,VE
    join: input is not in sorted order
    

    Pulling this all together into a single join command:

    $ join -t ',' -1 7 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2,2.3 <(sort -t ',' -k7,7 file1) <(sort -t ',' -k1,1 file2)
    j_2,2019-04-01,100.00,,,,Accounting,VE
    j_7,2019-04-05,-9.78,,,,Accounting,VE
    j_1,2019-04-01,20.00,,,,Paint,VE
    j_4,2019-04-04,130.00,,,,Paint,VE
    j_8,2019-04-05,-250.00,,,,Paint,VE
    j_1,2019-04-01,100.00,,,,Legal,VE
    j_3,2019-04-01,-264.56,,,,Legal,VE
    j_5,2019-04-04,130.00,,,,Office,VE
    j_6,2019-04-04,-12.90,,,,Maintenance,VE
    j_9,2019-04-08,100.00,,,,IT,VE
    

    NOTE: if you know for a fact that a file is already sorted by the join column then you can replace the associated <(sort -t ',' -kx,x file) with file

    Because of the hassles of having to explicitly list all output columns I'd probably opt for something in awk ...


    One awk idea:

    awk '
    BEGIN   { FS=OFS="," }
    
    FNR==NR { suffix[$1] = $2 OFS $3; next }         # 1st file: store data in array suffix[]
    
            { if ($NF in suffix) {                   # 2nd file: if last column ($NF==$7 in this case) is an index in the suffix[] array ...
                 for (i=1;i<NF;i++)                  # loop through the 1st to (NF-1)th columns ...
                     printf "%s%s", $i, OFS          # print each one to stdout then ...
                 print suffix[$NF]                   # finish off current line of output with the value stored in the array location
              }
            }
    ' file2 <(sort -t ',' -k7,7 file1)               # no need to sort file2 as we will just be doing lookups against the suffix[] array;
                                                     # sort file1 so the output is generated in the same order as OP's desired result
    

    This generates:

    j_2,2019-04-01,100.00,,,,Accounting,VE
    j_7,2019-04-05,-9.78,,,,Accounting,VE
    j_1,2019-04-01,20.00,,,,Paint,VE
    j_4,2019-04-04,130.00,,,,Paint,VE
    j_8,2019-04-05,-250.00,,,,Paint,VE
    j_1,2019-04-01,100.00,,,,Legal,VE
    j_3,2019-04-01,-264.56,,,,Legal,VE
    j_5,2019-04-04,130.00,,,,Office,VE
    j_6,2019-04-04,-12.90,,,,Maintenance,VE
    j_9,2019-04-08,100.00,,,,IT,VE