Search code examples
perlawkfilemaker

How to use AWK to solve this 2 Table Data Join?


I have 2 data tables as shown (they are 2 x tab-delimited files). I'm trying to populate the Table-2 Country column with the corresponding country from Table-1. Need to "join" from info in Table-2's Firstname field.

2 INPUT TABLES & DESIRED RESULT

What's the best approach here, given the complexity of the data in the Table-2, Firstname column? Would other Mac tools be better to use than AWK e.g. Excel formulae, Perl, Filemaker etc?

TABLE1 (Input):

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa Clara Cuba    CU
Varadero    Cuba    CU

TABLE2 (Input):

Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

TABLE2 (Result):

Firstname   Country
Fred, Havana  Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

============ Here is debugging info in answer to Ed's Qs below:

awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table3.txt | cat -v

    1<city_ascii  country iso2><><>
    1<Mavinga Angola  AO><><>
    1<Menongue    Angola  AO><><>
    1<Mucusso Angola  AO><><>
    1<Guines  Cuba    CU><><>
    1<Havana  Cuba    CU><><>
    1<Holguin Cuba    CU><><>
    1<Las Tunas   Cuba    CU><><>
    1<Manzanillo  Cuba    CU><><>
    1<Matanzas    Cuba    CU><><>
    1<Moron   Cuba    CU><><>
    1<Santa Clara Cuba    CU><><>
    1<Varadero    Cuba    CU><><>

    ==============
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table4.txt | cat -v

    1<Firstname><><>
    1<Fred, Havana><><>
    1<James, (Varadero, Cuba)><><>
    1<Jack (Cuba)><><>
    1<Harry Varadero, Cuba><><>
    1<Josh Cuba><><>
    1<Gary, Mavinga & Other, Angola><><>
    1<Jamie, (Angola)><><>

    ===============
    cat -v tst.awk

    BEGIN { FS=OFS="\t" }
    NR==FNR {
        map[$1] = $2
        map[$2] = $2
        next
    }
    FNR==1 {
        print
        FS=" "
        next
    }
    {
        orig = $0
        country = ""
        gsub(/[^[:alpha:]]/," ")
        for (i=NF; i>0; i--) {
            if ($i in map) {
                country = map[$i]
                break
            }
        }
        print orig, country
    }

    ===============
    awk -f tst.awk Table3.txt Table4.txt >output.txt

    Firstname
    Fred, Havana    
    James, (Varadero, Cuba) 
    Jack (Cuba) 
    Harry Varadero, Cuba    
    Josh Cuba   
    Gary, Mavinga & Other, Angola   
    Jamie, (Angola) 

    ================
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' output.txt | cat -v

    1<Firstname><><>
    2<Fred, Havana><><>
    2<James, (Varadero, Cuba)><><>
    2<Jack (Cuba)><><>
    2<Harry Varadero, Cuba><><>
    2<Josh Cuba><><>
    2<Gary, Mavinga & Other, Angola><><>
    2<Jamie, (Angola)><><>

Solution

  • It sounds like this might be what you're looking for:

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    NR==FNR {
        map[$1] = $2
        map[$2] = $2
        next
    }
    FNR==1 {
        print
        FS=" "
        next
    }
    {
        orig = $0
        country = ""
        gsub(/[^[:alpha:]]/," ")
        for (i=NF; i>0; i--) {
            if ($i in map) {
                country = map[$i]
                break
            }
        }
        print orig, country
    }
    
    $ awk -f tst.awk file1 file2
    Firstname       Country
    Fred, Havana    Cuba
    James, (Varadero, Cuba) Cuba
    Jack (Cuba)     Cuba
    Harry Varadero, Cuba    Cuba
    Josh Cuba       Cuba
    Gary, Mavinga & Other, Angola   Angola
    Jamie, (Angola) Angola