Search code examples
pythonexcelcountrycity

How can I use Python to turn country code into full name and infer the country name based on the city name on an Excel file?


I'm a beginner in Python.

Now I have 2 columns on my Excel file. One is country column and the other one is city column.

For the country column, most of the values are shown in country code and some of them are shown in country full name, while some values are U.S.A states code and less than 1% of them are blank.

For the city column, it clearly shows the full city name (not city code), while nearly 20% of them are blank.

How can I use Python to create a new column to show the full country name based on the country code and remain the same name if it shows the full country name in the country column, and show the U.S.A states code as the United States in the new column?

The tricky part is, in the country column, take CO as the example, Co can stand for Columbia and Colorado, I cannot be sure whether it's a country or a state at the beginning, but when I check the corresponding city name I can know it's a country or a state (ex: Longmont for Colorado, Bogota for Columbia). How can I avoid this issue in the new column and infer the full country name in the new column based on the corresponding city name?

I appreciate your help!


Solution

  • Explanation

    Coded the task using following logic.

    1. Process simple abbreviations such as U.S.
    2. A country length greater than 3
      1. Have Country and City
        • Find closest Country City Pair in Cities
      2. Country Only
        • Find closest country match in list of countries in two letter country codes
    3. Country length equals 3
      • Find country with 3 letter country codes
    4. Country length equals 2 (could be country or state code)
    5. Code does not exist in list of states
      • Must be country code, so look up country in two letter country codes
    6. Code does not exist in list of countries
      • Must be state code for USA, so country is United States
    7. Could be country or state code
      • Check if city with this as a state code
      • Check if city with this as a country code
      • Must be best match of these two possibilities

    Note: String matching uses fuzzy matching to allow for flexibility in spelling of names rapidfuzz library was used over fuzzywuzzy since its an order of magnitude faster

    Code

    import pandas as pd
    from rapidfuzz import fuzz
    
    def find_closest_country(country):
        ' Country with the closest name in list of countries in country code '
        ratios = [fuzz.partial_ratio(country, x) for x in alpha2.values()]
        rated_countries = [(info, r) for info, r in zip(alpha2.values(), ratios)]
        
        # Best match with shortest name
        return sorted(rated_countries, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
        
    def check_city_country(city, country):
        ' City, Country pair closest in list of cities '
        ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(country, x['country']) for x in cities]
        rated_cities = [(info, r) for info, r in zip(cities, ratios)]
        
        # Best match with shortest name
        return sorted(rated_cities, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
        
    def check_city_subregion(city, subregion):
        ' City, subresion pair closest in list of cities '
        ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(subregion, x['subcountry']) for x in cities]
        rated_cities = [(info, r) for info, r in zip(cities, ratios)]
        
        # Best match with shortest name
        return sorted(rated_cities, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
        
    def lookup(country, city):
        '''
            Finds country based upon country and city
            country - country name or country code
            city - name of city
        '''
        if country.lower() == 'u.s.':
            # Picks up common US acronym
            country = "US"
       
        if len(country) > 3:
            # Must be country since too long for abbreviation
            if city:
                # Find closest city country pair in list of cities
                city_info = check_city_country(city, country)
                if city_info:
                    return city_info[0]['country']
           
            # No city, so find closest country in list of countries (2 code abbreviations reverse lookup)
            countries = find_closest_country(country)
            if countries:
                return countries[0]
            
            return None
        elif len(country) == 3:
            # 3 letter abbreviation
            country = country.upper()
            return alpha3.get(country, None)
        
        elif len(country) == 2:
            # Two letter country abbreviation
            country = country.upper()
            if not country in states:
                # Not a state code, so lookup contry from code
                return alpha2.get(country, None)
            
            if not country in alpha2:
                # Not a country code, so must be state code for US
                return "United States of America"
            
            # Could be country of state code
            
            if city:
                # Have 2 digit code (could be country or state)
                pos_country = alpha2[country]  # possible country
                pos_state = states[country]    # possible state
                
                # check closest country with this city
                pos_countries = check_city_country(city, pos_country)
                
                # If state code, country would be United States
                pos_us = check_city_country(city, "United States")
                
                if pos_countries[1] > pos_us[1]:
                    # Provided better match as country code
                    return pos_countries[0]['country']
                else:
                    # Provided better match as state code (i.e. "United States")
                    return pos_us[0]['country']
            else:
                return alpha2[country]
                 
        else:
            return None
       
    

    Data

    # State Codes
    # https://gist.github.com/rugbyprof/76575b470b6772ce8fa0c49e23931d97
    states = {"AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas","CA":"California","CO":"Colorado","CT":"Connecticut","DE":"Delaware","FL":"Florida","GA":"Georgia","HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana","IA":"Iowa","KS":"Kansas","KY":"Kentucky","LA":"Louisiana","ME":"Maine","MD":"Maryland","MA":"Massachusetts","MI":"Michigan","MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana","NE":"Nebraska","NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico","NY":"New York","NC":"North Carolina","ND":"North Dakota","OH":"Ohio","OK":"Oklahoma","OR":"Oregon","PA":"Pennsylvania","RI":"Rhode Island","SC":"South Carolina","SD":"South Dakota","TN":"Tennessee","TX":"Texas","UT":"Utah","VT":"Vermont","VA":"Virginia","WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming"}
    
    # two letter country codes
    # https://gist.github.com/carlopires/1261951/d13ca7320a6abcd4b0aa800d351a31b54cefdff4
    alpha2 = {
        'AD': 'Andorra',
        'AE': 'United Arab Emirates',
        'AF': 'Afghanistan',
        'AG': 'Antigua & Barbuda',
        'AI': 'Anguilla',
        'AL': 'Albania',
        'AM': 'Armenia',
        'AN': 'Netherlands Antilles',
        'AO': 'Angola',
        'AQ': 'Antarctica',
        'AR': 'Argentina',
        'AS': 'American Samoa',
        'AT': 'Austria',
        'AU': 'Australia',
        'AW': 'Aruba',
        'AZ': 'Azerbaijan',
        'BA': 'Bosnia and Herzegovina',
        'BB': 'Barbados',
        'BD': 'Bangladesh',
        'BE': 'Belgium',
        'BF': 'Burkina Faso',
        'BG': 'Bulgaria',
        'BH': 'Bahrain',
        'BI': 'Burundi',
        'BJ': 'Benin',
        'BM': 'Bermuda',
        'BN': 'Brunei Darussalam',
        'BO': 'Bolivia',
        'BR': 'Brazil',
        'BS': 'Bahama',
        'BT': 'Bhutan',
        'BU': 'Burma (no longer exists)',
        'BV': 'Bouvet Island',
        'BW': 'Botswana',
        'BY': 'Belarus',
        'BZ': 'Belize',
        'CA': 'Canada',
        'CC': 'Cocos (Keeling) Islands',
        'CF': 'Central African Republic',
        'CG': 'Congo',
        'CH': 'Switzerland',
        'CI': 'Côte D\'ivoire (Ivory Coast)',
        'CK': 'Cook Iislands',
        'CL': 'Chile',
        'CM': 'Cameroon',
        'CN': 'China',
        'CO': 'Colombia',
        'CR': 'Costa Rica',
        'CS': 'Czechoslovakia (no longer exists)',
        'CU': 'Cuba',
        'CV': 'Cape Verde',
        'CX': 'Christmas Island',
        'CY': 'Cyprus',
        'CZ': 'Czech Republic',
        'DD': 'German Democratic Republic (no longer exists)',
        'DE': 'Germany',
        'DJ': 'Djibouti',
        'DK': 'Denmark',
        'DM': 'Dominica',
        'DO': 'Dominican Republic',
        'DZ': 'Algeria',
        'EC': 'Ecuador',
        'EE': 'Estonia',
        'EG': 'Egypt',
        'EH': 'Western Sahara',
        'ER': 'Eritrea',
        'ES': 'Spain',
        'ET': 'Ethiopia',
        'FI': 'Finland',
        'FJ': 'Fiji',
        'FK': 'Falkland Islands (Malvinas)',
        'FM': 'Micronesia',
        'FO': 'Faroe Islands',
        'FR': 'France',
        'FX': 'France, Metropolitan',
        'GA': 'Gabon',
        'GB': 'United Kingdom (Great Britain)',
        'GD': 'Grenada',
        'GE': 'Georgia',
        'GF': 'French Guiana',
        'GH': 'Ghana',
        'GI': 'Gibraltar',
        'GL': 'Greenland',
        'GM': 'Gambia',
        'GN': 'Guinea',
        'GP': 'Guadeloupe',
        'GQ': 'Equatorial Guinea',
        'GR': 'Greece',
        'GS': 'South Georgia and the South Sandwich Islands',
        'GT': 'Guatemala',
        'GU': 'Guam',
        'GW': 'Guinea-Bissau',
        'GY': 'Guyana',
        'HK': 'Hong Kong',
        'HM': 'Heard & McDonald Islands',
        'HN': 'Honduras',
        'HR': 'Croatia',
        'HT': 'Haiti',
        'HU': 'Hungary',
        'ID': 'Indonesia',
        'IE': 'Ireland',
        'IL': 'Israel',
        'IN': 'India',
        'IO': 'British Indian Ocean Territory',
        'IQ': 'Iraq',
        'IR': 'Islamic Republic of Iran',
        'IS': 'Iceland',
        'IT': 'Italy',
        'JM': 'Jamaica',
        'JO': 'Jordan',
        'JP': 'Japan',
        'KE': 'Kenya',
        'KG': 'Kyrgyzstan',
        'KH': 'Cambodia',
        'KI': 'Kiribati',
        'KM': 'Comoros',
        'KN': 'St. Kitts and Nevis',
        'KP': 'Korea, Democratic People\'s Republic of',
        'KR': 'Korea, Republic of',
        'KW': 'Kuwait',
        'KY': 'Cayman Islands',
        'KZ': 'Kazakhstan',
        'LA': 'Lao People\'s Democratic Republic',
        'LB': 'Lebanon',
        'LC': 'Saint Lucia',
        'LI': 'Liechtenstein',
        'LK': 'Sri Lanka',
        'LR': 'Liberia',
        'LS': 'Lesotho',
        'LT': 'Lithuania',
        'LU': 'Luxembourg',
        'LV': 'Latvia',
        'LY': 'Libyan Arab Jamahiriya',
        'MA': 'Morocco',
        'MC': 'Monaco',
        'MD': 'Moldova, Republic of',
        'MG': 'Madagascar',
        'MH': 'Marshall Islands',
        'ML': 'Mali',
        'MN': 'Mongolia',
        'MM': 'Myanmar',
        'MO': 'Macau',
        'MP': 'Northern Mariana Islands',
        'MQ': 'Martinique',
        'MR': 'Mauritania',
        'MS': 'Monserrat',
        'MT': 'Malta',
        'MU': 'Mauritius',
        'MV': 'Maldives',
        'MW': 'Malawi',
        'MX': 'Mexico',
        'MY': 'Malaysia',
        'MZ': 'Mozambique',
        'NA': 'Namibia',
        'NC': 'New Caledonia',
        'NE': 'Niger',
        'NF': 'Norfolk Island',
        'NG': 'Nigeria',
        'NI': 'Nicaragua',
        'NL': 'Netherlands',
        'NO': 'Norway',
        'NP': 'Nepal',
        'NR': 'Nauru',
        'NT': 'Neutral Zone (no longer exists)',
        'NU': 'Niue',
        'NZ': 'New Zealand',
        'OM': 'Oman',
        'PA': 'Panama',
        'PE': 'Peru',
        'PF': 'French Polynesia',
        'PG': 'Papua New Guinea',
        'PH': 'Philippines',
        'PK': 'Pakistan',
        'PL': 'Poland',
        'PM': 'St. Pierre & Miquelon',
        'PN': 'Pitcairn',
        'PR': 'Puerto Rico',
        'PT': 'Portugal',
        'PW': 'Palau',
        'PY': 'Paraguay',
        'QA': 'Qatar',
        'RE': 'Réunion',
        'RO': 'Romania',
        'RU': 'Russian Federation',
        'RW': 'Rwanda',
        'SA': 'Saudi Arabia',
        'SB': 'Solomon Islands',
        'SC': 'Seychelles',
        'SD': 'Sudan',
        'SE': 'Sweden',
        'SG': 'Singapore',
        'SH': 'St. Helena',
        'SI': 'Slovenia',
        'SJ': 'Svalbard & Jan Mayen Islands',
        'SK': 'Slovakia',
        'SL': 'Sierra Leone',
        'SM': 'San Marino',
        'SN': 'Senegal',
        'SO': 'Somalia',
        'SR': 'Suriname',
        'ST': 'Sao Tome & Principe',
        'SU': 'Union of Soviet Socialist Republics (no longer exists)',
        'SV': 'El Salvador',
        'SY': 'Syrian Arab Republic',
        'SZ': 'Swaziland',
        'TC': 'Turks & Caicos Islands',
        'TD': 'Chad',
        'TF': 'French Southern Territories',
        'TG': 'Togo',
        'TH': 'Thailand',
        'TJ': 'Tajikistan',
        'TK': 'Tokelau',
        'TM': 'Turkmenistan',
        'TN': 'Tunisia',
        'TO': 'Tonga',
        'TP': 'East Timor',
        'TR': 'Turkey',
        'TT': 'Trinidad & Tobago',
        'TV': 'Tuvalu',
        'TW': 'Taiwan, Province of China',
        'TZ': 'Tanzania, United Republic of',
        'UA': 'Ukraine',
        'UG': 'Uganda',
        'UM': 'United States Minor Outlying Islands',
        'US': 'United States of America',
        'UY': 'Uruguay',
        'UZ': 'Uzbekistan',
        'VA': 'Vatican City State (Holy See)',
        'VC': 'St. Vincent & the Grenadines',
        'VE': 'Venezuela',
        'VG': 'British Virgin Islands',
        'VI': 'United States Virgin Islands',
        'VN': 'Viet Nam',
        'VU': 'Vanuatu',
        'WF': 'Wallis & Futuna Islands',
        'WS': 'Samoa',
        'YD': 'Democratic Yemen (no longer exists)',
        'YE': 'Yemen',
        'YT': 'Mayotte',
        'YU': 'Yugoslavia',
        'ZA': 'South Africa',
        'ZM': 'Zambia',
        'ZR': 'Zaire',
        'ZW': 'Zimbabwe',
        'ZZ': 'Unknown or unspecified country',
    }
    
    # Three letter codes
    #https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Uses_and_applications
    alpha3 = """ABW  Aruba
    AFG  Afghanistan
    AGO  Angola
    AIA  Anguilla
    ALA  Åland Islands
    ALB  Albania
    AND  Andorra
    ARE  United Arab Emirates
    ARG  Argentina
    ARM  Armenia
    ASM  American Samoa
    ATA  Antarctica
    ATF  French Southern Territories
    ATG  Antigua and Barbuda
    AUS  Australia
    AUT  Austria
    AZE  Azerbaijan
    BDI  Burundi
    BEL  Belgium
    BEN  Benin
    BES  Bonaire, Sint Eustatius and Saba
    BFA  Burkina Faso
    BGD  Bangladesh
    BGR  Bulgaria
    BHR  Bahrain
    BHS  Bahamas
    BIH  Bosnia and Herzegovina
    BLM  Saint Barthélemy
    BLR  Belarus
    BLZ  Belize
    BMU  Bermuda
    BOL  Bolivia (Plurinational State of)
    BRA  Brazil
    BRB  Barbados
    BRN  Brunei Darussalam
    BTN  Bhutan
    BVT  Bouvet Island
    BWA  Botswana
    CAF  Central African Republic
    CAN  Canada
    CCK  Cocos (Keeling) Islands
    CHE  Switzerland
    CHL  Chile
    CHN  China
    CIV  Côte d'Ivoire
    CMR  Cameroon
    COD  Congo, Democratic Republic of the
    COG  Congo
    COK  Cook Islands
    COL  Colombia
    COM  Comoros
    CPV  Cabo Verde
    CRI  Costa Rica
    CUB  Cuba
    CUW  Curaçao
    CXR  Christmas Island
    CYM  Cayman Islands
    CYP  Cyprus
    CZE  Czechia
    DEU  Germany
    DJI  Djibouti
    DMA  Dominica
    DNK  Denmark
    DOM  Dominican Republic
    DZA  Algeria
    ECU  Ecuador
    EGY  Egypt
    ERI  Eritrea
    ESH  Western Sahara
    ESP  Spain
    EST  Estonia
    ETH  Ethiopia
    FIN  Finland
    FJI  Fiji
    FLK  Falkland Islands (Malvinas)
    FRA  France
    FRO  Faroe Islands
    FSM  Micronesia (Federated States of)
    GAB  Gabon
    GBR  United Kingdom of Great Britain and Northern Ireland
    GEO  Georgia
    GGY  Guernsey
    GHA  Ghana
    GIB  Gibraltar
    GIN  Guinea
    GLP  Guadeloupe
    GMB  Gambia
    GNB  Guinea-Bissau
    GNQ  Equatorial Guinea
    GRC  Greece
    GRD  Grenada
    GRL  Greenland
    GTM  Guatemala
    GUF  French Guiana
    GUM  Guam
    GUY  Guyana
    HKG  Hong Kong
    HMD  Heard Island and McDonald Islands
    HND  Honduras
    HRV  Croatia
    HTI  Haiti
    HUN  Hungary
    IDN  Indonesia
    IMN  Isle of Man
    IND  India
    IOT  British Indian Ocean Territory
    IRL  Ireland
    IRN  Iran (Islamic Republic of)
    IRQ  Iraq
    ISL  Iceland
    ISR  Israel
    ITA  Italy
    JAM  Jamaica
    JEY  Jersey
    JOR  Jordan
    JPN  Japan
    KAZ  Kazakhstan
    KEN  Kenya
    KGZ  Kyrgyzstan
    KHM  Cambodia
    KIR  Kiribati
    KNA  Saint Kitts and Nevis
    KOR  Korea, Republic of
    KWT  Kuwait
    LAO  Lao People's Democratic Republic
    LBN  Lebanon
    LBR  Liberia
    LBY  Libya
    LCA  Saint Lucia
    LIE  Liechtenstein
    LKA  Sri Lanka
    LSO  Lesotho
    LTU  Lithuania
    LUX  Luxembourg
    LVA  Latvia
    MAC  Macao
    MAF  Saint Martin (French part)
    MAR  Morocco
    MCO  Monaco
    MDA  Moldova, Republic of
    MDG  Madagascar
    MDV  Maldives
    MEX  Mexico
    MHL  Marshall Islands
    MKD  North Macedonia
    MLI  Mali
    MLT  Malta
    MMR  Myanmar
    MNE  Montenegro
    MNG  Mongolia
    MNP  Northern Mariana Islands
    MOZ  Mozambique
    MRT  Mauritania
    MSR  Montserrat
    MTQ  Martinique
    MUS  Mauritius
    MWI  Malawi
    MYS  Malaysia
    MYT  Mayotte
    NAM  Namibia
    NCL  New Caledonia
    NER  Niger
    NFK  Norfolk Island
    NGA  Nigeria
    NIC  Nicaragua
    NIU  Niue
    NLD  Netherlands
    NOR  Norway
    NPL  Nepal
    NRU  Nauru
    NZL  New Zealand
    OMN  Oman
    PAK  Pakistan
    PAN  Panama
    PCN  Pitcairn
    PER  Peru
    PHL  Philippines
    PLW  Palau
    PNG  Papua New Guinea
    POL  Poland
    PRI  Puerto Rico
    PRK  Korea (Democratic People's Republic of)
    PRT  Portugal
    PRY  Paraguay
    PSE  Palestine, State of
    PYF  French Polynesia
    QAT  Qatar
    REU  Réunion
    ROU  Romania
    RUS  Russian Federation
    RWA  Rwanda
    SAU  Saudi Arabia
    SDN  Sudan
    SEN  Senegal
    SGP  Singapore
    SGS  South Georgia and the South Sandwich Islands
    SHN  Saint Helena, Ascension and Tristan da Cunha
    SJM  Svalbard and Jan Mayen
    SLB  Solomon Islands
    SLE  Sierra Leone
    SLV  El Salvador
    SMR  San Marino
    SOM  Somalia
    SPM  Saint Pierre and Miquelon
    SRB  Serbia
    SSD  South Sudan
    STP  Sao Tome and Principe
    SUR  Suriname
    SVK  Slovakia
    SVN  Slovenia
    SWE  Sweden
    SWZ  Eswatini
    SXM  Sint Maarten (Dutch part)
    SYC  Seychelles
    SYR  Syrian Arab Republic
    TCA  Turks and Caicos Islands
    TCD  Chad
    TGO  Togo
    THA  Thailand
    TJK  Tajikistan
    TKL  Tokelau
    TKM  Turkmenistan
    TLS  Timor-Leste
    TON  Tonga
    TTO  Trinidad and Tobago
    TUN  Tunisia
    TUR  Turkey
    TUV  Tuvalu
    TWN  Taiwan, Province of China
    TZA  Tanzania, United Republic of
    UGA  Uganda
    UKR  Ukraine
    UMI  United States Minor Outlying Islands
    URY  Uruguay
    USA  United States of America
    UZB  Uzbekistan
    VAT  Holy See
    VCT  Saint Vincent and the Grenadines
    VEN  Venezuela (Bolivarian Republic of)
    VGB  Virgin Islands (British)
    VIR  Virgin Islands (U.S.)
    VNM  Viet Nam
    VUT  Vanuatu
    WLF  Wallis and Futuna
    WSM  Samoa
    YEM  Yemen
    ZAF  South Africa
    ZMB  Zambia
    ZWE  Zimbabwe"""
    
    # Convert to dictionary
    alpha3 = dict(tuple(re.split(r" {2,}", s)) for s in alpha3.split('\n'))
    
    # List of World Cities & Country
    # cities https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv
    # Online CSV File
    
    import csv
    import urllib.request
    import io
    
    def csv_import(url):
        url_open = urllib.request.urlopen(url)
        csvfile = csv.DictReader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',') 
        return csvfile
    
    url = 'https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv'
    
    cities = csv_import(url)
    

    Test

    Excel File (Input)

    country city
    u.s.    
    DZ  
    AS  
    co  Longmont
    co  Bogota
    AL  
    AL  Huntsville
    usa 
    AFG 
    BLR Minsk
    AUS 
    united states   
    Korea   seoul
    Korea   Pyongyang
    

    Test Code

    df = pd.read_excel('country_test.xlsx') # Load Excel File
    df.fillna('', inplace=True)
    
    # Get name of country based upon country and city
    df['country_'] = df.apply(lambda row: lookup(row['country'], row['city']), axis = 1)
    

    Resulting Dataframe

           country        city                  country_
    0            u.s.              United States of America
    1              DZ                               Algeria
    2              AS                        American Samoa
    3              co    Longmont             United States
    4              co      Bogota                  Colombia
    5              AL                               Albania
    6              AL  Huntsville             United States
    7             usa              United States of America
    8             AFG                           Afghanistan
    9             BLR       Minsk                   Belarus
    10            AUS                             Australia
    11  united states              United States of America
    12          Korea       seoul               South Korea
    13          Korea   Pyongyang               North Korea