Search code examples
pythonlistcomparisoncountrypycountry-convert

Comparing dictionary of countries with a column of countries


I have a dictionary/country list which i took from pycountry (list(pycountry.countries)

I am importing a column from excel called CountryNames

i thought by creating 2 for loops, i could then compare to see if my CountryNames is found in the PyCountry list

 import pandas as pd
from langdetect import detect 
from googletrans import Translator
import pycountry

   df =  pd.read_excel(r"V1.xlsx")
    input_countries = df['CountryName']
        
        # Ensure the column 'CountryName' exists in the CSV
    if 'Sold To Customer Country' not in df.columns:
            raise ValueError("CountryName' column")
        
    t = list(pycountry.countries)
    
    # loop through column of countries
    for name in input_countries:
        # loop through country name (pycountry)
         for pycountry.name in t:
              # comparison - Show country names 
              if input_countries in t:
                   input_countries
              else:
                   'invalid'enter code here

I want to check to see if a countryName (from the excel column) is in the pycountry list


Solution

  • You need two sets. One is a set formed from all entries in the "CountryName" column in the spreadsheet. The other is a combination of all known country names both in their short forms and "official" formats.

    Then you just need a simple loop to validate your data.

    Something like this should suffice:

    import pycountry
    import pandas as pd
    from pathlib import Path
    
    FILENAME = "V1.xlsx"
    
    def all_names() -> set[str]:
        # all Country objects have a "name" attribute
        return {country.name for country in pycountry.countries} # type: ignore
    
    def all_official_names() -> set[str]:
        s: set[str] = set()
        for country in pycountry.countries:
            # not all Country objects have an "official_name" attribute
            try:
                s.add(country.official_name) # type: ignore
            except AttributeError:
                pass
        return s
    
    def get_df_countries(filename: Path) -> set[str]:
        # construct a set because country names may be duplicated in the spreadsheet column
        # this potentially improves runtime performance when parsing the country names later
        return set(pd.read_excel(filename)["CountryName"])
    
    if __name__ == "__main__":
        names = all_names() | all_official_names()
        for country in get_df_countries(Path(FILENAME)):
            status = "valid" if country in names else "invalid"
            print(f"{country} is {status}")
    

    Caveat Emptor

    You're limited to 2 forms of country names provided by the pycountry module. Due to typing errors and other subtle variations of country names you may get false negatives from this code