Search code examples
pythonpandasdataframesubstring

How to print just selected substrings (contained in a dataframe column) setting conditions with pandas


Hy everyone. I have a huge dataset in which I have several nations indicated by ISO-codes. Anyway, there are some nations that are shown with their official name but not with ISO code. I want to find them and then replace them with respective iso-codes.

This is the example of the df I have:

| TERRITORY               |
 -----------------------
| IT, GB, USA, France     |
| ES, Russia, Germany, PT |
| EG, LY, DZ              |

Expected output:

'The nations that were not converted are:' France, Russia, Germany

The big problem is that these nations are in the same cells and are considered as a single value. I thought to ask the program to print just that substrings which are longer than two characters, but after different attempts, I didn't get anything.

Could someone help me?


Solution

  • IIUC, you could split+explode and map to a known list of codes (here using pycountry):

    import pycountry
    codes = {c.alpha_2 for c in pycountry.countries}
    # or manually set
    # codes = {'IT', 'GB', 'USA', 'FR'...}
    
    s = df['TERRITORY'].str.split(', ').explode().drop_duplicates()
    print(f'The nations that were not converted are: {", ".join(s[~s.isin(codes)])}')
    

    output:

    The nations that were not converted are: USA, France, Russia, Germany