Search code examples
pythonpandasdataframenaming

Country Data: Renaming countries according to dict in pandas


Hej!

I am working on some data analysis of country statistics. I now use data from different sources, and as soon has been seen, countries get called differently some times: while the World Bank calls it "United Kingdom and Northern Ireland", the WHO might call it simply "United Kingdom" and mean the same political construct (I am aware of the fact that England, Scotland and Wales are the "countries" not really UK).

I created a dictionary where I took most of the different names to standardise it to World Bank Data. This works like a charm in a list, but I need it in a pandas DataFrame, that I get from pd.read_csv. In example: If I have a very short dictionary

dict = {'US': 'USA'}

how can I translate this within my dataframe (set the column in df.country to the dict.key value)?

Showing it in example:

ID  country   val
1   US        some values

to:

ID  country  val
1   USA      some values

For my list conversion I used the following construct, where listB is the input and output list:

for key in dict:
    listB = [w.replace(key, dict[key]) for w in listB]

Any suggestions how to do this the most easily? Any help would be awesome!

P.S: On a further note, does anyone have an idea how to generate ISO 3166-1 alpha-3 codes (like Germany = GER, Sweden = SWE and so on?). That might be an extension of the question above.


Solution

  • Use replace:

    df['country'] = df['country'].replace(dic)
    

    And for ISO 3166-1 alpha-3 check answers.

    I think simpliest is download it from here.

    If want parse code from wikipedia is possible use this solution or rewrited for DataFrame in python 3:

    from bs4 import BeautifulSoup
    import requests
    
    url = "http://en.wikipedia.org/wiki/ISO_3166-1"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    
    t = soup.findAll('table', {'class' : 'wikitable sortable'})[1]
    L = []
    cs = [th.findAll(text=True)[0] for th in t.findAll('th')]
    
    for row in t.findAll("tr")[1:]:
        tds = row.findAll('td')
        raw_cols = [td.findAll(text=True) for td in tds]
        cols = []
        # country field contains differing numbers of elements, due to the flag -- 
        # only take the name
        cols.append(raw_cols[0][-1:][0])
        # for all other columns, use the first result text
        cols.extend([col[0] for col in raw_cols[1:]])
        L.append(cols)
    
    df = pd.DataFrame(L, columns=cs)
    

    print (df.head())
      English short name (upper/lower case) Alpha-2 code Alpha-3 code  \
    0                           Afghanistan           AF          AFG   
    1                         Åland Islands           AX          ALA   
    2                               Albania           AL          ALB   
    3                               Algeria           DZ          DZA   
    4                        American Samoa           AS          ASM   
    
      Numeric code       Link to  Independent  
    0          004  ISO 3166-2:AF         Yes  
    1          248  ISO 3166-2:AX          No  
    2          008  ISO 3166-2:AL         Yes  
    3          012  ISO 3166-2:DZ         Yes  
    4          016  ISO 3166-2:AS          No