Hej!
I am working on some data analysis of country statistics. I now use data from different sources, and as soon has been seen, countries get called differently some times: while the World Bank calls it "United Kingdom and Northern Ireland", the WHO might call it simply "United Kingdom" and mean the same political construct (I am aware of the fact that England, Scotland and Wales are the "countries" not really UK).
I created a dictionary where I took most of the different names to standardise it to World Bank Data. This works like a charm in a list, but I need it in a pandas DataFrame, that I get from pd.read_csv
.
In example: If I have a very short dictionary
dict = {'US': 'USA'}
how can I translate this within my dataframe (set the column in df.country
to the dict.key
value)?
Showing it in example:
ID country val
1 US some values
to:
ID country val
1 USA some values
For my list conversion I used the following construct, where listB
is the input and output list:
for key in dict:
listB = [w.replace(key, dict[key]) for w in listB]
Any suggestions how to do this the most easily? Any help would be awesome!
P.S: On a further note, does anyone have an idea how to generate ISO 3166-1 alpha-3 codes (like Germany = GER, Sweden = SWE and so on?). That might be an extension of the question above.
Use replace
:
df['country'] = df['country'].replace(dic)
And for ISO 3166-1 alpha-3 check answers.
I think simpliest is download it from here.
If want parse code from wikipedia
is possible use this solution or rewrited for DataFrame
in python 3
:
from bs4 import BeautifulSoup
import requests
url = "http://en.wikipedia.org/wiki/ISO_3166-1"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
t = soup.findAll('table', {'class' : 'wikitable sortable'})[1]
L = []
cs = [th.findAll(text=True)[0] for th in t.findAll('th')]
for row in t.findAll("tr")[1:]:
tds = row.findAll('td')
raw_cols = [td.findAll(text=True) for td in tds]
cols = []
# country field contains differing numbers of elements, due to the flag --
# only take the name
cols.append(raw_cols[0][-1:][0])
# for all other columns, use the first result text
cols.extend([col[0] for col in raw_cols[1:]])
L.append(cols)
df = pd.DataFrame(L, columns=cs)
print (df.head())
English short name (upper/lower case) Alpha-2 code Alpha-3 code \
0 Afghanistan AF AFG
1 Åland Islands AX ALA
2 Albania AL ALB
3 Algeria DZ DZA
4 American Samoa AS ASM
Numeric code Link to Independent
0 004 ISO 3166-2:AF Yes
1 248 ISO 3166-2:AX No
2 008 ISO 3166-2:AL Yes
3 012 ISO 3166-2:DZ Yes
4 016 ISO 3166-2:AS No