I have a list of country data that should be turned into a DataFrame. The problem is that every word of a country name, and every data value, is a separate element of the list. Example:
[
'Viet',
'Nam',
'0',
'12.3',
'0',
'Brunei',
'Darussalam',
'12',
'1.1',
'0',
'Bosnia',
'and',
'Herzegovina',
'2',
'2.1',
'0',
'Not',
'applicable',
'Turkey',
'4',
'4.3',
'0',
'Only',
'partial',
'coverage'
...
]
How can I convert this into [ ['Viet Nam', '0', '12.3', '0'], ['Brunei Darussalam', '12', '1.1', ...], ... ] or a `pd.DataFrame`:
             country  coef1  coef2  grade
0           Viet Nam      0   12.3      0
1  Brunei Darussalam     12    1.1      0
NOTE: Some country names are a single word, like China or France, and some have three or more words, like Republic of Korea. Also, the series of numbers is sometimes followed by a remark.
Try this, where data_in is the data you want to parse and countries is a list of all the countries of the world:
import re

import pandas as pd

countries = [
    "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia",
    # ... (continue with every remaining country name)
]
data_in = [
    'Viet', 'Nam', '0', '12.3', '0', 'Brunei', 'Darussalam', '12', '1.1', '0', 'Bosnia', 'and', 'Herzegovina', '2', '2.1', '0', 'Not', 'applicable', 'Turkey', '4', '4.3', '0'
]
data_out = []


def is_country(elem):
    # True if elem occurs inside any known country name (case-insensitive substring match).
    return any(elem.lower() in country.lower() for country in countries)


def is_num(elem):
    # True if elem contains at least one digit.
    return re.search(r'\d', elem) is not None


idx = 0
while idx < len(data_in):
    if is_country(data_in[idx]):
        # Collect the words of the country name, up to the first number.
        name_parts = []
        while idx < len(data_in) and not is_num(data_in[idx]):
            name_parts.append(data_in[idx])
            idx += 1
        # Collect the numbers that follow the name.
        elements = []
        while idx < len(data_in) and is_num(data_in[idx]):
            elements.append(data_in[idx])
            idx += 1
        data_out.append([' '.join(name_parts)] + elements)
        # idx already points at the next unread element here, so it must not
        # be advanced again (doing so would skip the next country's first word).
    else:
        # Not part of a country name (for example a remark): skip it.
        idx += 1

df = pd.DataFrame(data_out, columns=['country', 'coef1', 'coef2', 'grade'])
print(df)

pandas.DataFrame output (given the complete countries list):
                  country  coef1  coef2  grade
0                Viet Nam      0   12.3      0
1       Brunei Darussalam     12    1.1      0
2  Bosnia and Herzegovina      2    2.1      0
3                  Turkey      4    4.3      0
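It is a nonstandard solution, but it works. Note that is_country uses a substring match, so a remark word that happens to occur inside a country name (such as 'and') would be treated as the start of a new row.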
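A possible variation, sketched here as an alternative rather than as part of the answer above: match whole country names word by word instead of testing single words, and keep any remark in its own column. The names COUNTRIES, NAME_TOKENS, match_country and parse are hypothetical, the country list is shortened for illustration, and the sketch assumes that exactly three values follow each country name.

```python
# Hypothetical, shortened country list; a real run needs every name that can occur.
COUNTRIES = ["Viet Nam", "Brunei Darussalam", "Bosnia and Herzegovina", "Turkey"]
# Each name as a tuple of words, longest names first so the longest match wins.
NAME_TOKENS = sorted((tuple(c.split()) for c in COUNTRIES), key=len, reverse=True)


def match_country(tokens, start):
    """Return the country name whose words begin at tokens[start], or None."""
    for words in NAME_TOKENS:
        if tuple(tokens[start:start + len(words)]) == words:
            return " ".join(words)
    return None


def parse(tokens, n_values=3):
    """Split a flat token list into [country, value..., remark] rows.

    Assumes n_values values follow every country name.
    """
    rows = []
    i = 0
    while i < len(tokens):
        name = match_country(tokens, i)
        if name is None:
            # Not the start of a country name: append the word to the
            # remark of the previous row, if there is one.
            if rows:
                rows[-1][-1] = (rows[-1][-1] + " " + tokens[i]).strip()
            i += 1
            continue
        i += len(name.split())
        values = list(tokens[i:i + n_values])
        i += len(values)
        # Pad with None if the list ends before all values were seen.
        values += [None] * (n_values - len(values))
        rows.append([name, *values, ""])
    return rows


data_in = [
    'Viet', 'Nam', '0', '12.3', '0', 'Brunei', 'Darussalam', '12', '1.1', '0',
    'Bosnia', 'and', 'Herzegovina', '2', '2.1', '0', 'Not', 'applicable',
    'Turkey', '4', '4.3', '0', 'Only', 'partial', 'coverage',
]
rows = parse(data_in)
for row in rows:
    print(row)
```

The rows can then be passed to pd.DataFrame(rows, columns=['country', 'coef1', 'coef2', 'grade', 'remark']). One limitation of this sketch: a remark that itself begins with a country name would still be read as a new row.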