python, list, parsing, data-conversion

Parsing Python list into pandas.DataFrame with keywords


I have a list of countries that should be turned into a DataFrame. The problem is that every word of a country name and every data value is a separate element of the list. Example:

[
 'Viet',
 'Nam',
 '0',
 '12.3',
 '0',
 'Brunei',
 'Darussalam',
 '12',
 '1.1',
 '0',
 'Bosnia',
 'and',
 'Herzegovina',
 '2',
 '2.1',
 '0',
 'Not',
 'applicable',
 'Turkey',
 '4',
 '4.3',
 '0',
 'Only',
 'partial',
 'coverage'
...
]

How can I convert this into a list of lists such as [ ['Viet Nam', '0', '12.3', '0'], ['Brunei Darussalam', '12', '1.1', ...], ... ] or into a `pd.DataFrame`:

             country  coef1  coef2  grade
0           Viet Nam      0   12.3      0
1  Brunei Darussalam     12    1.1      0

NOTE: Some country names are a single word, like China or France, and some have three or more words, like Republic of Korea. Also, the series of numbers is sometimes followed by a remark.


Solution

  • Try this:

    Here data_in is the data you want to parse and countries is a list of all the countries of the world.

    import pandas as pd
    import re
    
    countries = ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia" ...]
    
    data_in = [
        'Viet', 'Nam', '0', '12.3', '0', 'Brunei', 'Darussalam', '12', '1.1', '0', 'Bosnia', 'and', 'Herzegovina', '2', '2.1', '0', 'Not', 'applicable', 'Turkey', '4', '4.3', '0'
    ]
    
    data_out = []

    def is_country(elem):
      # Loose check: True if elem is a substring of any known country name.
      # Note that a remark word such as 'and' would also pass this test.
      return any(elem.lower() in country.lower() for country in countries)

    def is_num(elem):
      # True if elem contains at least one digit.
      return re.search(r'\d', elem) is not None

    idx = 0
    while idx < len(data_in):
      if not is_country(data_in[idx]):
        # Skip remark words such as 'Not', 'applicable'.
        idx += 1
        continue
      # Collect the words of the country name up to the first number.
      name_parts = []
      while idx < len(data_in) and not is_num(data_in[idx]):
        name_parts.append(data_in[idx])
        idx += 1
      # Collect the numeric values that follow the name.
      elements = []
      while idx < len(data_in) and is_num(data_in[idx]):
        elements.append(data_in[idx])
        idx += 1
      # idx now points at the first token after the numbers, so it is not advanced again.
      data_out.append([' '.join(name_parts)] + elements)

    df = pd.DataFrame(data_out, columns=['country', 'coef1', 'coef2', 'grade'])
    print(df)


    pandas.DataFrame output, assuming every country in data_in is found in countries:

                      country coef1 coef2 grade
    0                Viet Nam     0  12.3     0
    1       Brunei Darussalam    12   1.1     0
    2  Bosnia and Herzegovina     2   2.1     0
    3                  Turkey     4   4.3     0
    

    This is a nonstandard solution, but it works.
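
A stricter variant of the same idea is sketched below. It is not the answer's method: instead of a substring test per word, it collects the words in front of each run of numbers and keeps the longest trailing run of words that exactly matches a known country name, so a preceding remark like "Not applicable" is dropped. The names `KNOWN_COUNTRIES`, `is_number`, `longest_country_suffix` and `parse_rows` are hypothetical, and the name set is shortened for illustration; it assumes the names are spelled exactly as they appear in the data.

```python
# Hypothetical, shortened name set; a real one would hold every country name
# exactly as it is spelled in the data.
KNOWN_COUNTRIES = {"Viet Nam", "Brunei Darussalam", "Bosnia and Herzegovina", "Turkey"}


def is_number(token):
    """Return True if the token parses as a float, e.g. '12' or '4.3'."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def longest_country_suffix(words):
    """Return the longest run of trailing words that is a known country, or None."""
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate in KNOWN_COUNTRIES:
            return candidate
    return None


def parse_rows(tokens):
    """Turn the flat token list into [country, number, number, ...] rows."""
    rows = []
    words = []  # non-numeric tokens seen since the last run of numbers
    idx = 0
    while idx < len(tokens):
        if not is_number(tokens[idx]):
            words.append(tokens[idx])
            idx += 1
            continue
        # A run of numbers starts here; collect all of it.
        numbers = []
        while idx < len(tokens) and is_number(tokens[idx]):
            numbers.append(tokens[idx])
            idx += 1
        country = longest_country_suffix(words)
        if country is not None:
            rows.append([country] + numbers)
        words = []  # anything in front of the name was a remark; drop it
    return rows


data_in = [
    'Viet', 'Nam', '0', '12.3', '0',
    'Brunei', 'Darussalam', '12', '1.1', '0',
    'Bosnia', 'and', 'Herzegovina', '2', '2.1', '0', 'Not', 'applicable',
    'Turkey', '4', '4.3', '0', 'Only', 'partial', 'coverage',
]
rows = parse_rows(data_in)
print(rows)
```

The resulting `rows` can be passed to `pd.DataFrame(rows, columns=['country', 'coef1', 'coef2', 'grade'])`. The values are still strings at that point; applying `pd.to_numeric` to the three value columns would convert them to numbers, as in the table shown in the question.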