Tags: python-3.x, pandas, gis, geopandas

Extracting countries from string


I am trying to go through a column of a data frame in Python 3. For each row, I need to extract the countries that are mentioned, repeated as many times as they are mentioned. For example, if I have this row:

['[Aydemir, Deniz', ' Gunduz, Gokhan', ' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey', ' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden']

it needs to output a list: ['Turkey', 'Sweden']

and if I have this row:

['[Fang, Qun', ' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China', ' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China']

the output should be: ['China', 'China'].

I have written this code, but it does not work the way I want:

from geotext import GeoText
sentence = df.iloc[0,0]
places = GeoText(sentence)
print(places.countries)

It prints each country only once, and when the country is USA it doesn't recognize the abbreviation. Can you help me figure out what to do? Here is a sample of my data:

import pandas as pd

l = [['[Aydemir, Deniz\', \' Gunduz, Gokhan\', \' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey\', \' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden',1990],
 ['[Fang, Qun\', \' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China\', \' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China',2005],
 ['[Blumentritt, Melanie\', \' Gardner, Douglas J.\', \' Shaler, Stephen M.] Univ Maine, Sch Resources, Orono, ME USA\', \' [Cole, Barbara J. W.] Univ Maine, Dept Chem, Orono, ME 04469 USA',2012]]
dataf = pd.DataFrame(l, columns = ['Authors', 'Year'])

I also tried the code below, but I have the same problem: it doesn't return all the countries, only one per row:

import pycountry

def find_country(n):
    for c in pycountry.countries:
        if str(c.name).lower() in n.lower():
            # Returns at the first match, so at most one country per row
            return c.name

country1 = (dataf['Authors']
            .replace(r"\bUSA\b", "United States", regex=True)
            .apply(lambda x: find_country(x)))

Solution

  • USA does not seem to be detected correctly by geotext, so it may be worth raising an issue with that package. As a workaround, I replace USA with United States, which is detected correctly.

    import geotext

    # Normalize "USA" first, then collect every country mention per row
    df = (dataf['Authors']
          .replace(r"\bUSA\b", "United States", regex=True)
          .apply(lambda x: geotext.GeoText(x).countries)
    )
    

    I'm not sure what you were doing before, but this will get the list of countries for each row of the Authors column, including duplicates.

    0                  [Turkey, Sweden]
    1                    [China, China]
    2    [United States, United States]
    Name: Authors, dtype: object
    

    As mentioned in the comment, if you want to have an actual list of lists, just add tolist() to the end.

    df.tolist()
    
    [['Turkey', 'Sweden'], ['China', 'China'], ['United States', 'United States']]
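
If installing geotext is not an option, or more abbreviations than USA turn up, the same idea (normalize spellings, then collect every mention) can be sketched with the standard library alone. This is only a minimal sketch, not what geotext does internally: the `ALIASES` table and the helper names `extract_countries` and `count_countries` are made up for illustration, and the table covers just the spellings in the sample data, so it would need to be extended for real data.

```python
import re
from collections import Counter

# Hypothetical alias table: maps spellings seen in the sample data to one
# canonical name. It is NOT an exhaustive country list; extend as needed.
ALIASES = {
    "Turkey": "Turkey",
    "Sweden": "Sweden",
    "Peoples R China": "China",
    "China": "China",
    "USA": "United States",
    "United States": "United States",
}

# Try longer aliases first so "Peoples R China" is matched as one mention
# rather than leaving a bare "China" to be matched on its own.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b"
)

def extract_countries(text):
    """Return every country mention in order of appearance, duplicates kept."""
    return [ALIASES[m.group(0)] for m in _PATTERN.finditer(text)]

def count_countries(text):
    """Return a Counter of canonical country name -> number of mentions."""
    return Counter(extract_countries(text))

row = ("[Blumentritt, Melanie', ' Gardner, Douglas J.', ' Shaler, Stephen M.] "
       "Univ Maine, Sch Resources, Orono, ME USA', "
       "' [Cole, Barbara J. W.] Univ Maine, Dept Chem, Orono, ME 04469 USA")

print(extract_countries(row))  # ['United States', 'United States']
print(count_countries(row))
```

With the data frame from the question, `dataf['Authors'].apply(extract_countries)` should give one list per row, and `count_countries` gives the number of mentions per country if that is what is needed. The match is case-sensitive and only finds names present in the table, which avoids false positives but misses anything not listed.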