python geolocation country sanitization geograpy

Extracting country information from description using geograpy


PROBLEM: I want to extract country information from a user description. So far, I have been trying the geograpy package. I like its behavior when the input is ambiguous, for example Evesham or Rochdale. However, the package interprets strings like Zaragoza, Spain as two mentions, even though the user is clearly saying that their location is in Spain. Also, I don't know why amsterdam does not give Holland as output. How can I improve the outputs? Am I missing anything important? Is there a better package to achieve this?

DATA: My data example is:

                   user_location
2  Socialist Republic of Alachua
3                Hérault, France
4                 Gwalior, India
5                Zaragoza,España
7                     amsterdam 
8                        Evesham
9                       Rochdale

I want to get something like this:

                   user_location country
2  Socialist Republic of Alachua ['USSR', 'United States']
3                Hérault, France ['France']
4                 Gwalior, India ['India'] 
5                Zaragoza,España ['Spain']
7                     amsterdam  ['Holland']
8                        Evesham ['United Kingdom']
9                       Rochdale ['United Kingdom', 'United States']

REPREX:

import pandas as pd
import geograpy  # installed via the geograpy3 package; the call below uses the module name geograpy

df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})

df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)

print(df)
#>                    user_location                                            country
#> 2  Socialist Republic of Alachua  [USSR, Union of Soviet Socialist Republics, Al...
#> 3                Hérault, France                                  [France, Hérault]
#> 4                 Gwalior, India   [British Indian Ocean Territory, Gwalior, India]
#> 5                Zaragoza,España             [Zaragoza, España, Spain, El Salvador]
#> 7                     amsterdam                                                  []
#> 8                        Evesham                          [Evesham, United Kingdom]
#> 9                       Rochdale          [Rochdale, United Kingdom, United States]

Created on 2020-06-02 by the reprexpy package
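
The lists in the output above mix place names with country names (for example [France, Hérault]). One way to narrow them down is to filter the candidates against a reference set of country names. The following is only a minimal sketch: KNOWN_COUNTRIES and keep_countries are hypothetical names that are not part of geograpy, and the set is deliberately tiny (a complete list would have to come from a real source of country names).

```python
# Hypothetical, deliberately tiny reference set (an assumption for this sketch).
# A real filter would need a complete list of country names.
KNOWN_COUNTRIES = {"France", "India", "Spain", "United Kingdom", "United States"}


def keep_countries(candidates, known=KNOWN_COUNTRIES):
    """Return the candidates that are known country names, in order, without duplicates."""
    seen = set()
    result = []
    for name in candidates:
        if name in known and name not in seen:
            seen.add(name)
            result.append(name)
    return result


print(keep_countries(["France", "Hérault"]))
# ['France']
print(keep_countries(["Rochdale", "United Kingdom", "United States"]))
# ['United Kingdom', 'United States']
```

This filter only removes entries that are not countries. It does not remove a country that was detected by mistake, so it would not help with a case like El Salvador being reported for Zaragoza,España.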


Solution

  • geograpy3 was no longer behaving correctly for country lookups because it didn't check whether pycountry returned None. As a committer, I have just fixed this. I have added your example, slightly modified to avoid the pandas import, as a unit test case:

    def testStackoverflow62152428(self):
        '''
        see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
        '''
        examples = {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}
        # print the countries detected for each example text
        for index, text in examples.items():
            places = geograpy.get_geoPlace_context(text=text)
            print("example %d: %s" % (index, places.countries))
    

    The result is now:

    example 2: ['United States']
    example 3: ['France']
    example 4: ['British Indian Ocean Territory', 'India']
    example 5: ['Spain', 'El Salvador']
    example 7: []
    example 8: ['United Kingdom']
    example 9: ['United Kingdom', 'United States']
    

    Indeed, there is still room for improvement in example 5. I have opened an issue at https://github.com/somnathrakshit/geograpy3/issues/7 - please stay tuned ...
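
Until that issue is resolved, one possible workaround for inputs such as Zaragoza,España is to trust a country that the user wrote explicitly after the last comma, and to fall back to the detected list otherwise. This is a minimal sketch under stated assumptions, not part of geograpy3: COUNTRY_ALIASES, explicit_country and resolve are hypothetical names, and the alias table is a small hand-written example.

```python
# Hypothetical alias table (an assumption for this sketch): lowercase
# spellings mapped to one canonical country name.
COUNTRY_ALIASES = {
    "spain": "Spain",
    "españa": "Spain",
    "france": "France",
    "india": "India",
    "holland": "Netherlands",
}


def explicit_country(text, aliases=COUNTRY_ALIASES):
    """Return the country written after the last comma, or None if there is none."""
    if "," not in text:
        return None
    last_part = text.rsplit(",", 1)[1].strip().lower()
    return aliases.get(last_part)


def resolve(text, detected):
    """Prefer an explicitly written country over the detected candidates."""
    country = explicit_country(text)
    return [country] if country else list(detected)


print(resolve("Zaragoza,España", ["Spain", "El Salvador"]))
# ['Spain']
print(resolve("Gwalior, India", ["British Indian Ocean Territory", "India"]))
# ['India']
print(resolve("Rochdale", ["United Kingdom", "United States"]))
# ['United Kingdom', 'United States']
```

The heuristic only helps when the text follows a "place, country" pattern and the country spelling is in the alias table. Inputs without a comma, such as amsterdam, are passed through unchanged.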