Tags: python-3.x, pandas, dataframe, geolocation

Unable to get country names from a bunch of IP addresses in a pandas DataFrame


I have a pandas DataFrame df_test containing IP addresses, like below:

     |  cs-username |   c-ip      |
     +--------------+-------------+
     |-             | 70.80.84.76 |           
     |-             | 70.80.84.76 |
     |-             | 70.80.84.76 |
     |-             | 70.80.84.76 |

My goal is to get the country name for each IP address, and I have used DbIpCity from ip2geotools. So I have written the code below.

from ip2geotools.databases.noncommercial import DbIpCity

#Your code
df_test['Country'] = df_test.apply(lambda row: DbIpCity.get(row['c-ip'],api_key='free').country, axis=1)

However, this results in the error below:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-3772268ef132> in <module>()
      2 
      3 #Your code
----> 4 df_test['Country'] = df_test.apply(lambda row: DbIpCity.get(row['c-ip'],api_key='free').country, axis=1)

5 frames
/usr/local/lib/python3.7/dist-packages/ip2geotools/databases/noncommercial.py in get(ip_address, api_key, db_path, username, password)
     65         # format data
     66         ip_location.country = content['countryCode']
---> 67         ip_location.region = content['stateProv']
     68         ip_location.city = content['city']
     69 

KeyError: 'stateProv'

For reference, the code is in the last cell of this Colab notebook: https://colab.research.google.com/drive/1zz1LZ2uOAp1YsX0x0CJfvcM21XGkeCO5?usp=sharing

So how can I resolve this error?

Thanks


Solution

  • The KeyError is raised when the lookup response for an IP address is missing an expected field (here 'stateProv'), which happens when no data is available for that address. To keep the script from stopping, you could catch the exception. But because the ip2geotools library has a request limit, I decided to go with geolocation-db instead (and used a for loop instead of a lambda):

    import pandas as pd
    import numpy as np
    import urllib.request
    import json

    df = pd.read_csv('temp.csv')
    countries = []
    ips = []

    # Get country info from https://geolocation-db.com
    def getCountry(ip):
        with urllib.request.urlopen("https://geolocation-db.com/jsonp/" + ip) as url:
            data = url.read().decode()
            # The response is JSONP (JSON wrapped in a function call);
            # keep only the JSON payload between the parentheses
            data = data.split("(")[1].strip(")")
            return json.loads(data)['country_name']

    for index, row in df.iterrows():
        # Get IP data
        data = row['c-ip']
        # Only look up IPs that have not been seen yet
        if data not in ips:
            print(data)
            ips.append(data)
            #response = DbIpCity.get(row['c-ip'], api_key='free')
            response = getCountry(data)
            if response is not None:
                print(response)

                # Add to country list
                countries.append(response)

            # If country is None, add np.nan instead of None
            else:
                print(np.nan)
                countries.append(np.nan)

    # Insert all data into a new df
    ips = {'ip': ips,
           'country': countries,
           }

    df_ips = pd.DataFrame(ips, columns=['ip', 'country'])
    print(df_ips)
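The `split("(")[1]` step in getCountry would cut the payload short if a value in the response ever contained a parenthesis. A slightly more defensive way to unwrap JSONP could look like the sketch below. `parse_jsonp` and the sample string are made up for illustration and are not taken from the geolocation-db documentation.

```python
import json


def parse_jsonp(text):
    """Extract the JSON payload from a JSONP string such as 'name({...})'."""
    start = text.index("(")   # first opening parenthesis starts the payload
    end = text.rindex(")")    # last closing parenthesis ends it
    return json.loads(text[start + 1:end])


# Hypothetical sample response, only to show the parenthesis edge case
sample = 'callback({"country_name": "Canada (example)", "city": null})'
print(parse_jsonp(sample)["country_name"])
```

Both `index` and `rindex` raise ValueError when no parenthesis is found, so a malformed response fails loudly instead of being parsed incorrectly.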
    

    Because your CSV file is so large, filter out duplicate IPs (as the loop above does with `if data not in ips`) so that each address is looked up only once.
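If you want the country as a column on the original DataFrame, as in the question, one possible way to combine that with the duplicate filter is to look up each distinct IP once and map the results back. This is only a sketch: `add_country_column` and its `lookup` parameter are hypothetical names, and `lookup` stands for any function that takes an IP string and returns a country (for example the getCountry function above).

```python
import pandas as pd


def add_country_column(df, lookup, ip_col="c-ip", out_col="Country"):
    """Return a copy of df with a country column, calling lookup once per distinct IP."""
    unique_ips = df[ip_col].dropna().unique()
    # One lookup per distinct address, no matter how often it repeats
    ip_to_country = {ip: lookup(ip) for ip in unique_ips}
    out = df.copy()
    # Map every row's IP to its cached country; missing IPs become NaN
    out[out_col] = out[ip_col].map(ip_to_country)
    return out


# Example with a stand-in lookup so the sketch runs without network access
df_test = pd.DataFrame({"c-ip": ["70.80.84.76"] * 4})
print(add_country_column(df_test, lambda ip: "placeholder"))
```

Using a dict for the cache also avoids the repeated list scans that `if data not in ips` performs on a large file.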

    I also found these errors in your log:

    ERROR: geoip2 4.1.0 has requirement requests<3.0.0,>=2.24.0, but you'll have requests 2.23.0 which is incompatible.
    ERROR: geoip2 4.1.0 has requirement urllib3<2.0.0,>=1.25.2, but you'll have urllib3 1.24.3 which is incompatible.
    

    You might have to upgrade them; try running pip install --upgrade requests urllib3.
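If you would rather stay with ip2geotools, the exception-handling route mentioned at the top of this answer might look roughly like the sketch below. `safe_country` is a hypothetical helper, not part of ip2geotools. It only catches the KeyError shown in the traceback; the library may raise other errors (for instance when its request limit is hit), and those are not handled here.

```python
import math


def safe_country(ip, lookup, default=math.nan):
    """Return lookup(ip).country, or default if the lookup raises KeyError."""
    try:
        return lookup(ip).country
    except KeyError:
        # Raised when the response lacks a field such as 'stateProv'
        return default


# Assumed usage with the call from the question (needs ip2geotools and network):
# from ip2geotools.databases.noncommercial import DbIpCity
# df_test['Country'] = df_test['c-ip'].apply(
#     lambda ip: safe_country(ip, lambda i: DbIpCity.get(i, api_key='free')))
```

Rows whose lookup fails then get NaN in the Country column instead of stopping the whole apply.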