Search code examples
pythonpandasdataframestartswith

Python Pandas compare two dataframes to assign country to phone number


I have two dataframes that I read in via csv. Dataframe one consists of a phone number and some additional data. The second dataframe contains country codes and country names.

I want to take the phone number from the first dataset and compare it to the country codes of the second. Country codes can between one to four digits long. I go from the longest country code to the shortest. If there is a match, I want to assign the country name to the phone number.

Input longlist:

phonenumber, add_info    
34123425209, info1
92654321762, info2
12018883637, info3
6323450001, info4
496789521134, info5

Input country_list:

country;country_code;order_info
Spain;34;1
Pakistan;92;4
USA;1;2
Philippines;63;3
Germany;49;4
Poland;48;1
Norway;47;2

Output should be:

phonenumber, add_info, country, order_info    
34123425209, info1, Spain, 1
92654321762, info2, Pakistan, 4
12018883637, info3, USA, 2
6323450001, info4, Philippines, 3
496789521134, info5, Germany, 4

I have it solved once like this:

#! /usr/bin/python
import csv
import pandas
with open ('longlist.csv','r') as lookuplist:
with open ('country_list.csv','r') as inputlist:
    with open('Outputfile.csv', 'w') as outputlist:
        reader = csv.reader(lookuplist, delimiter=',')
        reader2 = csv.reader(inputlist, delimiter=';')
        writer = csv.writer(outputlist, dialect='excel')

        for i in reader2:
            for xl in reader:
                if xl[0].startswith(i[1]):
                    zeile = [xl[0], xl[1], i[0], i[1], i[2]]
                    writer.writerow(zeile)
            lookuplist.seek(0)

But I would like to solve this problem, using pandas. What I got to work: - Read in the csv files - Remove duplicates from "longlist" - Sort list of countries / country code

This is, what I have working already:

import pandas as pd, numpy as np
longlist = pd.read_csv('path/to/longlist.csv', 
                                 usecols=[2,3], names=['PHONENUMBER','ADD_INFO'])
country_list = pd.read_csv('path/to/country_list.csv', 
                           sep=';', names=['COUNTRY','COUNTRY_CODE','ORDER_INFO'], skiprows=[0])

# remove duplicates and make phone number an index
longlist = longlist.drop_duplicates('PHONENUMBER')
longlist = longlist.set_index('PHONENUMBER')

# Sort country list, from high to low value and make country code an index
country_list=country_list.sort_values(by='COUNTRY_CODE', ascending=0)
country_list=country_list.set_index('COUNTRY_CODE')

(...)

longlist.to_csv('path/to/output.csv')

But any way trying the same with datasets does not work. I cannot apply startswith (cannot iterate through objects and cannot apply it on objects). I would really appreciate your help.


Solution

  • i would do it this way:

    cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code':str})
    ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype={'phonenumber':str})
    
    lookup = cl['country_code']
    lookup.index = cl['country_code']
    
    ll['country_code'] = (
        ll['phonenumber']
        .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
                                    lookup.get(x[:2]), lookup.get(x[:1])]))
        .apply(lambda x: x.get(x.first_valid_index()), axis=1)
    )
    
    # remove `how='left'` parameter if you don't need "unmatched" phone-numbers    
    result = ll.merge(cl, on='country_code', how='left')
    

    Output:

    In [195]: result
    Out[195]:
        phonenumber add_info country_code      country  order_info
    0   34123425209    info1           34        Spain         1.0
    1   92654321762    info2           92     Pakistan         4.0
    2   12018883637    info3            1          USA         2.0
    3   12428883637   info31         1242      Bahamas         3.0
    4    6323450001    info4           63  Philippines         3.0
    5  496789521134    info5           49      Germany         4.0
    6   00000000000      BAD         None          NaN         NaN
    

    Explanation:

    In [216]: (ll['phonenumber']
       .....:   .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
       .....:                               lookup.get(x[:2]), lookup.get(x[:1])]))
       .....: )
    Out[216]:
          0     1     2     3
    0  None  None    34  None
    1  None  None    92  None
    2  None  None  None     1
    3  1242  None  None     1
    4  None  None    63  None
    5  None  None    49  None
    6  None  None  None  None
    

    phones.csv: - i've intentionally added one Bahamas number (1242...) and one invalid number (00000000000)

    phonenumber, add_info
    34123425209, info1
    92654321762, info2
    12018883637, info3
    12428883637, info31
    6323450001, info4
    496789521134, info5
    00000000000, BAD