Search code examples
pythonpandasdataframedata-cleaning

Best way to clean phone numbers if they have different country codes


What I ideally want is data in this particular format. I've shown some below to get an idea

Country Code Phone_number Corrected Phone Number
GB 0131 496 0902 +441314960902
GB 0141 4960760 +441414960760
GB +44117 4960589 +441174960589
US 191.040.0076x2520 +11910400076x2520
DE 05532 56994 +49553256994
GB 01514960233
GB (0118)4960720 +441184960720
GB +44(0)114 4960045
DE +49(0)1743095110
GB +44(0)1632 960 720
GB +44117 496 0381

I haven't filled out all the corrected numbers but what i was attempted was, first to use regex to take out any parenthesis and dots from all phone numbers then replacing the 0 with the country code which will be taken from a dictionary i.e

{
    GB : +44
    DE : +49
    US : +1
} 

This hadn't worked for me since matching the parenthesis pattern, caused some issues.


Solution

  • The logic may change with rules of phone number format; but a logic that fits your requirement is as follows:

    • Remove special chars other than digits, + and letters used for extension e.g. x, ext (following keeps all alphabets).
    • Prefix ISD code by matching country code.
    • Remove 0 that follows ISD code.
    df = pd.DataFrame(data=[["GB","0131 496 0902"],["GB","0141 4960760"],["GB","+44117 4960589"],["US","191.040.0076x2520"],["DE","05532 56994"],["GB","01514960233"],["GB","(0118)4960720"],["GB","+44(0)114 4960045"],["DE","+49(0)1743095110"],["GB","+44(0)1632 960 720"],["GB","+44117 496 0381"]], columns=["Country Code","Phone_number"])
    isd_code_map = { "GB": "+44", "DE": "+49", "US": "+1" }
    
    def correct_phone_number(row):
      import re
      # Remove special chars other than digits, `+` and letters used for extension e.g. `x`, `ext` (following keeps all alphabets).
      result = re.sub("[^A-Za-z\d\+]", "", row["Phone_number"])
      
      # Prefix ISD code by matching country code.
      if not result.startswith(isd_code_map[row["Country Code"]]):
        result = isd_code_map[row["Country Code"]] + result
    
      # Remove `0` that follows ISD code.
      if result.startswith(isd_code_map[row["Country Code"]] + "0"):
        result = result.replace(isd_code_map[row["Country Code"]] + "0", isd_code_map[row["Country Code"]])
      return result
    
    df["Corrected Phone Number"] = df.apply(correct_phone_number, axis=1)
    
    print(df)
    

    Output:

       Country Code        Phone_number Corrected Phone Number
    0            GB       0131 496 0902          +441314960902
    1            GB        0141 4960760          +441414960760
    2            GB      +44117 4960589          +441174960589
    3            US   191.040.0076x2520      +11910400076x2520
    4            DE         05532 56994           +49553256994
    5            GB         01514960233          +441514960233
    6            GB       (0118)4960720          +441184960720
    7            GB   +44(0)114 4960045          +441144960045
    8            DE    +49(0)1743095110          +491743095110
    9            GB  +44(0)1632 960 720          +441632960720
    10           GB     +44117 496 0381          +441174960381