Tags: python, pandas, regex, dataframe, text-extraction

Pandas Extract Phone Number if it is in Correct Format


I have a column of phone numbers. They are usually formatted as (555) 123-4567, but some are in a different format or are not valid numbers at all. I want to convert this field to digits only, removing any non-numeric characters, but only when the value contains exactly 10 digits.

How can I apply a function that extracts just the digits when the field contains exactly 10 of them?

I tried to use:

    df['PHONE'] = df['PHONE'].str.extract(r'(\d+)', expand=False)

But this only extracts the first run of digits (the area code). How do I pull out all the digits, and only do so when the field contains exactly 10 of them?

My expected output would be 5551234567


Solution

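  • Figured it out. I wrote a function that strips the non-digit characters and returns the result only when exactly 10 digits remain; otherwise it returns the original value unchanged. I then apply it to the phone column: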

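    def extractNums(number):
        # Leave missing or non-string values (e.g. NaN) untouched.
        if not isinstance(number, str):
            return number
        # Keep only the decimal digit characters.
        digits = [ch for ch in number if ch.isdecimal()]
        # Return the bare digits only when there are exactly 10 of them.
        if len(digits) == 10:
            return "".join(digits)
        return number

    df['PHONE'] = df['PHONE'].apply(extractNums)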
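
For larger frames, the same rule can probably be expressed with pandas string methods instead of a row-by-row `apply`. The following is only a sketch under a few assumptions: the sample data is made up for illustration, the column holds strings or missing values, and `digits` and `has_ten` are hypothetical helper names rather than anything from the question.

```python
import pandas as pd

# Made-up sample data covering the cases described in the question.
df = pd.DataFrame({
    "PHONE": ["(555) 123-4567", "555.123.4567", "123-4567",
              "1-555-123-4567", "n/a", None]
})

# Remove every non-digit character; missing values stay missing.
digits = df["PHONE"].str.replace(r"\D", "", regex=True)

# True only where exactly 10 digits remain (missing values count as False).
has_ten = digits.str.fullmatch(r"\d{10}", na=False)

# Swap in the cleaned value where the check passes; keep the original elsewhere.
df["PHONE"] = df["PHONE"].mask(has_ten, digits)

print(df)
```

Entries with 7 or 11 digits, free text such as "n/a", and missing values are left exactly as they were, which matches the behaviour of the function above.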