Search code examples
pythonregexstringsymbols

Replace values in a column based on conditions derived from patterns (possibly Regular expressions if needed)


I have a column of home addresses that needs to be "cleaned" and pre-processed based on several patterns. For example:

enter image description here

I have done the following:

import pandas as pd
import re

df = pd.read_csv("SampleData.csv")
df['Address'] = (df['Address'].str.split()).str.join(' ')
df.head(10)

Index 1, shows the address "9 (4f7) Adam Road". Focusing on just " (4f7) " The pattern it provides is whitespace, parentheses, integer, character, parentheses, whitespace.

The output of this address should be "9/4 Adam Road". Hence, I need to carefully select the first integer from inside the parentheses and place it next to the first occurring integer in the address line (with a forward slash).

The second example would be, "1 (Flat 12) Subang Heights, Slateford,"

The output of the second example should be: "1/12 Subang Heights, Slateford". Here, I have to remove the word Flat and place the integer 12 next to the first occurring integer. There are occurrences of comma and symbols which needs to be removed as well.

What would be the best suggestion to go about this problem ? Should it be Regular Expressions or any other Python techniques applicable ?

Could someone with experiences in handling patterns such as this help me ?

UPDATE 1 (Data used):

18/8 Andy Pitt Drive
9 (4f7) Adam Road
2nd Floor, 12 China Town
1 Kuala Lumpur Park
23 KingsRoad North
81 (5F2) Prince Street
45/2 Brian's Court
1/2 Ships Corner
9/1 Eagle's Eye
1 (Flat 12) Subang Heights, Slateford,

Solution

  • You can capture the first number and second number in two groups and the remaining alphabetical address in third group using this regex,

    ^(\d+)\D+(\d+).*?\s+([a-zA-Z]+.*?)\W*$
    

    and replace everything with \1/\2 \3

    Demo

    Let me know if you face any issues with any sample, I'll refine the regex as needed. It was hard to use your sample data as you posted them as image. Please avoid posting as images and post as text.

    Sample Python code,

    import re
    
    arr = ["18/8 Andy Pitt Drive","9 (4f7) Adam Road","2nd Floor, 12 China Town","1 Kuala Lumpur Park","23 KingsRoad North","81 (5F2) Prince Street","45/2 Brian's Court","1/2 Ships Corner","9/1 Eagle's Eye","1 (Flat 12) Subang Heights, Slateford,"]
    
    for s in arr:
     print(s, '-->', re.sub(r'^(\d+)\D+(\d+).*?\s+([a-zA-Z]+.*?)\W*$', r'\1/\2 \3', s))
    

    Prints,

    18/8 Andy Pitt Drive --> 18/8 Andy Pitt Drive
    9 (4f7) Adam Road --> 9/4 Adam Road
    2nd Floor, 12 China Town --> 2/12 China Town
    1 Kuala Lumpur Park --> 1 Kuala Lumpur Park
    23 KingsRoad North --> 23 KingsRoad North
    81 (5F2) Prince Street --> 81/5 Prince Street
    45/2 Brian's Court --> 45/2 Brian's Court
    1/2 Ships Corner --> 1/2 Ships Corner
    9/1 Eagle's Eye --> 9/1 Eagle's Eye
    1 (Flat 12) Subang Heights, Slateford, --> 1/12 Subang Heights, Slateford