Search code examples
pythonpandasextract

Extract store and location using regex and pandas


I want to extract store name and location from the to_clean column. I would like to use some regex pattern that accounts for all the different patterns and stores the store name in the store column and the location name in the location column.

Something like this:

to_clean = ['ACTIVE BODY #1 - WEST VAN', 'BUY-LOW (80018) - KINGSGATE', 'CHOICES - CAMBIE #906',
     'CAUSEWAY MASSET - CAMBIE', 'COMMUNITY NATURAL S2 - CHINOOK', 'MEINHARDT - GRANVILLE (80068)',
     'WFM - CAMBIE - 10248']

store = ['ACTIVE BODY', 'BUY-LOW', 'CHOICES', 'CAUSEWAY MASSET', 'COMMUNITY NATURAL', 'MEINHARDT', 'WFM']

location = ['WEST VAN', 'KINGSGATE', 'CAMBIE','CAMBIE', 'CHINOOK', 'GRANVILLE', 'CAMBIE']

data = pd.DataFrame(list(zip(to_clean, store, location)), columns=['to_clean', 'store', 'location'])
data

enter image description here

Any help is greatly appreciated.

Thank you


Solution

  • Try (I stored the new values under _new columns):

    data["store_new"] = (
        data["to_clean"]
        .str.extract(r"^((?:[A-Z]|(?<=[A-Z])-|\s)+)\b")[0]
        .str.strip()
    )
    data["location_new"] = (
        data["to_clean"]
        .str.extract(r"((?:[A-Z]|(?<=[A-Z])-|\s)+)[^A-Z]*$")[0]
        .str.strip()
    )
    
    print(data)
    

    Prints:

                             to_clean              store   location          store_new location_new
    0       ACTIVE BODY #1 - WEST VAN        ACTIVE BODY   WEST VAN        ACTIVE BODY     WEST VAN
    1     BUY-LOW (80018) - KINGSGATE            BUY-LOW  KINGSGATE            BUY-LOW    KINGSGATE
    2           CHOICES - CAMBIE #906            CHOICES     CAMBIE            CHOICES       CAMBIE
    3        CAUSEWAY MASSET - CAMBIE    CAUSEWAY MASSET     CAMBIE    CAUSEWAY MASSET       CAMBIE
    4  COMMUNITY NATURAL S2 - CHINOOK  COMMUNITY NATURAL    CHINOOK  COMMUNITY NATURAL      CHINOOK
    5   MEINHARDT - GRANVILLE (80068)          MEINHARDT  GRANVILLE          MEINHARDT    GRANVILLE
    6            WFM - CAMBIE - 10248                WFM     CAMBIE                WFM       CAMBIE