python pandas data-cleaning

Extract text by keyword between delimiters


Please help me strip the unnecessary parts from the text in a column.

I have an example dataset:

df = pd.DataFrame({'addressfrom': ['Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak', 'Rixos Premium', '123 Main St, Hotel Hilton Antalya', 'Residence Hotel & SPA, 1234']})

and a list of keywords:

keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace', 'residence', 'radisson', 'holiday', 'apartments', 'plaza', 'inn', 'club', 'spa']

I'm trying to extract the part of each string that contains a keyword and drop the text that surrounds it. The parts are separated by ',' (in some cases by '-'). In the end, I want the following result:

index addressfrom
0 Rexee Hotel
1 Rixos Premium
2 Hotel Hilton Antalya
3 Residence Hotel & SPA

The best I managed to achieve was this:

import re
import pandas as pd

df = pd.DataFrame({'addressfrom': ['Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak', 'Rixos Premium', '123 Main St, Hotel Hilton Antalya', 'Residence Hotel & SPA, 1234']})

keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace', 'residence', 'radisson', 'holiday', 'apartments', 'plaza', 'inn', 'club', 'spa']

pattern = f'[^,]*({"|".join(keywords)})[^,]*'

df['addressfrom'] = df['addressfrom'].str.extract(pattern, flags=re.IGNORECASE)

print(df)

Output:

index addressfrom
0 Hotel
1 Rixos
2 Hilton
3 SPA

Solution

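  • One way to do this is to match the whole comma-delimited segment that contains a keyword, not just the keyword itself. `str.extract` returns only the contents of the capture group, which is why your attempt yields the bare keyword. With `re.search`, `match.group(0)` returns the entire match, i.e. everything between the surrounding commas. Something like: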

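    import re

    def extract_keywords(s, keywords):
        # Match one comma-delimited segment that contains a keyword as a whole word.
        pattern = f'[^,]*\\b({"|".join(keywords)})\\b[^,]*'
        match = re.search(pattern, s, flags=re.IGNORECASE)
        # group(0) is the full segment; strip() drops the space left after the comma.
        return match.group(0).strip() if match else None

    df['addressfrom'] = df['addressfrom'].apply(lambda x: extract_keywords(x, keywords))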
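
    Since the question mentions that the separator is sometimes '-' rather than ',', here is an alternative sketch that splits each address on the delimiters explicitly and returns the first part that contains a keyword. The names `keyword_re` and `extract_segment` and the `separators` parameter are made up for illustration. It assumes keywords should match as whole words only and that a '-' inside a name never needs to be preserved, which will not hold for hyphenated hotel names.

```python
import re
import pandas as pd

keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace',
            'residence', 'radisson', 'holiday', 'apartments', 'plaza',
            'inn', 'club', 'spa']

# Whole-word, case-insensitive keyword matcher; keywords are escaped in case
# one of them ever contains a regex metacharacter.
keyword_re = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b',
    flags=re.IGNORECASE,
)

def extract_segment(address, separators=',-'):
    """Return the first delimiter-separated part containing a keyword, else None.

    `separators` is a hypothetical parameter: every character in it is
    treated as a delimiter.
    """
    parts = re.split('[' + re.escape(separators) + ']', address)
    for part in parts:
        if keyword_re.search(part):
            return part.strip()
    return None

df = pd.DataFrame({'addressfrom': [
    'Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak',
    'Rixos Premium',
    '123 Main St, Hotel Hilton Antalya',
    'Residence Hotel & SPA, 1234',
]})

df['addressfrom'] = df['addressfrom'].apply(extract_segment)
print(df)
```

    If only commas should act as delimiters, call it as `df['addressfrom'].apply(extract_segment, separators=',')`. Rows with no keyword in any part come back as None.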
    
