Search code examples
pythonpandasstringtext

String manipulation - for each value in a dataframe column, extract a custom string of text comprised between two separators


I have an imported dataframe which includes mainly text values under the column 'full_name' . The values look typically like this: 'Insulation, glass wool, water repellent, kraft paper lining, 2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)'

Now I would like to extract from these certain values (physical properties of building materials) by utilising the measure unit as a search string, for instance 'Km2/W' for the sake of this example. Then I would like the text comprised between two commas separators before and after the search string to be copied in a separate column, where the values can ultimately be converted to numerical.

I asked this question to ChatGPT and it returned the following code: This code splits the text column by commas, removes any leading or trailing whitespace, selects the second column, extracts the substring that contains the search string and any characters after it, and then splits that substring by commas and selects the first part.

# Extract the text between two comma separators
filtered_df['extracted_text'] = filtered_df['text'].str.split(',', expand=True).apply(lambda x: x.str.strip()).iloc[:, 1].str.extract(f'({search_string}.*)')[0].str.split(',').str[0]

However the resulting column - in this example filtered_df['extracted_text'], is full of NaN. What do you think is going wrong here?


Solution

  • The ChatGPT code is much more complex than is necessary. I believe you should be able to achieve the result you're after with something like:

    search_string = "Km2/W"
    filtered_df['extracted_text'] = filtered_df['text'].str.extract(f', ([^,]*{search_string}),')
    

    df.str.extract uses regular expressions, which allow you to search for patterns, such as , ([^,]*{search_string}),', which searches for a comma followed by a space, then captures everything before the next comma if it contains the search term. If you'd like to learn regex, RegexLearn can help.