
Trying to find all occurrences of a substring within a string, and also keep n characters afterwards in Python Pandas Dataframe


For a dataframe, I am trying to extract every occurrence of "cash" together with the n characters that follow it (which contain the cash amount). I have tried JSON parsing and regex, but neither worked because the strings in this dataframe are quite inconsistent.

So for example,

import pandas as pd

sample = pd.DataFrame({'LongString': [
    "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526",
]})

The resulting dataframe should then look like:

sample_resolved = pd.DataFrame({
    'LongString': [
        "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
        "other words that are wrong cash : 11234 and more words cash 1526",
    ],
    'cash_string': ["cash  15906810 cash : 69105060", "cash : 11234 cash 1526"],
})

The rows do not follow a consistent format. The goal is to create a new column that holds every instance of "cash" together with, say, the 8-10 characters that follow it.

Ideally, this would be a single line along the lines of

df['cash_string'] = df['LongString'].str.findall('cash') 

(but also includes the n characters after each 'cash' instance)

Thank you!


Solution

  • In general, if there isn't a dataframe method (or combination thereof) that does what you're after, you can write a function that works on a single value and then apply it to the column with Series.apply(some_func).

    So, a function that returns the characters following the first occurrence of the substring:

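    def str_after_substr(s, substr='cash', offset=5):
        # Position of the first occurrence; str.find returns -1 when the
        # substring is absent (str.index would raise ValueError instead)
        i = s.find(substr)
        if i == -1:
            return ''
        start = i + len(substr)
        # The `offset` characters immediately after the substring
        return s[start:start + offset]

    # test
    str_after_substr('moneymoneycashmoneyhoney')  # returns 'money'

    # create the new column values and add them to the df
    df['new_column'] = df['old_column'].apply(str_after_substr)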
    

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
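
The function above only looks at the first occurrence of the substring. Since the question asks for every occurrence together with the substring itself, here is a minimal sketch of one way to extend the same idea. It is not part of the original answer: the helper name `all_after_substr`, the parameter `n`, and the column names `cash_list` and `cash_string` are made up for illustration, and it swaps the `str.find` approach for a regular expression of the form `cash.{0,n}`. `n=12` is chosen only so that the longest amount in the sample rows is not cut off.

```python
import re

import pandas as pd


def all_after_substr(s, substr="cash", n=12):
    """Return every occurrence of `substr` plus up to `n` following characters."""
    # re.escape guards against regex metacharacters in `substr`;
    # DOTALL lets "." also match newlines inside the string.
    pattern = re.escape(substr) + ".{0,%d}" % n
    return re.findall(pattern, s, flags=re.DOTALL)


sample = pd.DataFrame({"LongString": [
    "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526",
    "a row without the keyword",
]})

# Option 1: apply the helper row by row; each cell becomes a list of matches.
sample["cash_list"] = sample["LongString"].apply(all_after_substr)

# Option 2: the vectorised string accessor, joined into a single string.
sample["cash_string"] = (
    sample["LongString"]
    .str.findall("cash.{0,12}", flags=re.DOTALL)
    .str.join(" ")
)

print(sample[["cash_list", "cash_string"]])
```

Two caveats with this sketch: the matches do not overlap, so a second "cash" that falls within the n characters captured after an earlier one is not reported separately, and rows without the keyword yield an empty list (or an empty string after the join) rather than an error.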