python pandas substring nltk tokenize

Python: Get the number in a DataFrame string column that appears between two specific words


I would like to extract, as an integer, the number for the total cost that appears between "total" and "USD". If a message has no such number, the result should be 0.

Example DataFrame:

    id   name    lastname   message 
0   1   John    Doe        John have 100 USD, so he buy 5 eggs which total cost 10 USD
1   2   Mar     Aye        Mar have 10 USD, he just buy a banana from another shop for 16 USD

So the final result should be:

    id   name    lastname   message                                                             total
0   1   John    Doe        John have 100 USD, so he buy 5 eggs which total cost 10 USD         10
1   2   Mar     Aye        Mar have 10 USD, he just buy a banana from another shop for 16 USD  0

Solution

  • You can use a regex to capture any number that appears between "total" and "USD".

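    The code below captures the first number that follows "total" and precedes "USD" (if there are several, only the first is taken) and converts it to int. Rows without a match get 0. The pattern would need a small adjustment to accept floats, but since the result should be an int that is not needed here.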

    # raw string so \d is not treated as a string escape; expand=False returns a Series
    df['total'] = df['message'].str.extract(r'total.*?(\d+).*?USD', expand=False).fillna(0).astype(int)
    

    Result:

        id   name    lastname   message                                                             total
    0   1   John    Doe        John have 100 USD, so he buy 5 eggs which total cost 10 USD         10
    1   2   Mar     Aye        Mar have 10 USD, he just buy a banana from another shop for 16 USD  0
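
A self-contained sketch of the approach above, rebuilding the example DataFrame from the question so it can be run as is. Two things here are assumptions rather than part of the original answer: the conversion goes through `pd.to_numeric` instead of `fillna(0)` on the extracted strings, and `float_pattern` with its sample `prices` Series is a hypothetical illustration of the float adjustment the answer mentions.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mar"],
    "lastname": ["Doe", "Aye"],
    "message": [
        "John have 100 USD, so he buy 5 eggs which total cost 10 USD",
        "Mar have 10 USD, he just buy a banana from another shop for 16 USD",
    ],
})

# Same pattern as in the answer, written as a raw string.
int_pattern = r"total.*?(\d+).*?USD"

# expand=False returns a Series because the pattern has a single group.
extracted = df["message"].str.extract(int_pattern, expand=False)

# Assumption: convert with pd.to_numeric first, then fill missing matches with 0.
df["total"] = pd.to_numeric(extracted).fillna(0).astype(int)

# Hypothetical float variant: allow an optional decimal part after the digits.
float_pattern = r"total.*?(\d+(?:\.\d+)?).*?USD"
prices = pd.Series(["total cost 10.50 USD", "no amount here"])  # made-up sample
total_float = pd.to_numeric(
    prices.str.extract(float_pattern, expand=False)
).fillna(0.0)

print(df[["id", "total"]])
print(total_float)
```

The second message contains "16 USD" but no "total" before it, so the pattern does not match and that row falls back to 0.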