Search code examples
python-3.xpandasregexapplyregex-lookarounds

I want to extract a words after a fix combination of words and special character also need a count of extracted words in particular cell


I have one data frame df which has one column. I want to extract a words after a fix combination of words & special character, also need a count of extracted words in that particular cell.

For example: (Most Recent Alarm Trigger)','valueString':'Tilt Sensor',(Most Recent Alarm Trigger)','valueString':'Hello world',(Most Recent Alarm Trigger)','valueString':'ABC',

Now from above line I want to extract any words in between commas after "(Most Recent Alarm Trigger)','valueString':"

So, in that case I want only 'Tilt Sensor' and its count in that particular cell.

I don't need 'Hello world' or 'ABC' as it comes 2nd or 3rd. Basically I want first search words.

Below is my df:-

import re
import pandas as pd
import numpy as np



data = {'product_name': ["[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Tilt Sensor','packetType':'enumerated','leastSigBit':440,,'Tilt Sensor','mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)', (Most Recent Alarm Trigger)','valueString':'Band','valueNumber':2022.0,'units':'Undefined / Not Used','packetType':'Tilt Sensor','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month,(Most Recent Alarm Trigger)','valueString':'Back space',{'name':'User Set Minute (Most Recent Alarm Trigger)','valueNumber':16.0,'units':'min','packetType':'unsigned','leastSigBit':400,'mostSigBit':407},'Tilt Sensor',{'name':'User Set Second (Most Recent Alarm Trigger)','valueNumber':36.0,'units':'s','packetType':'unsigned','leastSigBit':392,'mostSigBit':399}]",
                         "[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Volumetric Sensor','packetType':'enumerated','leastSigBit':440,'mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)','valueNumber':2022.0,'units':'(Most Recent Alarm Trigger)','valueString':'Being human','packetType':'unsigned','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month (Most Recent Alarm Trigger)','valueNumber':6.0,'(Most Recent Alarm Trigger)','valueString':'Hello'':'Month','Volumetric Sensor','packetType':'unsigned','leastSigBit':424,'mostSigBit':431},{'name':'User Set Day (Most Recent Alarm ]"]}
df = pd.DataFrame(data)
df

I tried regex or apply method but not getting what I want.

Below are some code which I have tried,

df["Extract"] = df["product_name"].apply(lambda st: st[st.find("(Most Recent Alarm Trigger)','valueString':")+1:st.find(",")])

df['Title'] = df.product_name.str.extract(r'"(Most Recent Alarm Trigger)','valueString':'"\s*([^\.]*)\s*\.', expand=False)

Below is my expected result:

data = {'product_name': ["[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Tilt Sensor','packetType':'enumerated','leastSigBit':440,,'Tilt Sensor','mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)', (Most Recent Alarm Trigger)','valueString':'Band','valueNumber':2022.0,'units':'Undefined / Not Used','packetType':'Tilt Sensor','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month,(Most Recent Alarm Trigger)','valueString':'Back space',{'name':'User Set Minute (Most Recent Alarm Trigger)','valueNumber':16.0,'units':'min','packetType':'unsigned','leastSigBit':400,'mostSigBit':407},'Tilt Sensor',{'name':'User Set Second (Most Recent Alarm Trigger)','valueNumber':36.0,'units':'s','packetType':'unsigned','leastSigBit':392,'mostSigBit':399}]",
                             "[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Volumetric Sensor','packetType':'enumerated','leastSigBit':440,'mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)','valueNumber':2022.0,'units':'(Most Recent Alarm Trigger)','valueString':'Being human','packetType':'unsigned','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month (Most Recent Alarm Trigger)','valueNumber':6.0,'(Most Recent Alarm Trigger)','valueString':'Hello'':'Month','Volumetric Sensor','packetType':'unsigned','leastSigBit':424,'mostSigBit':431},{'name':'User Set Day (Most Recent Alarm ]"],
       'Extarct': ['Tilt Sensor','Volumetric Sensor'],'Count': [4,2]}
df = pd.DataFrame(data)
df

Solution

  • One solution could be as follows:

    • Use Series.str.extract to get the first match between \'valueString\':\' and \',.
    • Next, use df.apply with a lambda function for each row (axis=1) to get a count for each value now stored in df.Extract inside the appropriate product_name string.
    import pandas as pd
    
    # also adding the string from your comment
    data = {'product_name': ["[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Tilt Sensor','packetType':'enumerated','leastSigBit':440,,'Tilt Sensor','mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)', (Most Recent Alarm Trigger)','valueString':'Band','valueNumber':2022.0,'units':'Undefined / Not Used','packetType':'Tilt Sensor','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month,(Most Recent Alarm Trigger)','valueString':'Back space',{'name':'User Set Minute (Most Recent Alarm Trigger)','valueNumber':16.0,'units':'min','packetType':'unsigned','leastSigBit':400,'mostSigBit':407},'Tilt Sensor',{'name':'User Set Second (Most Recent Alarm Trigger)','valueNumber':36.0,'units':'s','packetType':'unsigned','leastSigBit':392,'mostSigBit':399}]",
                             "[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Volumetric Sensor','packetType':'enumerated','leastSigBit':440,'mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)','valueNumber':2022.0,'units':'(Most Recent Alarm Trigger)','valueString':'Being human','packetType':'unsigned','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month (Most Recent Alarm Trigger)','valueNumber':6.0,'(Most Recent Alarm Trigger)','valueString':'Hello'':'Month','Volumetric Sensor','packetType':'unsigned','leastSigBit':424,'mostSigBit':431},{'name':'User Set Day (Most Recent Alarm ]",
                             "[{'name':'Power Mode Quality Factor','valueString':'Power Mode Undefined','valueString':'Finally',Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'No Trigger (Event Store Empty)',}]"]}
    df = pd.DataFrame(data)
    
    df['Extract'] = df.product_name.str.extract(
        r'\(Most Recent Alarm Trigger\)\',\'valueString\':\'(.*?)\',')
    # N.B. We're using the question mark to make the search for '.*' lazy
    
    df['Count'] = df.apply(lambda row: row.product_name.count(row.Extract), axis=1)
    
    print(df.iloc[:,1:])
    
                              Extract  Count
    0                     Tilt Sensor      4
    1               Volumetric Sensor      2
    2  No Trigger (Event Store Empty)      1
    

    N.B. If it is possible for str.extract to find no match, you'll end up with NaN values in df.Extract. If so, this will cause an error for df.apply(lambda row: row.product_name.count(row.Extract), axis=1) (since it is expecting a string). To avoid this, you could use:

    df['Count'] = df.apply(lambda row: row.product_name.count(row.Extract) 
                           if isinstance(row.Extract,str) else 0, axis=1)