python regex nlp

Regex to detect the words following Action, Object, Subject, etc. in the middle of a text


I have the following text, and I would like to extract the words that follow the Subject, Action, and Capability labels using regular expressions:

For this text:
T1  Subject num num xxx
T2  Action num num  xxx
A1  Capability T2 xxx

I have written the following regexes, but they don't work correctly:

# Regular expressions for pattern matching
action_pattern = r'^T\d+\tAction \d+ \d+\t(.+)$'
subject_pattern = r'^T\d+\tSubject \d+ \d+;?\d+? \d+\t(.+)$'
object_pattern = r'^T\d+\tObject \d+ \d+;?\d+? \d+\t(.+)$'
capability_pattern = r'^A\d+\tCapability T\d+ (.+)$'

Solution

  • Here is what I've come up with:

    import re
    
    text = """
    For this text:
    T1  Subject 11096 11100 They
    T2  Action 11101 11106  steal
    A1  Capability T2 007:MalwareCapability-data_theft
    T3  Object 11107 11111  data
    R1  SubjAction Subject:T1 Action:T2 
    R2  ActionObj Action:T2 Object:T3   
    T4  Subject 11127 11132;11140 11148 their implants
    T5  Action 11152 11156  send
    A2  Capability T5 006:MalwareCapability-data_exfiltration
    T6  Object 11157 11161  data
    T7  Modifier 11162 11168    out of
    T8  Object 11169 11180  the network
    T9  Modifier 11181 11186    using
    T10 Object 11187 11195;11203 11224  a victim network`s mail server
    """
    strings = text.split('\n')
    
    action_pattern = r'Action\s[\s;\d]+(.*)$'
    subject_pattern = r'Subject\s[\s;\d]+(.*)$'
    object_pattern = r'Object\s[\s;\d:]+(.*)$'
    capability_pattern = r'Capability\s+T[\s\d:]+(.*)$'
    
    def extract(pattern, strings_lst):
        return [re.search(pattern, string).group(1) 
                for string in strings_lst if re.search(pattern, string)]
    
    
    print(extract(action_pattern, strings))
    print(extract(subject_pattern, strings))
    print(extract(object_pattern, strings))
    print(extract(capability_pattern, strings))
    

    Output

    ['steal', 'send']
    ['They', 'their implants']
    ['data', 'data', 'the network', 'a victim network`s mail server']
    ['MalwareCapability-data_theft', 'MalwareCapability-data_exfiltration']
    

    You normally wouldn't want list comprehensions as congested as the one in my function (it calls re.search twice for every line), but I committed this blasphemy for the sake of demonstration and shorter code.
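
For comparison, here is a minimal sketch of a less congested version of the same helper. It keeps the name extract from the code above, compiles the pattern once, and searches each line only once. The sample list is just a shortened copy of the input for illustration.

```python
import re


def extract(pattern, lines):
    """Return group 1 for every line that the pattern matches."""
    compiled = re.compile(pattern)
    results = []
    for line in lines:
        match = compiled.search(line)  # one search per line
        if match:
            results.append(match.group(1))
    return results


# Shortened copy of the input, for illustration only
sample = [
    "T1  Subject 11096 11100 They",
    "T2  Action 11101 11106  steal",
    "T4  Subject 11127 11132;11140 11148 their implants",
]

print(extract(r'Subject\s[\s;\d]+(.*)$', sample))
# ['They', 'their implants']
```

The behaviour should be the same as the list comprehension; the explicit loop is only easier to read and avoids the duplicate search.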

    Edit: Simplified regexes
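
Since the Subject, Action and Object patterns differ only in the label, they could probably be folded into a single pattern that captures the label as well. The sketch below is one possible way to do that, not part of the original answer. The names LABELLED and group_by_label are made up for this example, and the sample is a shortened copy of the input.

```python
import re
from collections import defaultdict

# One pattern for all three labels: group 1 is the label, group 2 the words.
# The names LABELLED and group_by_label are hypothetical, chosen for this sketch.
LABELLED = re.compile(r'\b(Subject|Action|Object)\s[\s;\d]+(.*)$')


def group_by_label(lines):
    """Collect the extracted words under the label they follow."""
    grouped = defaultdict(list)
    for line in lines:
        match = LABELLED.search(line)
        if match:
            label, words = match.groups()
            grouped[label].append(words)
    return dict(grouped)


# Shortened copy of the input, for illustration only
sample = """\
T1  Subject 11096 11100 They
T2  Action 11101 11106  steal
T3  Object 11107 11111  data
R1  SubjAction Subject:T1 Action:T2
T4  Subject 11127 11132;11140 11148 their implants
""".splitlines()

print(group_by_label(sample))
# {'Subject': ['They', 'their implants'], 'Action': ['steal'], 'Object': ['data']}
```

One limitation carries over from the patterns above: the character class [\s;\d]+ is greedy, so if the annotated text itself starts with digits, those digits are consumed along with the offsets and dropped from the result.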