I have the following text, and I would like to extract the words that follow the Subject, Action, and Capability labels using regular expressions:
T1 Subject num num xxx
T2 Action num num xxx
A1 Capability T2 xxx
I wrote the following regexes, but they don't match correctly:
# Regular expressions for pattern matching
action_pattern = r'^T\d+\tAction \d+ \d+\t(.+)$'
subject_pattern = r'^T\d+\tSubject \d+ \d+;?\d+? \d+\t(.+)$'
object_pattern = r'^T\d+\tObject \d+ \d+;?\d+? \d+\t(.+)$'
capability_pattern = r'^A\d+\tCapability T\d+ (.+)$'
Here is what I've come up with:
import re

text = """
T1 Subject 11096 11100 They
T2 Action 11101 11106 steal
A1 Capability T2 007:MalwareCapability-data_theft
T3 Object 11107 11111 data
R1 SubjAction Subject:T1 Action:T2
R2 ActionObj Action:T2 Object:T3
T4 Subject 11127 11132;11140 11148 their implants
T5 Action 11152 11156 send
A2 Capability T5 006:MalwareCapability-data_exfiltration
T6 Object 11157 11161 data
T7 Modifier 11162 11168 out of
T8 Object 11169 11180 the network
T9 Modifier 11181 11186 using
T10 Object 11187 11195;11203 11224 a victim network`s mail server
"""
strings = text.split('\n')
action_pattern = r'Action\s[\s;\d]+(.*)$'
subject_pattern = r'Subject\s[\s;\d]+(.*)$'
object_pattern = r'Object\s[\s;\d:]+(.*)$'
capability_pattern = r'Capability\s+T[\s\d:]+(.*)$'
def extract(pattern, strings_lst):
    return [re.search(pattern, string).group(1)
            for string in strings_lst if re.search(pattern, string)]
print(extract(action_pattern, strings))
print(extract(subject_pattern, strings))
print(extract(object_pattern, strings))
print(extract(capability_pattern, strings))
Output:
['steal', 'send']
['They', 'their implants']
['data', 'data', 'the network', 'a victim network`s mail server']
['MalwareCapability-data_theft', 'MalwareCapability-data_exfiltration']
Normally you wouldn't want a list comprehension as dense as the one in extract (it also calls re.search twice for every matching line), but I committed this blasphemy for the sake of a short demonstration.
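For comparison, a less dense version of the helper might look like the sketch below. The names extract_verbose, ACTION, SUBJECT and sample are made up for this illustration and are not part of the code above. It assumes the same line format as the sample text, compiles each pattern once, and searches each line only once.

```python
import re

# Same idea as the patterns above: skip the label and its offsets,
# then capture whatever text remains on the line.
ACTION = re.compile(r'Action\s[\s;\d]+(.*)$')
SUBJECT = re.compile(r'Subject\s[\s;\d]+(.*)$')


def extract_verbose(pattern, lines):
    """Return the first capture group of every line that matches `pattern`."""
    results = []
    for line in lines:
        match = pattern.search(line)  # search once per line
        if match:
            results.append(match.group(1))
    return results


# A few lines taken from the sample text (hypothetical subset for the demo).
sample = [
    "T2 Action 11101 11106 steal",
    "R1 SubjAction Subject:T1 Action:T2",
    "T4 Subject 11127 11132;11140 11148 their implants",
    "T5 Action 11152 11156 send",
]

print(extract_verbose(ACTION, sample))   # ['steal', 'send']
print(extract_verbose(SUBJECT, sample))  # ['their implants']
```

The relation line (R1 ...) is skipped because its labels are followed by a colon rather than whitespace and digits. On Python 3.8+, an assignment expression inside the comprehension would be another way to avoid the double search.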
Edit: Simplified the regexes.