python text-manipulation

Extract a certain string which can appear several times in a file


I have a text file that I want to read, extracting a certain string from it (the string can appear several times). Then I want to print the result.

The string I'm trying to extract is the value of Rule MATCH Name.

Text file example:

201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 
201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin SCORE: 140 TYPE: EXE  AutoUpdates https://www.test.com/files:  **Rule MATCH Name**: this_is_test1 SUBSCORE:100
201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 
201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/7164.bin SCORE: 140 TYPE: EXE  AutoUpdates https://www.test.com/files:  **Rule MATCH Name**: this_is_test2 SUBSCORE:90 
201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 
201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/764.bin SCORE: 140 TYPE: EXE  AutoUpdates https://www.test.com/files:  **Rule MATCH Name**: this_is_test3 SUBSCORE:15

Solution

  • You can solve this with a regular expression. Regexr is a handy website for building and testing regex patterns.
    Once you have a pattern that fits your problem, open the file, read it line by line, and use Python's re module to extract the values.

    Here is a quick solution (I'm not sure this is the value you are trying to extract):

    import re
    fl = r'201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: Rule MATCH Name: this_is_test1 SUBSCORE:100 201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/7164.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: Rule MATCH Name: this_is_test2 SUBSCORE:90 201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/764.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: Rule MATCH Name: this_is_test3 SUBSCORE:15'
    
    # findall returns the captured group (the word after the label) for every match
    print(re.findall(r'Rule MATCH Name:\s(\w+)\s', fl))
    # ['this_is_test1', 'this_is_test2', 'this_is_test3']
    

    If reading from a file:

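    import re

    found = []
    with open('f.txt') as f:
        for line in f:  # iterate over the file directly; readlines() is not needed
            # collect every value that follows "Rule MATCH Name:" on this line
            found.extend(re.findall(r'Rule MATCH Name:\s(\w+)\s', line))
    print(found)  # ['this_is_test1', 'this_is_test2', 'this_is_test3']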
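
The pattern above expects the label to read exactly "Rule MATCH Name:" and the value to be followed by whitespace. If the `**` markers from the question's example are literally present in the file, or if a value sits at the very end of the last line, it would miss those matches. Below is a minimal sketch of a more tolerant variant. The function name `extract_rule_names`, the constant `RULE_NAME`, and the shortened second sample line are made up for illustration; they are not part of the original answer.

```python
import re

# Assumption: the label may or may not be wrapped in literal "**".
# No trailing \s is required, so a value at the end of a line still matches.
RULE_NAME = re.compile(r"\*{0,2}Rule MATCH Name\*{0,2}:\s*(\w+)")


def extract_rule_names(lines):
    """Return every Rule MATCH Name value found in an iterable of lines."""
    names = []
    for line in lines:
        names.extend(RULE_NAME.findall(line))
    return names


sample = [
    "201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test",
    "201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin "
    "SCORE: 140 TYPE: EXE  AutoUpdates https://www.test.com/files:  "
    "**Rule MATCH Name**: this_is_test1 SUBSCORE:100",
    # Hypothetical shortened line: no "**" and the value ends the line.
    "Rule MATCH Name: this_is_test2",
]

print(extract_rule_names(sample))  # ['this_is_test1', 'this_is_test2']
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, for example `with open('f.txt') as f: print(extract_rule_names(f))`.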