python parsing text multiline

How to grep necessary info from multi-line data using regex?


||/ software                                   version                                          some_text    Description
+++-======================================-===================================================-============-===============================================================================
AA  SOFTWARE1                                   1.1.1.1-UBUNTU                                  GHGFHGFH     Description1
AA  SOFTWARE2                                   1.1.1.2-UBUNTU_HGSFHF                           JGJHGKGK     Description2
BB  SOFTWARE3                                   1.2.3.4.5                                       JHGJHGJG     Description3

Above is a sample of the text, which is stored in a string and could run to as many as 1000 lines. From each line I need to extract the software name and its corresponding version.

Approach 1: split on newlines, split each line on whitespace, and collect the second and third items in a list (not a great approach).

Approach 2: use a compiled regex to capture and store them.

I believe the second approach is better.

regex = r".*(AA|BB)\s+(.*)\s+(.*)\s+(.*)\s+(.*)"
matches = re.finditer(regex, test_str, re.MULTILINE)

How can I extract the software and version details from each line and store them in a dictionary or some other format?


Solution

  • If you want two capture groups, with the software name (which may contain spaces) in group 1 and the version in group 2, you can rely on the difference in the amount of whitespace: single spaces inside a value, two or more between columns. This assumes the software name never contains two or more consecutive whitespace chars.

    ^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\S+)
    
    • ^ Start of the string (or of each line when used with re.M)
    • (?:AA|BB) Match either AA or BB in a non capture group
    • \s{2,} Match 2 or more whitespace chars
    • (\S.*?) Group 1, capture a single non-whitespace char followed by any chars, as few as possible
    • \s{2,} Match 2 or more whitespace chars
    • (\S+) group 2, capture 1+ non whitespace chars
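
    The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch: the group names software and version and the shortened sample line are illustrative assumptions, not part of the original pattern.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part carries a comment.
# The group names "software" and "version" are illustrative assumptions.
pattern = re.compile(
    r"""
    ^(?:AA|BB)           # line starts with the flag AA or BB
    \s{2,}               # column gap: two or more whitespace chars
    (?P<software>\S.*?)  # software name, may contain single spaces
    \s{2,}               # column gap
    (?P<version>\S+)     # version: a run of non-whitespace chars
    """,
    re.MULTILINE | re.VERBOSE,
)

# Shortened sample line (column widths reduced for readability).
line = "AA  SOFTWARE1 this is some text      1.1.1.1-UBUNTU      GHGFHGFH     Description1"
m = pattern.search(line)
result = (m.group("software"), m.group("version"))
print(result)
```

    Named groups make later lookups read as m.group("version") rather than m.group(2), which may help if more columns are captured later.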


    If you want to create a dictionary with group 1 as the key and group 2 as the value:

    import re
    
    pattern = r"^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\S+)"
    
    s = ("||/ software                                   version                                          some_text    Description\n"
                "+++-======================================-===================================================-============-===============================================================================\n"
                "AA  SOFTWARE1 this is some text                                   1.1.1.1-UBUNTU                                  GHGFHGFH     Description1\n"
                "AA  SOFTWARE2                                   1.1.1.2-UBUNTU_HGSFHF                           JGJHGKGK     Description2\n"
                "BB  SOFTWARE3                                   1.2.3.4.5                                       JHGJHGJG     Description3")
    
    
    dct = dict(re.findall(pattern, s, re.M))
    print(dct)
    

    Output

    {'SOFTWARE1 this is some text': '1.1.1.1-UBUNTU', 'SOFTWARE2': '1.1.1.2-UBUNTU_HGSFHF', 'SOFTWARE3': '1.2.3.4.5'}
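
    One caveat with dict(): if the same software name appeared on two lines, the later version would silently replace the earlier one. Assuming duplicates are possible in your data, a sketch that keeps every row as a tuple might look like this (parse_packages and the two-line sample are hypothetical names and data, for illustration only):

```python
import re

pattern = re.compile(r"^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\S+)", re.M)

def parse_packages(text):
    """Return one (software, version) tuple per matching line, in input order."""
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

# Hypothetical input where the same software name occurs twice.
text = (
    "AA  SOFTWARE1     1.1.1.1-UBUNTU     GHGFHGFH     Description1\n"
    "BB  SOFTWARE1     1.2.3.4.5     JHGJHGJG     Description3"
)
rows = parse_packages(text)
print(rows)
```

    A list of tuples preserves both rows, whereas dict(rows) would keep only the last version seen for SOFTWARE1.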
    

    You might also make the pattern a bit more specific, matching the example data for the version column:

    ^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\d+(?:\.\d+)*(?:-\w+)?)
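
    To see what the stricter version part changes, here is a small sketch that runs both patterns over shortened sample lines. The SOFTWARE4 row with the value "unknown" is a made-up line added purely to show the difference; it is not from the question's data.

```python
import re

loose = re.compile(r"^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\S+)", re.M)
strict = re.compile(r"^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\d+(?:\.\d+)*(?:-\w+)?)", re.M)

sample = "\n".join([
    "||/ software      version      some_text    Description",
    "AA  SOFTWARE1     1.1.1.1-UBUNTU     GHGFHGFH     Description1",
    "AA  SOFTWARE2     1.1.1.2-UBUNTU_HGSFHF     JGJHGKGK     Description2",
    "BB  SOFTWARE3     1.2.3.4.5     JHGJHGJG     Description3",
    # Hypothetical row whose version column is not a dotted number.
    "BB  SOFTWARE4     unknown     JHGJHGJG     Description4",
])

loose_versions = dict(loose.findall(sample))
strict_versions = dict(strict.findall(sample))

print(loose_versions)
print(strict_versions)
```

    The loose pattern accepts any non-whitespace run as a version, so it includes SOFTWARE4; the strict one skips that line because its second column does not start with digits.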
    
