||/ software version some_text Description
+++-======================================-===================================================-============-===============================================================================
AA SOFTWARE1 1.1.1.1-UBUNTU GHGFHGFH Description1
AA SOFTWARE2 1.1.1.2-UBUNTU_HGSFHF JGJHGKGK Description2
BB SOFTWARE3 1.2.3.4.5 JHGJHGJG Description3
Above is a sample of the text format, stored in a string. It could have as many as 1000 lines. From these, I need to extract each software name and its corresponding version.
Approach 1: split on newlines, split each line on spaces, and capture the second and third items in a list (not a great approach).
Approach 2: use a regex to match and store them.
I believe the second approach is better.
regex = r".*(AA|BB)\s+(.*)\s+(.*)\s+(.*)\s+(.*)"
matches = re.finditer(regex, test_str, re.MULTILINE)
How can I extract the software and version details from each line and store them in a dictionary or some other format?
If you want 2 capture groups, with the software value (which can contain spaces) in group 1 and the version in group 2, you can use the difference in the number of whitespace chars between the values (assuming the software value never contains 2 or more consecutive whitespace chars, since that is what separates the fields):
^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\S+)
^ : Start of the string (or of each line, with re.M)
(?:AA|BB) : Match either AA or BB in a non-capture group
\s{2,} : Match 2 or more whitespace chars
(\S.*?) : Group 1, capture a single non-whitespace char followed by any chars, as few as possible
\s{2,} : Match 2 or more whitespace chars
(\S+) : Group 2, capture 1+ non-whitespace chars
If you want to create a dictionary with group 1 as the key and group 2 as the value:
import re
pattern = r"^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\S+)"
# Fields are separated by 2 or more spaces; the software name may contain single spaces
s = ("||/  software  version  some_text  Description\n"
"+++-======================================-===================================================-============-===============================================================================\n"
"AA  SOFTWARE1 this is some text  1.1.1.1-UBUNTU  GHGFHGFH  Description1\n"
"AA  SOFTWARE2  1.1.1.2-UBUNTU_HGSFHF  JGJHGKGK  Description2\n"
"BB  SOFTWARE3  1.2.3.4.5  JHGJHGJG  Description3")
dct = dict(re.findall(pattern, s, re.M))
print(dct)
Output
{'SOFTWARE1 this is some text': '1.1.1.1-UBUNTU', 'SOFTWARE2': '1.1.1.2-UBUNTU_HGSFHF', 'SOFTWARE3': '1.2.3.4.5'}
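A dictionary keeps only the last value for a repeated key, so if the same software name could appear on more than one line, a list of records may be a safer container. The following is a minimal sketch of that variant, assuming the fields are separated by 2 or more spaces; the group names `software` and `version` and the variable `records` are illustrative choices, not part of the original pattern.

```python
import re

# Same pattern as above, with named groups (the names are illustrative).
pattern = re.compile(
    r"^(?:AA|BB)\s{2,}(?P<software>\S.*?)\s{2,}(?P<version>\S+)", re.M
)

# Sample rows, assuming 2+ spaces between fields.
s = (
    "AA  SOFTWARE1  1.1.1.1-UBUNTU  GHGFHGFH  Description1\n"
    "AA  SOFTWARE2  1.1.1.2-UBUNTU_HGSFHF  JGJHGKGK  Description2\n"
    "BB  SOFTWARE3  1.2.3.4.5  JHGJHGJG  Description3"
)

# One dict per matching line; duplicate software names are preserved.
records = [m.groupdict() for m in pattern.finditer(s)]

for record in records:
    print(record["software"], record["version"])
```

Each record can then be looked up by field name, for example `records[0]["version"]`.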
You might also make the pattern a bit more specific, matching the example data for the version column:
^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\d+(?:\.\d+)*(?:-\w+)?)
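As a rough illustration of what the stricter version pattern buys you, the sketch below runs it against the sample rows plus one made-up row (`SOFTWARE4`, hypothetical) whose version column is not numeric. It again assumes 2 or more spaces between fields.

```python
import re

# Version must look like digits separated by dots, optionally followed by -suffix.
version_pattern = re.compile(
    r"^(?:AA|BB)\s{2,}(\S.*?)\s{2,}(\d+(?:\.\d+)*(?:-\w+)?)", re.M
)

s = (
    "AA  SOFTWARE1  1.1.1.1-UBUNTU  GHGFHGFH  Description1\n"
    "AA  SOFTWARE2  1.1.1.2-UBUNTU_HGSFHF  JGJHGKGK  Description2\n"
    "BB  SOFTWARE3  1.2.3.4.5  JHGJHGJG  Description3\n"
    # Hypothetical row with a non-numeric version, added for illustration only.
    "BB  SOFTWARE4  unknown  JHGJHGJG  Description4"
)

# Map software name to version; rows without a version-like field are skipped.
versions = dict(version_pattern.findall(s))
print(versions)
```

The hypothetical `SOFTWARE4` row is left out of the result because none of its fields starts with a digit, whereas the looser `(\S+)` group would have captured `unknown` as its version.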