I'm extracting data from an API, and one of the fields is a string from which I want to extract multiple substrings (ideally 7). To get those substrings I'm using the index() method.
string = r"""[Summary]
Reason: Not enough information
Improvements_Done: None
Improvements_Planned: Documentation
References_Improvements_Done: None
References_Improvements_Done: None
References_Improvements_Planned: www.link1.com
References_Improvements_Planned: www.link2.com
*** DEFAULT.....""".replace("\n", "\r\n")
For example:
imp_done_start = string.index('Improvements_Done: ') + len('Improvements_Done: ')
imp_done_end = string.index('Improvements_Planned')
imp_done = string[imp_done_start:imp_done_end]
There could be cases where one or more of these substrings (Reason, Improvements_Done, Improvements_Planned, etc.) are missing from the string. For example, if "Improvements_Planned" is missing, then index() raises a ValueError and I can't get the value for imp_done.
What is the best practice for handling these kinds of cases?
The best practice depends largely on the format. In most cases, though, you can take a flexible approach and first convert the text into an intermediate representation that is easier to parse and analyze:
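import re

def parse(s: str) -> dict[str, str]:
    d = {}
    lines = s.splitlines()
    # Skip the "[Summary]" header and the trailing "*** DEFAULT....." line.
    for line in lines[1:-1]:
        # Split each "Key: value" line at the first ": ".
        m = re.match(r"^(.*?): (.*)$", line)
        if m is None:
            continue
        # A repeated key overwrites the earlier value.
        d[m.group(1)] = m.group(2)
    return d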
Usage:
>>> from pprint import pprint
>>> pprint(parse(string))
{'Improvements_Done': 'None',
 'Improvements_Planned': 'Documentation',
 'Reason': 'Not enough information',
 'References_Improvements_Done': 'None',
 'References_Improvements_Planned': 'www.link2.com'}
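You can then analyse the result further with whatever rules you need. A missing field is simply an absent key, so parse(string).get('Improvements_Planned') returns None instead of raising an error. Note that a repeated key (here References_Improvements_Planned) keeps only its last value.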
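If the repeated fields matter, one possible follow-up rule is to collect every value per key into a list. The sketch below is only an illustration under stated assumptions: parse_multi, FIELD_RE and sample are hypothetical names that are not part of the answer above, and the pattern assumes keys consist of letters, digits and underscores only. It also shows a missing field being handled with a default rather than an exception.

```python
import re
from collections import defaultdict

# Hypothetical helper: like parse(), but keeps every value of a repeated key.
# Assumption: keys contain only word characters (letters, digits, underscore).
FIELD_RE = re.compile(r"^(\w+): (.*)$")

def parse_multi(s: str) -> dict[str, list[str]]:
    fields = defaultdict(list)
    for line in s.splitlines():
        m = FIELD_RE.match(line)
        if m is None:
            continue  # skips "[Summary]", "*** DEFAULT....." and blank lines
        fields[m.group(1)].append(m.group(2))
    return dict(fields)

# Shortened sample in which Improvements_Planned is missing on purpose.
sample = (
    "[Summary]\r\n"
    "Reason: Not enough information\r\n"
    "Improvements_Done: None\r\n"
    "References_Improvements_Planned: www.link1.com\r\n"
    "References_Improvements_Planned: www.link2.com\r\n"
    "*** DEFAULT....."
)

fields = parse_multi(sample)
# A missing field falls back to an empty list instead of raising.
imp_planned = fields.get("Improvements_Planned", [])
refs_planned = fields.get("References_Improvements_Planned", [])
print(imp_planned)
print(refs_planned)
```

Because the pattern only accepts lines that look like "Key: value", this variant does not need to slice off the first and last lines; whether that is preferable depends on how strict the real format is.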