Assuming a VCD file with a structure like the one that follows as a minimum example:
#0 <--- section
b10000011#
0$
1%
0&
1'
0(
0)
#2211 <--- section
0'
#2296 <--- section
b0#
1$
#2302 <--- section
0$
I want to split the whole thing into timestamp sections and search each of them for certain values. That is, first isolate the section between the #0 and #2211 timestamps, then the section between #2211 and #2296, and so on.
I am trying to do this in Python as follows:
import re

search_space = """
#0
b10000011#
0$
1%
0&
1'
0(
0)
#2211
0'
#2296
b0#
1$
#2302
0$"""

# the "delimiter"
timestamp_regex = r"\#[0-9]+(.*)\#[0-9]+"
for match in re.finditer(timestamp_regex, search_space, flags=re.DOTALL|re.MULTILINE):
    print(match.groups())
But it does not split the text into the sections I expect. What is the proper way to handle such a scenario with the re package?
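You need to use a lazy quantifier (?) here. With the greedy .* the pattern runs from the first timestamp all the way to the last one, so the whole input ends up as a single match.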
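As a quick check of that point, here is a minimal sketch comparing the two behaviours on a shortened input. The `sample` string and the variable names are made up for illustration; only the two patterns come from the question and from this answer.

```python
import re

# Shortened, made-up input with three timestamps (illustration only).
sample = "#0\n1%\n#5\n0%\n#9\n1$"

greedy = r"#[0-9]+(.*)#[0-9]+"         # pattern from the question
lazy = r"(#[0-9]+)(.+?)(?=#[0-9]+|\Z)"  # pattern used in this answer

greedy_matches = list(re.finditer(greedy, sample, flags=re.DOTALL))
lazy_matches = list(re.finditer(lazy, sample, flags=re.DOTALL))

# The greedy pattern swallows everything between the first and last timestamp.
greedy_count = len(greedy_matches)
# The lazy pattern stops before each following timestamp.
lazy_stamps = [m.group(1) for m in lazy_matches]

print(greedy_count)  # 1
print(lazy_stamps)   # ['#0', '#5', '#9']
```

The greedy version also consumes the closing timestamp, which is why the lazy version moves it into a lookahead: the next match can then start at that same timestamp.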
I made a few small changes:
timestamp_regex = r"(\#[0-9]+)(.+?)(?=\#[0-9]+|\Z)"
for match in re.finditer(timestamp_regex, search_space, flags=re.DOTALL|re.MULTILINE):
    print(f"section: {match.group(1)}\nchunk:{match.group(2)}\n----")
output:
section: #0
chunk:
b10000011#
0$
1%
0&
1'
0(
0)
----
section: #2211
chunk:
0'
----
section: #2296
chunk:
b0#
1$
----
section: #2302
chunk:
0$
----
Check the pattern at Regex101
Details:
(\#[0-9]+) - 1st capturing group consisting of # and one or more digits.
(.+?) - 2nd capturing group: match anything one or more times, non-greedy (match as little as possible).
(?=\#[0-9]+|\Z) - Positive lookahead on \#[0-9]+ OR \Z, which is the end of your input string (the 2nd capturing group is followed by either another section or the end of the string). The end of string is needed here because the last section has only its chunk and no following #[0-9]+, so the chunk is followed by the end of the string.
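To then search every section for certain values, the matches can be collected into a dictionary first. The following is only a sketch under some assumptions: `split_sections` and `SECTION_RE` are hypothetical names, and the pattern is a variant of the one above that anchors timestamps to whole lines with `^` and `$` (under `re.MULTILINE`), so that a `#` used as an identifier, as in `b10000011#`, can never be taken for a delimiter.

```python
import re

# Hypothetical helper, not part of the original answer.
# Variant pattern: a timestamp must occupy a whole line.
SECTION_RE = re.compile(
    r"^#([0-9]+)$(.*?)(?=^#[0-9]+$|\Z)",
    flags=re.DOTALL | re.MULTILINE,
)

def split_sections(vcd_text):
    """Map each integer timestamp to the value-change lines that follow it."""
    sections = {}
    for match in SECTION_RE.finditer(vcd_text):
        timestamp = int(match.group(1))
        body = match.group(2)
        # Keep non-empty lines only; each one is a single value change.
        sections[timestamp] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return sections

search_space = """
#0
b10000011#
0$
1%
0&
1'
0(
0)
#2211
0'
#2296
b0#
1$
#2302
0$"""

sections = split_sections(search_space)

# Example search: at which timestamps does signal '$' change to 1?
hits = [t for t, changes in sections.items() if "1$" in changes]
print(hits)  # [2296]
```

Keeping the timestamps as integers makes later range queries straightforward, for example selecting every section with a timestamp below 2300.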