Tags: python, regex, extract, attributeerror

Python: use .search method to extract everything between 2 words which occur more than once


I have a VHDL file that contains some paragraphs I want to extract. Generally, it looks like this:

Declaration 1.
Some codes.
(Following are paragraphs I want to extract)
case (state) is
    case body 1
end case;

Declaration 2.
Some codes.
(Following are paragraphs I want to extract)
case (state) is
    case body 2
end case;

So "case body 1" and "case body 2" are what I want. It does not matter whether "case (state) is" and "end case;" are included in the match. I have tried several approaches, such as:

import re

f1 = open('/home/liuduo/Desktop/f2.vhd')
data = f1.read()
pattern = re.compile('case (state) is[\s\S]*?end case;')
reg = pattern.search(data).group()

or

pattern=re.compile('(?<=\bcase\b).*?(?=\bend\b)')
reg=pattern.search(data).group() 

or

pattern=re.compile('.*?case(.*?)end.*?')
reg=pattern.search(data).group() 

and many other methods, with the help of many examples on Stack Overflow (thanks, all!). But nothing seems to work.

The error I get is "AttributeError: 'NoneType' object has no attribute 'group'", which shows that nothing matched. I am quite new to Python (3 days...) and have only a weak background in Java, so regular expressions really confuse me. Can anyone help me out with this?

Thank you so much!

P.S. If this has been asked before, I am really sorry; this is my first question on Stack Overflow, after hours of searching for answers. Please help me.


Solution

  • Try

    import re

    # re.DOTALL lets "." match newlines, so the lazy group can span several lines
    pattern = re.compile(r'case \S+ is\s*(.*?)\s*end case', re.DOTALL)
    matches = pattern.findall(data)

    print(matches)
    # ['case body 1', 'case body 2']
    

    Your first regex fails because ( and ) are special characters in a regex (they delimit a group), so they must be escaped with a backslash to match literal parentheses.

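    Your second and third regexes fail because . does not match newlines by default, so .*? cannot span several lines unless you pass re.DOTALL. The second pattern is also written as a normal string rather than a raw string, so Python turns \b into a backspace character instead of a word boundary.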

    The search method returns only the first match (or None when nothing matches, which is where the AttributeError comes from), so I used findall to get a list of all the matches.

    Further explanation on request.
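
To make the points above concrete, here is a minimal sketch that runs the failing and the working patterns side by side. The sample text is an assumption modelled on the layout shown in the question, not the asker's real file, and the variable names are only illustrative.

```python
import re

# Sample text modelled on the question's layout (an assumption, not the real VHDL file).
data = """Declaration 1.
Some codes.
case (state) is
    case body 1
end case;

Declaration 2.
Some codes.
case (state) is
    case body 2
end case;
"""

# 1. Unescaped parentheses form a group, so this looks for the literal text
#    "case state is", which never occurs. Escaping them matches "(state)".
unescaped = re.search(r'case (state) is[\s\S]*?end case;', data)
escaped = re.search(r'case \(state\) is[\s\S]*?end case;', data)

# 2. Without re.DOTALL, "." stops at a newline, so "case" and "end" would have
#    to sit on the same line. With re.DOTALL the lazy group may span lines.
single_line = re.search(r'case(.*?)end', data)
multi_line = re.search(r'case(.*?)end', data, re.DOTALL)

# 3. In a normal string "\b" is a backspace character; in a raw string it is
#    the regex word-boundary assertion.
backspace = re.search('\bcase\b', data)
boundary = re.search(r'\bcase\b', data)

# 4. search() returns the first match or None, so check before calling group().
#    findall() returns every captured body as a list.
pattern = re.compile(r'case \S+ is\s*(.*?)\s*end case', re.DOTALL)
first = pattern.search(data)
first_body = first.group(1) if first else None
bodies = pattern.findall(data)

print(unescaped, single_line, backspace)  # all three are None
print(first_body)
print(bodies)
```

Checking the result of search() before calling group() is one way to avoid the AttributeError from the question; findall() sidesteps it because it simply returns an empty list when nothing matches.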