
Get list of struct names that are outside of 'package' and 'endpackage' optional strings


I am trying to get the names of structs that are outside the optional package ... endpackage block. If the input contains no package and endpackage strings, the script should return all the struct names.

This is my script:

import re

a = """
package new;

typedef struct packed
{
    logic a;
    logic b;
} abc_y;

typedef struct packed
{
    logic a;
    logic b;
} abc_t;

endpackage

typedef struct packed
{
    logic a;
    logic b;
} abc_x;

"""

print(re.findall(r'(?!package)*.*?typedef\s+struct\s+packed\s*{.*?}\s*(\w+);.*?(?!endpackage)*', a, re.MULTILINE|re.DOTALL))

This is the output:

['abc_y', 'abc_t', 'abc_x']

Expected output:

['abc_x']

I am missing something in the regex, but I can't figure out what. Can someone please help me fix this? Thanks in advance.


Solution

  • Use

    \bpackage.*?\bendpackage\b|typedef\s+struct\s+packed\s*{[^{}]*}\s*(\w+);
    

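    The first alternative consumes each package ... endpackage block without capturing anything, so only the structs outside those blocks reach the capture group.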

    EXPLANATION

    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      package                  'package'
    --------------------------------------------------------------------------------
      .*?                      any character, including \n because the
                               Python code uses re.DOTALL (0 or more
                               times (matching the least amount possible))
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      endpackage               'endpackage'
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      |                        OR
    --------------------------------------------------------------------------------
      typedef                  'typedef'
    --------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      struct                   'struct'
    --------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      packed                   'packed'
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      {                        '{'
    --------------------------------------------------------------------------------
      [^{}]*                   any character except: '{', '}' (0 or more
                               times (matching the most amount possible))
    --------------------------------------------------------------------------------
      }                        '}'
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                                 more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      ;                        ';'
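
The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of the same regex, not part of the original answer: the names STRUCT_OUTSIDE_PACKAGE, sample and raw are made up for illustration, and the braces are escaped here purely to be explicit.

```python
import re

# Same two-branch pattern as above, written with re.VERBOSE so that each
# part can carry a comment. re.DOTALL lets .*? run across line breaks.
STRUCT_OUTSIDE_PACKAGE = re.compile(r"""
    \bpackage .*? \bendpackage\b        # branch 1: swallow a whole package block
    |                                   # OR
    typedef \s+ struct \s+ packed \s*   # branch 2: a packed struct declaration
    \{ [^{}]* \}                        # struct body without nested braces
    \s* (\w+) ;                         # group 1: the struct name
""", re.DOTALL | re.VERBOSE)

# Hypothetical input: one struct inside a package, one outside.
sample = (
    "package p;\n"
    "typedef struct packed { logic a; } in_t;\n"
    "endpackage\n"
    "typedef struct packed { logic b; } out_t;\n"
)

# When branch 1 matches, group 1 does not participate, so findall reports ''.
raw = STRUCT_OUTSIDE_PACKAGE.findall(sample)
print(raw)  # ['', 'out_t']
```

The empty string in the result is the package block that was matched and skipped, which is why the empty entries have to be filtered out afterwards.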
    

    Python code:

    # findall returns '' for each skipped package block; filter(None, ...) drops those
    print(list(filter(None, re.findall(r'\bpackage.*?\bendpackage\b|typedef\s+struct\s+packed\s*{[^{}]*}\s*(\w+);', a, re.DOTALL))))
    

    Results: ['abc_x']
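
For reuse, the one-liner can be wrapped in a small function. The following is a minimal sketch under the same assumptions as the answer (no nested braces in the struct body, every package closed by an endpackage); the names PATTERN, struct_names_outside_packages, no_package and with_package are hypothetical. It uses re.finditer instead of findall plus filter, and also exercises the case where the input has no package block at all.

```python
import re

# Branch 1 skips package ... endpackage blocks; branch 2 captures struct names.
PATTERN = re.compile(
    r'\bpackage.*?\bendpackage\b'
    r'|typedef\s+struct\s+packed\s*{[^{}]*}\s*(\w+);',
    re.DOTALL,
)

def struct_names_outside_packages(text):
    """Return names of packed structs that are not inside a package block."""
    # group(1) is None whenever the package branch matched, so skip those.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

# Hypothetical inputs: one without any package, one with a package block.
no_package = """
typedef struct packed { logic a; } foo_t;
typedef struct packed { logic b; } bar_t;
"""
with_package = """
package new;
typedef struct packed { logic a; } abc_y;
endpackage
typedef struct packed { logic a; } abc_x;
"""

print(struct_names_outside_packages(no_package))    # ['foo_t', 'bar_t']
print(struct_names_outside_packages(with_package))  # ['abc_x']
```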