Why is this Python Lark grammar so slow?

I'm trying to parse the output of "ypcat -k netgroup". The output consists of many lines in this format:

group1 (host1,user1,domain1) (host2,user2,domain2) (host3,user3,domain3) ...

or sometimes

group2 groupa groupb groupc ...

I first tried using this Lark grammar:

import subprocess
from lark import Lark

def getNetgroups():
  # No parser= argument, so Lark uses its default parser
  parser = Lark(ypcat_grammer)
  res = subprocess.check_output(['ypcat -k netgroup'], shell=True).decode('utf-8')
  # Parse the whole output in one call
  return parser.parse(res)

ypcat_grammer = r"""
  ?start: _line+
  _line: groupname members NEWLINE
  members: (member|groupname)*
  member: "(" hostname? "," username? "," domainname? ")"
  username: _name
  domainname: _name
  groupname: _name
  hostname: _name
  _name: /([a-zA-Z0-9_\.\-]+)/
  %import common.WS_INLINE
  %import common.NUMBER
  %import common.NEWLINE
  %ignore WS_INLINE
"""

That took 60 seconds to parse 4000 lines! That seemed crazy long, so I wrote a hand-coded parser:

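import re
import subprocess
import pandas as pd

# Matches one "(host,user,domain)" triple; any field may be empty
member = re.compile(r'\(([^,]*),([^,]*),([^,]*)\)')

def parseNetGroups():
  res = subprocess.check_output(['ypcat -k netgroup'], shell=True).decode('utf-8')
  rows = []
  for line in res.split('\n'):
    words = re.split(r'\s+', line)
    groupname = words.pop(0)
    members = []
    for word in words:
      if m:=member.match(word):
        # The body of this branch is missing in the post; storing the
        # (host, user, domain) tuple is an assumption
        members.append(m.groups())
      elif word:
        # Assumed as well: any other non-empty word is a nested group name
        members.append(word)
    rows.append({'GROUPNAME':groupname, 'MEMBERS':members})
  return pd.DataFrame(rows)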

This took 0.8 seconds. What am I doing wrong?
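To reproduce the comparison without access to a NIS server, one option is to time the parsers on synthetic input. The sketch below is only a stand-in: make_sample and best_time are hypothetical helpers, the generated names are made up, and real netgroup data may be shaped differently. It measures the parsing step alone, whereas both functions above also run the ypcat call.

```python
import time

def make_sample(n_lines=4000, members_per_line=5):
    """Build synthetic netgroup text; all names are made up for illustration."""
    lines = []
    for i in range(n_lines):
        triples = " ".join(
            f"(host{i}_{j},user{i}_{j},domain{j})" for j in range(members_per_line)
        )
        lines.append(f"group{i} {triples}")
    # One trailing newline per line, matching the grammar's NEWLINE requirement
    return "\n".join(lines) + "\n"

def best_time(parse, text, repeat=3):
    """Return the fastest wall-clock time in seconds of parse(text) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        parse(text)
        best = min(best, time.perf_counter() - start)
    return best

text = make_sample()
# str.split is only a placeholder parser; substitute the callable to be measured,
# for example the .parse method of a Lark instance.
print(f"{best_time(str.split, text):.4f} s")
```

Taking the best of a few runs reduces noise from other processes; the absolute numbers will still depend on the machine and on how closely the synthetic lines resemble the real data.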


  • Changing to parser='lalr' reduced the runtime to 3.8s. That's good enough for me.