Search code examples
pythoncorpus

converting a treebank of vertical trees to s-expressions


I have a collection of parse trees, and they are in this ascii representation where indentation determines the structure (and closing brackets are implicit). I need to convert them to s-expressions so that parentheses determine the structure. It's a little bit like python's significant whitespace vs. braces. The input format is a vertical representation of trees, like so:

STA:fcl
=S:np
==DN:pron-dem("tia" <*> <Dem> <Du> <dem> DET P NOM)     Tiaj
==H:n("akuzo" <act> <sd> P NOM) akuzoj
=fA:adv("certe")        certe
=P:v-fin("dauxri" <va+TEMP> <mv> FUT VFIN)      dauxros
.

Should become:

(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros) .)

I have code that almost does it, but not quite. There's always a missing paren somewhere; it's getting very frustrating. Should I use a proper parser, maybe a CFG? The current (messy) code is at http://github.com/andreasvc/eodop/blob/master/arbobanko.py


Solution

  • Focusing only on the example you're giving in this Q, and the Q's title about converting vertical trees to S-expressions, something like...:

    import re
    import sys
    
    samp='''S
    =NP
    ==(DT +def) the
    == (N +ani) man
    =VP
    ==V walks'''.splitlines()
    
    relinelev = re.compile(r'(=*)(.*)')
    reclean = re.compile(r'\s*\((\S+)[^)]*\)')
    
    def clean(line):
      return reclean.sub(r'\1', line)
    
    def reparse(tree=samp):
      stack = [-1]
      for line in tree:
        equals, rest = relinelev.match(line).groups()
        linelev = len(equals)
        while linelev < stack[-1]:
          sys.stdout.softspace = False
          print ')',
          curlev = stack.pop()
        if linelev == stack[-1]:
          sys.stdout.softspace = False
          print ')',
        else:
          stack.append(linelev)
        print '(%s' % clean(rest),
      while stack[-1] >= 0:
        sys.stdout.softspace = False
        print ')',
        stack.pop()
      print
    
    reparse()
    

    seems to work, and outputs

    (S (NP (DT the) (N man)) (VP (V walks)))
    

    I realize you're trying to do much more "cleaning" than I'm doing here, but that can be concentrated in the clean function, leaving reparse to deal with the Q's title. If you don't want to print as you go, but rather return the result as a string, the changes are of course quite minor:

    def reparse(tree=samp):
      stack = [-1]
      result = []
      for line in tree:
        equals, rest = relinelev.match(line).groups()
        linelev = len(equals)
        while linelev < stack[-1]:
          result[-1] += ')'
          curlev = stack.pop()
        if linelev == stack[-1]:
          result[-1] += ')'
        else:
          stack.append(linelev)
        result.append('(%s' % clean(rest))
      while stack[-1] >= 0:
        result[-1] += ')'
        stack.pop()
      return ' '.join(result)