Converting NLTK phrase structure trees to BRAT .ann standoff


I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard in terms of part-of-speech annotation, but differs in terms of phrases/chunks.

Accordingly, I've POS tagged my data with NLTK defaults, and made a regex chunker with nltk.RegexpParser. Basically, the output now is an NLTK-style phrase structure tree:

Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])

There is some stuff I want to manually annotate on top of this, however: the systemic grammar breaks down participants and verbal groups into sub-types that probably can't be automatically annotated. So, I was hoping to convert the parse tree format into something an annotation tool (preferably BRAT) could handle, and then go through the text and specify the sub-types manually, as in (one possible solution):

[Image: BRAT annotation]

Perhaps the solution would be to trick BRAT into treating the phrase structure like dependencies? I could modify the chunking regex if need be. Are there any converters out there? (BRAT provides ways of converting from CoNLL-2000 and Stanford CoreNLP, so if I could get the phrase structure into either of those forms, that would be acceptable too.)

Thanks!


Solution

  • Representing a non-binary tree as arcs will be difficult, but it is possible to nest "entity" annotations and use them to represent a constituency parse. Note that I'm not creating nodes for the terminals (part-of-speech tags) of the tree, partly because Brat is not currently good at displaying the unary rules that often apply to terminals. The target format is described in Brat's standoff format documentation.

    First, we need a function that produces standoff annotations. Brat expects standoff as character offsets, but this function works in token offsets; they are converted to characters further below.

    (Note: this uses NLTK 3.0b and Python 3. In particular, tree.label() is the NLTK 3 accessor; NLTK 2 used tree.node instead.)

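    def _standoff(path, leaves, slices, offset, tree):
        """Collect token-offset spans for `tree` and every subtree below it.

        Appends tokens to `leaves` and (path, label, start, stop) tuples to
        `slices`; returns the number of tokens `tree` covers.
        """
        width = 0
        for i, child in enumerate(tree):
            if isinstance(child, tuple):
                # Terminal: a (token, POS tag) pair; it gets no span of its own.
                tok, tag = child
                leaves.append(tok)
                width += 1
            else:
                # Subtree: recurse, tracking the child's position in `path`.
                path.append(i)
                width += _standoff(path, leaves, slices, offset + width, child)
                path.pop()
        # Post-order: a node is recorded after all of its descendants.
        slices.append((tuple(path), tree.label(), offset, offset + width))
        return width


    def standoff(tree):
        """Return (leaf tokens, token-offset spans) for an NLTK tree."""
        leaves = []
        slices = []
        _standoff([], leaves, slices, 0, tree)
        return leaves, slices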
    

    Applying this to your example:

    >>> from nltk.tree import Tree
    >>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
    >>> standoff(tree)
    (['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
     [((0, 0, 0), 'Participant', 0, 1),
      ((0, 0, 1), 'Verbal-group', 1, 2),
      ((0, 0, 2), 'Participant', 2, 4),
      ((0, 0, 3), 'Circumstance', 4, 7),
      ((0, 0), 'Process-dependencies', 0, 7),
      ((0,), 'Clause', 0, 7),
      ((), 'S', 0, 8)])
    

    This returns the leaf tokens, followed by a list of tuples, one per subtree, with the elements (path from the root, label, start leaf, stop leaf). The stop index is exclusive, as in Python slices.
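
The path element is not used by the Brat export, but it could be used to recover which span is nested directly inside which, for example if you later wanted to link spans explicitly. A minimal sketch, where parent_links is a hypothetical helper (not part of NLTK or Brat) and the spans are copied from the output above:

```python
# Spans exactly as returned by standoff(tree) for the example sentence.
slices = [
    ((0, 0, 0), 'Participant', 0, 1),
    ((0, 0, 1), 'Verbal-group', 1, 2),
    ((0, 0, 2), 'Participant', 2, 4),
    ((0, 0, 3), 'Circumstance', 4, 7),
    ((0, 0), 'Process-dependencies', 0, 7),
    ((0,), 'Clause', 0, 7),
    ((), 'S', 0, 8),
]


def parent_links(slices):
    """Map each span's index to the index of its parent span (None for the root).

    Hypothetical helper: relies on every non-terminal having a span, which
    holds for the output of standoff() because only terminals are skipped.
    """
    index_of = {path: i for i, (path, _label, _start, _stop) in enumerate(slices)}
    links = {}
    for i, (path, _label, _start, _stop) in enumerate(slices):
        # Dropping the last step of a path gives the path of the parent node.
        links[i] = index_of[path[:-1]] if path else None
    return links


links = parent_links(slices)
for i, (path, label, start, stop) in enumerate(slices):
    parent = links[i]
    parent_label = slices[parent][1] if parent is not None else None
    print(label, '->', parent_label)
```

Because the spans are listed in post-order, a parent always appears after its children, which is why the lookup table is built before any link is resolved.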

    To convert this into character standoff:

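    def char_standoff(tree):
        """Return the space-joined text and (path, label, start, stop) character spans."""
        leaves, tok_standoff = standoff(tree)
        text = ' '.join(leaves)
        # Map each leaf index to its start character; the extra final entry
        # lets an exclusive end_tok index one past the last leaf.
        starts = []
        offset = 0
        for leaf in leaves:
            starts.append(offset)
            offset += len(leaf) + 1  # +1 for the joining space
        starts.append(offset)
        # starts[end_tok] is where the next token begins, so subtracting 1
        # drops the joining space and leaves an exclusive end offset.
        return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                      for path, label, start_tok, end_tok in tok_standoff]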
    

    Then:

    >>> char_standoff(tree)
    ('This is a representation of the grammar .',
     [((0, 0, 0), 'Participant', 0, 4),
      ((0, 0, 1), 'Verbal-group', 5, 7),
      ((0, 0, 2), 'Participant', 8, 24),
      ((0, 0, 3), 'Circumstance', 25, 39),
      ((0, 0), 'Process-dependencies', 0, 39),
      ((0,), 'Clause', 0, 39),
      ((), 'S', 0, 41)])
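
Note that these offsets index into the space-joined reconstruction of the sentence, not into your original corpus text. If you would rather annotate the original text, one option is to locate each token in it and map token slices through those positions. The following is only a sketch: align_tokens is a hypothetical helper, and it assumes the tokens appear in the raw text in order and unchanged (it will fail if the tokenizer rewrote any token, for example by normalising quotes).

```python
def align_tokens(raw_text, tokens):
    """Return (start, end) character offsets of each token in raw_text.

    Assumes the tokens occur in raw_text in order and unchanged.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        # Search only to the right of the previous token.
        start = raw_text.find(tok, cursor)
        if start == -1:
            raise ValueError('token {!r} not found after offset {}'.format(tok, cursor))
        end = start + len(tok)
        spans.append((start, end))
        cursor = end
    return spans


raw = 'This is a representation of the grammar.'
tokens = ['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.']
token_spans = align_tokens(raw, tokens)

# A token-level slice (start_tok, end_tok) maps to characters like this:
start_tok, end_tok = 4, 7  # the 'Circumstance' span from standoff(tree)
char_start = token_spans[start_tok][0]
char_end = token_spans[end_tok - 1][1]  # end_tok is exclusive
print(raw[char_start:char_end])
```

With this approach the .txt file would hold the original text, and the end offset stays exclusive, matching the spans above.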
    

    Finally, we can write a function that converts this to Brat's format:

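    def write_brat(tree, filename_prefix):
        """Write `tree` as a Brat document: <prefix>.txt and <prefix>.ann."""
        # Named `spans` to avoid shadowing the standoff() function above.
        text, spans = char_standoff(tree)
        with open(filename_prefix + '.txt', 'w', encoding='utf-8') as f:
            print(text, file=f)
        with open(filename_prefix + '.ann', 'w', encoding='utf-8') as f:
            for i, (path, label, start, stop) in enumerate(spans):
                # One annotation per line: ID, "label start stop", covered text.
                print('T{}'.format(i), '{} {} {}'.format(label, start, stop), text[start:stop], sep='\t', file=f)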
    

    Calling write_brat(tree, '/path/to/something') writes the following to /path/to/something.txt:

    This is a representation of the grammar .
    

    and this to /path/to/something.ann (the three fields on each line are tab-separated):

    T0  Participant 0 4 This
    T1  Verbal-group 5 7    is
    T2  Participant 8 24    a representation
    T3  Circumstance 25 39  of the grammar
    T4  Process-dependencies 0 39   This is a representation of the grammar
    T5  Clause 0 39 This is a representation of the grammar
    T6  S 0 41  This is a representation of the grammar .
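
Before loading the files into Brat, it may be worth checking that every offset pair really selects the text recorded next to it. The sketch below is one way to do that; check_ann is a hypothetical helper, and it assumes only contiguous text-bound annotations of the kind write_brat produces (a single "start stop" pair per line).

```python
def check_ann(text, ann_lines):
    """Return the IDs of text-bound annotations whose offsets do not match their text."""
    bad = []
    for line in ann_lines:
        line = line.rstrip('\n')
        if not line.startswith('T'):
            continue  # only text-bound annotations carry offsets
        ann_id, type_and_offsets, covered = line.split('\t', 2)
        label, start, stop = type_and_offsets.split(' ')
        # The end offset is exclusive, so a plain slice recovers the span.
        if text[int(start):int(stop)] != covered:
            bad.append(ann_id)
    return bad


text = 'This is a representation of the grammar .'
ann_lines = [
    'T0\tParticipant 0 4\tThis',
    'T3\tCircumstance 25 39\tof the grammar',
    'T6\tS 0 41\tThis is a representation of the grammar .',
]
print(check_ann(text, ann_lines))
```

For files on disk, the same function can be fed the contents of the .txt file (without its trailing newline) and the lines of the .ann file.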