Converting NLTK phrase structure trees to BRAT .ann standoff

I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard in terms of part-of-speech annotation, but differs in terms of phrases/chunks.

Accordingly, I've POS tagged my data with NLTK defaults, and made a regex chunker with nltk.RegexpParser. Basically, the output now is an NLTK-style phrase structure tree:

Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])

There is some stuff I want to manually annotate on top of this, however: the systemic grammar breaks down participants and verbal groups into sub-types that probably can't be automatically annotated. So, I was hoping to convert the parse tree format into something an annotation tool (preferably BRAT) could handle, and then go through the text and specify the sub-types manually, as in (one possible solution):

[Image: BRAT annotation]

Perhaps the solution would be to trick BRAT into treating the phrase structure like dependencies? I could modify the chunking regex if need be. Are there any converters out there? (BRAT provides ways of converting from CoNLL-2000 and Stanford CoreNLP, so getting the phrase structure into either of those forms would be acceptable too.)

Thanks!

Solution

Representing a non-binary tree as arcs will be difficult, but it is possible to nest "entity" annotations and use them to represent a constituency parse. Note that I'm not creating nodes for the terminals (part-of-speech tags) of the tree, partly because Brat is not currently good at displaying the unary rules that often apply to terminals. The target format is Brat's standoff format, which is described in the Brat documentation.

First, we need a function that produces standoff annotations. Brat expects standoff as character offsets, but the function below works in token offsets; these are converted to character offsets further down.

(Note: this uses NLTK 3.0b and Python 3. In NLTK 2 the node label is the attribute tree.node rather than the method tree.label().)

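def _standoff(path, leaves, slices, offset, tree):
    # Recursively collect leaf tokens and one (path, label, start, stop)
    # slice per subtree; returns the number of leaves under `tree`.
    width = 0
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            # Terminal: a (token, POS tag) pair
            tok, tag = child
            leaves.append(tok)
            width += 1
        else:
            # Non-terminal: recurse, tracking the child-index path from the root
            path.append(i)
            width += _standoff(path, leaves, slices, offset + width, child)
            path.pop()
    slices.append((tuple(path), tree.label(), offset, offset + width))
    return width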


def standoff(tree):
    leaves = []
    slices = []
    _standoff([], leaves, slices, 0, tree)
    return leaves, slices

Applying this to your example:

>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
>>> standoff(tree)
(['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
 [((0, 0, 0), 'Participant', 0, 1),
  ((0, 0, 1), 'Verbal-group', 1, 2),
  ((0, 0, 2), 'Participant', 2, 4),
  ((0, 0, 3), 'Circumstance', 4, 7),
  ((0, 0), 'Process-dependencies', 0, 7),
  ((0,), 'Clause', 0, 7),
  ((), 'S', 0, 8)])

This returns the leaf tokens, followed by a list of tuples, one per subtree, with the elements (path of child indices from the root, label, start leaf, stop leaf). The stop index is exclusive, as in a Python slice.
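
The first element of each tuple is a path of child indices, so it can be used to get back to the subtree that a span came from. Here is a minimal sketch of that lookup. `subtree_at` is a hypothetical helper, not part of NLTK, and plain nested lists stand in for the Tree so the snippet runs without NLTK installed. NLTK's Tree should also accept the tuple directly, as in tree[(0, 0, 2)].

```python
from functools import reduce
from operator import getitem


def subtree_at(tree, path):
    """Follow a path of child indices down from the root (hypothetical helper)."""
    # An empty path returns the root itself.
    return reduce(getitem, path, tree)


# Nested lists standing in for the example tree (labels omitted).
toy = [[[[('This', 'DT')],
         [('is', 'VBZ')],
         [('a', 'DT'), ('representation', 'NN')],
         [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')]]],
       ('.', '.')]

print(subtree_at(toy, (0, 0, 2)))
# [('a', 'DT'), ('representation', 'NN')]
```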

To convert this into character standoff:

def char_standoff(tree):
    leaves, tok_standoff = standoff(tree)
    text = ' '.join(leaves)
    # Map leaf index to its start and end character
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1
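    starts.append(offset)  # sentinel entry so that end_tok == len(leaves) is valid
    # starts[end_tok] points just past the space that follows the span's last
    # token, so subtracting 1 gives the exclusive end offset of the span.
    return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                  for path, label, start_tok, end_tok in tok_standoff]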

Then:

>>> char_standoff(tree)
('This is a representation of the grammar .',
 [((0, 0, 0), 'Participant', 0, 4),
  ((0, 0, 1), 'Verbal-group', 5, 7),
  ((0, 0, 2), 'Participant', 8, 24),
  ((0, 0, 3), 'Circumstance', 25, 39),
  ((0, 0), 'Process-dependencies', 0, 39),
  ((0,), 'Clause', 0, 39),
  ((), 'S', 0, 41)])
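
Note that these offsets refer to the text rebuilt by joining the tokens with single spaces ("grammar ." rather than "grammar."), not to the original corpus file. If the annotations have to line up with the original text, one option is to locate each token in that text and convert token slices through those positions. The sketch below assumes every token appears verbatim and in order in the original text. That assumption may not hold everywhere, since some tokenizers rewrite quotation marks, for example. `align_tokens` and `to_char_span` are hypothetical helpers, not part of NLTK or Brat.

```python
def align_tokens(text, tokens):
    """Return the (start, end) character span of each token in text."""
    spans = []
    pos = 0
    for tok in tokens:
        # Search from the end of the previous token so repeated words
        # are matched in order.
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError('token {!r} not found after offset {}'.format(tok, pos))
        pos = start + len(tok)
        spans.append((start, pos))
    return spans


def to_char_span(spans, start_tok, end_tok):
    """Convert a half-open token slice to a half-open character slice."""
    return spans[start_tok][0], spans[end_tok - 1][1]


original = 'This is a representation of the grammar.'
tokens = ['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.']
spans = align_tokens(original, tokens)
start, stop = to_char_span(spans, 4, 7)  # the Circumstance slice
print(start, stop, original[start:stop])
# 25 39 of the grammar
```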

Finally, we can write a function that converts this to Brat's format:

def write_brat(tree, filename_prefix):
    # `spans` rather than `standoff`, to avoid shadowing the function above
    text, spans = char_standoff(tree)
    with open(filename_prefix + '.txt', 'w') as f:
        print(text, file=f)
    with open(filename_prefix + '.ann', 'w') as f:
        for i, (path, label, start, stop) in enumerate(spans):
            # Brat text-bound line: ID <TAB> type start end <TAB> covered text
            print('T{}'.format(i), '{} {} {}'.format(label, start, stop),
                  text[start:stop], sep='\t', file=f)

Calling write_brat(tree, '/path/to/something') writes the following to /path/to/something.txt:

This is a representation of the grammar .

and this to /path/to/something.ann (the ID, the type with its offsets, and the covered text are separated by tabs):

T0  Participant 0 4 This
T1  Verbal-group 5 7    is
T2  Participant 8 24    a representation
T3  Circumstance 25 39  of the grammar
T4  Process-dependencies 0 39   This is a representation of the grammar
T5  Clause 0 39 This is a representation of the grammar
T6  S 0 41  This is a representation of the grammar .
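
As far as I know, Brat also expects the entity types used in the .ann file to be declared in an annotation.conf file in the same collection directory. A minimal sketch that builds such a file from the labels follows. `annotation_conf` is a hypothetical helper, and the four section headers are what I understand Brat to look for, so check the result against the Brat configuration documentation. I have also not verified how Brat treats hyphenated type names such as Process-dependencies.

```python
def annotation_conf(labels):
    """Build a minimal annotation.conf body declaring each label once."""
    seen = []
    for label in labels:
        if label not in seen:  # keep first-seen order, drop duplicates
            seen.append(label)
    lines = ['[entities]'] + seen
    # Empty sections for the annotation kinds that are not used here.
    lines += ['', '[relations]', '', '[events]', '', '[attributes]']
    return '\n'.join(lines) + '\n'


labels = ['Participant', 'Verbal-group', 'Participant', 'Circumstance',
          'Process-dependencies', 'Clause', 'S']
print(annotation_conf(labels))
```

The returned string can be written to annotation.conf next to the .txt and .ann files. The manual sub-types could later be added as further entity types or as attributes.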