Tags: python, protocol-buffers, stanford-nlp

Read Protobuf Serialization of StanfordNLP Output in Python


I would like to output the StanfordNLP results in protobuf (since the serialized output is much smaller) and read the results back in Python. How should I do that?

I followed the instructions here to output the results serialized with ProtobufAnnotationSerializer, like this:

java -cp "stanford-corenlp-full-2015-12-09/*" \
edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit \
-file input.txt \
-outputFormat serialized \
-outputSerializer \
edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

Then I used protoc to compile CoreNLP.proto, which ships with the StanfordNLP source code, into a Python module:

protoc --python_out=. CoreNLP.proto

Then, in Python, I read the file back like this:

import CoreNLP_pb2
doc = CoreNLP_pb2.Document()
doc.ParseFromString(open('input.txt.ser.gz', 'rb').read())

The parsing fails with the following error message:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-213-d8eaeb9c2048> in <module>()
      1 doc = CoreNLP_pb2.Document()
----> 2 doc.ParseFromString(open('imed/s5_tokenized/conv-00000.ser.gz', 'rb').read())

/usr/local/lib/python2.7/dist-packages/google/protobuf/message.pyc in ParseFromString(self, serialized)
    183     """
    184     self.Clear()
--> 185     self.MergeFromString(serialized)
    186 
    187   def SerializeToString(self):

/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/python_message.pyc in MergeFromString(self, serialized)
   1092         # The only reason _InternalParse would return early is if it
   1093         # encountered an end-group tag.
-> 1094         raise message_mod.DecodeError('Unexpected end-group tag.')
   1095     except (IndexError, TypeError):
   1096       # Now ord(buf[p:p+1]) == ord('') gets TypeError.

DecodeError: Unexpected end-group tag.

UPDATE:

I asked Gabor Angeli, the author of the serializer, and got the answer: the protobuf objects were written to the files with writeDelimitedTo in this line. Changing it to writeTo makes the output files readable in Python.
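
For reference, with writeTo the file then contains a single serialized Document, which can be parsed directly. A minimal sketch, assuming the output is still gzip-compressed as the .ser.gz extension suggests:

import gzip
import CoreNLP_pb2

doc = CoreNLP_pb2.Document()
# A single Document written with writeTo() parses directly once the
# gzip layer is stripped off.
with gzip.open('input.txt.ser.gz', 'rb') as f:
    doc.ParseFromString(f.read())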


Solution

  • This question seems to have come up again, so I figured I'd write up a proper answer. The root of the issue is that the proto is written using Java's writeDelimitedTo method, which Google has not implemented for Python. A workaround is to use the following method to read the proto file (assuming the file is not gzipped -- you can replace f.read() with code that decompresses the file first; see the sketch after the function):

    from google.protobuf.internal.decoder import _DecodeVarint
    import CoreNLP_pb2
    
    def readCoreNLPProtoFile(protoFile):
      protos = []
      with open(protoFile, 'rb') as f:
        # -- Read the file --
        data = f.read()
        # -- Parse the file --
        # In Java, there's a parseDelimitedFrom() method that makes this easier
        pos = 0
        while pos < len(data):
          # (read the proto: each message is prefixed with its varint-encoded length)
          (size, pos) = _DecodeVarint(data, pos)
          proto = CoreNLP_pb2.Document()
          proto.ParseFromString(data[pos:(pos + size)])
          pos += size
          # (add the proto to the list; or, `yield proto`)
          protos.append(proto)
      return protos
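
    Since the pipeline writes files with a .ser.gz extension, the unzipping mentioned above can be done with the standard gzip module. A minimal sketch of the substitution inside readCoreNLPProtoFile, assuming the file really is gzip-compressed as the extension suggests:

    import gzip

    # Replace the plain open()/f.read() above with a gzip read:
    with gzip.open(protoFile, 'rb') as f:
      data = f.read()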
    

    The CoreNLP_pb2 module is generated from the CoreNLP.proto file in the repo with the command:

    protoc --python_out /path/to/output/ /path/to/CoreNLP.proto
    

    Note that as of this writing (CoreNLP version 3.7.0), the format is proto2, not proto3.
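
    As a usage sketch, the parsed documents can then be traversed through the generated message fields. The field names below (Document.sentence, Sentence.token, Token.word) follow the definitions in CoreNLP.proto:

    # Assumes the .ser.gz file has already been decompressed (or the
    # gzip variant above is used inside readCoreNLPProtoFile).
    docs = readCoreNLPProtoFile('input.txt.ser')
    for doc in docs:
      for sentence in doc.sentence:
        # Token.word holds the token text produced by the tokenize annotator
        print(' '.join(token.word for token in sentence.token))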