I would like to output the StanfordNLP results in protobuf (since the output is much smaller) and read the results back in Python. How should I do that?
I followed the instructions here to output the results serialized with ProtobufAnnotationSerializer, like this:
java -cp "stanford-corenlp-full-2015-12-09/*" \
edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit \
-file input.txt \
-outputFormat serialized \
-outputSerializer \
edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer
Then I used protoc to compile CoreNLP.proto, which comes with the StanfordNLP source code, into Python modules like this:
protoc --python_out=. CoreNLP.proto
Then in python I read the files back like this:
import CoreNLP_pb2
doc = CoreNLP_pb2.Document()
doc.ParseFromString(open('input.txt.ser.gz', 'rb').read())
The parsing fails with the following error message:
---------------------------------------------------------------------------
DecodeError Traceback (most recent call last)
<ipython-input-213-d8eaeb9c2048> in <module>()
1 doc = CoreNLP_pb2.Document()
----> 2 doc.ParseFromString(open('imed/s5_tokenized/conv-00000.ser.gz', 'rb').read())
/usr/local/lib/python2.7/dist-packages/google/protobuf/message.pyc in ParseFromString(self, serialized)
183 """
184 self.Clear()
--> 185 self.MergeFromString(serialized)
186
187 def SerializeToString(self):
/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/python_message.pyc in MergeFromString(self, serialized)
1092 # The only reason _InternalParse would return early is if it
1093 # encountered an end-group tag.
-> 1094 raise message_mod.DecodeError('Unexpected end-group tag.')
1095 except (IndexError, TypeError):
1096 # Now ord(buf[p:p+1]) == ord('') gets TypeError.
DecodeError: Unexpected end-group tag.
UPDATE:
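I asked the author of the serializer, Gabor Angeli, and got the answer. The protobuf objects were written to the files with writeDelimitedTo in this line, which prefixes each message with its length as a varint, so the file is not a bare serialized message. Changing it to writeTo would make the output files readable in Python with ParseFromString.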
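To illustrate why ParseFromString chokes on these files, here is a minimal sketch of the length-delimited framing, assuming Python 3 (where indexing a bytes object yields an int). The helper names decode_varint and split_delimited are hypothetical and not part of protobuf or CoreNLP; they only show that each message is preceded by a base-128 varint giving its size.

```python
def decode_varint(data, pos):
    """Decode one base-128 varint from `data` starting at `pos`.

    Returns (value, position just past the varint).
    """
    value = 0
    shift = 0
    while True:
        byte = data[pos]  # indexing bytes yields an int in Python 3
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if byte & 0x80 == 0:             # high bit clear: last byte
            return value, pos
        shift += 7


def split_delimited(data):
    """Split a length-delimited stream into raw message payloads."""
    payloads = []
    pos = 0
    while pos < len(data):
        size, pos = decode_varint(data, pos)
        payloads.append(data[pos:pos + size])
        pos += size
    return payloads
```

Each payload returned by split_delimited is what ParseFromString expects; feeding it the whole file makes it try to interpret the length prefix as a field tag.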
This question seems to have come up again, so I figured I'd write up a proper answer. The root of the issue is that the proto is written using Java's writeDelimitedTo method, which Google has not implemented for Python. A workaround is to read the proto file with the following function (this assumes the file is not gzipped; if it is, replace f.read() with code that decompresses the file first):
    from google.protobuf.internal.decoder import _DecodeVarint
    import CoreNLP_pb2

    def readCoreNLPProtoFile(protoFile):
        protos = []
        with open(protoFile, 'rb') as f:
            # -- Read the file --
            data = f.read()
        # -- Parse the file --
        # In Java, there's a parseDelimitedFrom() method that makes this easier
        pos = 0
        while pos < len(data):
            # (read the length prefix, then the proto of that length)
            (size, pos) = _DecodeVarint(data, pos)
            proto = CoreNLP_pb2.Document()
            proto.ParseFromString(data[pos:(pos + size)])
            pos += size
            # (add the proto to the list; or, `yield proto`)
            protos.append(proto)
        return protos
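If the file may or may not be gzipped, one option is to check for the gzip magic bytes before parsing. This is a rough sketch, assuming Python 3; read_possibly_gzipped is a hypothetical helper name, and the magic-byte check is a heuristic rather than anything CoreNLP specifies.

```python
import gzip


def read_possibly_gzipped(path):
    """Return the raw bytes of `path`, decompressing if it looks like gzip."""
    with open(path, 'rb') as f:
        data = f.read()
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if data[:2] == b'\x1f\x8b':
        data = gzip.decompress(data)
    return data
```

The returned bytes can then be used in place of the result of f.read() in the parsing loop.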
The file CoreNLP_pb2 is compiled from the CoreNLP.proto file in the repo with the command:
    protoc --python_out /path/to/output/ /path/to/CoreNLP.proto
Note that as of writing this (version 3.7.0) the format is proto2, not proto3.