
How to get protobuf extension field in ProtobufAnnotationSerializer


I am new to protocol buffers and am trying to figure out how to extend a message type in the Stanford CoreNLP library, as described here: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/ProtobufAnnotationSerializer.html

The problem: I can set the extension field, but I can't get it back. I boiled the problem down to the code below. In the original message the field appears by name as [edu.stanford.nlp.pipeline.myNewField], but in the deserialized message it is replaced by the field number 101.

How can I get the value of myNewField?

PS: This post https://stackoverflow.com/questions/28815214/how-to-set-get-protobufs-extension-field-in-go suggests that it should be as easy as calling getExtension(MyAppProtos.myNewField).

custom.proto

syntax = "proto2";

package edu.stanford.nlp.pipeline;

option java_package = "com.example.my.awesome.nlp.app";
option java_outer_classname = "MyAppProtos";

import "CoreNLP.proto";

extend Sentence {
    optional uint32 myNewField = 101;
}

ProtoTest.java

import com.example.my.awesome.nlp.app.MyAppProtos;
import com.google.protobuf.ExtensionRegistry;
import com.google.protobuf.InvalidProtocolBufferException;

import edu.stanford.nlp.pipeline.CoreNLPProtos;
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence;

public class ProtoTest {

    static {
        ExtensionRegistry registry = ExtensionRegistry.newInstance();
        registry.add(MyAppProtos.myNewField);
        CoreNLPProtos.registerAllExtensions(registry);
    }

    public static void main(String[] args) throws InvalidProtocolBufferException {

        Sentence originalSentence = Sentence.newBuilder()
                .setText("Hello world!")
                .setTokenOffsetBegin(0)
                .setTokenOffsetEnd(12)
                .setExtension(MyAppProtos.myNewField, 13)
                .build();

        System.out.println("Original:\n" + originalSentence);

        byte[] serialized = originalSentence.toByteArray();

        Sentence deserializedSentence = Sentence.parseFrom(serialized);
        System.out.println("Deserialized:\n" + deserializedSentence);

        Integer myNewField = deserializedSentence.getExtension(MyAppProtos.myNewField);
        System.out.println("MyNewField: " + myNewField);
    }
}

Output:

Original:
tokenOffsetBegin: 0
tokenOffsetEnd: 12
text: "Hello world!"
[edu.stanford.nlp.pipeline.myNewField]: 13

Deserialized:
tokenOffsetBegin: 0
tokenOffsetEnd: 12
text: "Hello world!"
101: 13

MyNewField: 0

Update: Because this question was about extending CoreNLP message types and using them with the ProtobufAnnotationSerializer, here is what my extended serializer looks like:

import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

import com.example.my.awesome.nlp.app.MyAppProtos;
import com.google.protobuf.ExtensionRegistry;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.CoreNLPProtos;
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence;
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence.Builder;
import edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Pair;

public class MySerializer extends ProtobufAnnotationSerializer {

    private static ExtensionRegistry registry;

    static {
        registry = ExtensionRegistry.newInstance();
        registry.add(MyAppProtos.myNewField);
        CoreNLPProtos.registerAllExtensions(registry);
    }

    @Override
    protected Builder toProtoBuilder(CoreMap sentence, Set<Class<?>> keysToSerialize) {

        keysToSerialize.remove(MyAnnotation.class);
        Builder builder = super.toProtoBuilder(sentence, keysToSerialize);
        builder.setExtension(MyAppProtos.myNewField, 13);

        return builder;
    }

    @Override
    public Pair<Annotation, InputStream> read(InputStream is)
            throws IOException, ClassNotFoundException, ClassCastException {

        CoreNLPProtos.Document doc = CoreNLPProtos.Document.parseDelimitedFrom(is, registry);
        return Pair.makePair(fromProto(doc), is);
    }

    @Override
    protected CoreMap fromProtoNoTokens(Sentence proto) {

        CoreMap result = super.fromProtoNoTokens(proto);
        result.set(MyAnnotation.class, proto.getExtension(MyAppProtos.myNewField));

        return result;
    }
}
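The serializer above refers to a MyAnnotation key that isn't shown in the post. Assuming it is an ordinary integer-valued CoreNLP annotation key (the class name and the Integer type are taken from how it is used above; everything else is an assumption), a minimal sketch could look like this:

```java
import edu.stanford.nlp.ling.CoreAnnotation;

// Hypothetical CoreMap key backing myNewField. The name MyAnnotation is
// taken from the serializer above; the implementation is a minimal sketch.
public class MyAnnotation implements CoreAnnotation<Integer> {
    @Override
    public Class<Integer> getType() {
        return Integer.class;
    }
}
```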

Solution

  • The mistake was that I didn't provide the parseFrom call with the extension registry.

    Changing Sentence deserializedSentence = Sentence.parseFrom(serialized); to Sentence deserializedSentence = Sentence.parseFrom(serialized, registry); did the job!
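For completeness, a sketch of the corrected round trip (variable names reused from ProtoTest.java above). The likely explanation for the `101: 13` line, assuming standard proto2 behavior: without the registry the parser cannot associate field number 101 with the extension, so the value is carried along as an unknown field, which is why getExtension returned the default 0.

```java
// Parse with the registry so field number 101 is recognized as myNewField.
Sentence deserializedSentence = Sentence.parseFrom(serialized, registry);
Integer myNewField = deserializedSentence.getExtension(MyAppProtos.myNewField); // 13

// Without the registry the value is not lost: proto2 keeps it as an
// unknown field, reachable only through the unknown-field API.
Sentence withoutRegistry = Sentence.parseFrom(serialized);
boolean keptAsUnknown = withoutRegistry.getUnknownFields().hasField(101); // true
```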