apache-kafka logstash avro avro-tools

Logstash Avro output cannot be decoded by Apache avro-tools


I am a beginner with both Logstash and Avro. We are setting up a system with Logstash as the producer for a Kafka queue. However, the Avro-serialized events produced by Logstash cannot be decoded by the avro-tools jar (version 1.8.2) that Apache provides. We also notice that the serialized output of Logstash differs from that of avro-tools.

We have the following setup:

  • Logstash version 5.5
  • Logstash Avro codec version 3.2.1
  • Kafka version 0.10.1
  • avro-tools jar version 1.8.2

As an example, consider the following schema:

{
"name" : "avroTestSchema",
"type" : "record",
"fields" : [ {
  "name" : "testfield1",
  "type" : "string"
  },
  {
  "name" : "testfield2",
  "type" : "string"
  }
]
}

and the following JSON string:

{"testfield1":"somestring","testfield2":"anotherstring"}

Serializing with Logstash, using this config file:

input {
  stdin {
    codec => json
  }
}

filter {
  mutate {
    remove_field => ["@timestamp", "@version"]
  }
}

output {
  kafka {
    bootstrap_servers => "localhost:9092"
    codec => avro {
      schema_uri => "/path/to/TestSchema.avsc"
    }
    topic_id => "avrotestout"
  }
  stdout {
    codec => rubydebug
  }
}

output (using cat):

FHNvbWVzdHJpbmcaYW5vdGhlcnN0cmluZw==  

Serializing with avro-tools. Command:

java -jar avro-tools-1.8.2.jar jsontofrag --schema-file TestSchema.avsc message.json

output:

somestringanotherstring
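
For reference, here is a minimal Python sketch of what a schema-less Avro fragment for this record should contain, assuming the Avro binary encoding rules: a string is written as a zigzag varint length followed by its UTF-8 bytes, and a record is simply its fields concatenated in schema order. The helper names are made up for illustration and are not part of any Avro library.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer as an Avro long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(values) -> bytes:
    """Avro record: field encodings concatenated in schema order."""
    return b"".join(encode_string(v) for v in values)


# testfield1 and testfield2 from the example message
fragment = encode_record(["somestring", "anotherstring"])
print(fragment)  # b'\x14somestring\x1aanotherstring'
```

The two length prefixes (0x14 and 0x1a) are control characters that a terminal does not display, which would explain why the output above looks like the two strings glued together.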

command:

java -jar avro-tools-1.8.2.jar fromjson --schema-file TestSchema.avsc message.json

output:

Objavro.codenullavro.schema▒{"type":"record","name":"avroTestSchema","fields":[{"name":"testfield1","type":"string"},{"name":"testfield2","type":"string"}]}▒▒▒▒&70▒▒Hs▒U2somestringanotherstring▒▒▒▒&70▒▒Hs▒U

So our question is: how do we configure Logstash so that its output is compatible with the Apache avro-tools jar?

UPDATE: We found out that the Avro output produced by Logstash is base64 encoded. However, we cannot find where this happens or how to make the output compatible with avro-tools.
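
The base64 observation can be checked with a short Python sketch. It decodes the Logstash output shown earlier and then reads the two string fields by hand, assuming the Avro binary encoding (zigzag varint length followed by UTF-8 bytes). The reader functions are hypothetical helpers written for this check, not an Avro library API.

```python
import base64


def read_long(buf: bytes, pos: int):
    """Read one Avro long (base-128 varint, zigzag) starting at pos."""
    shift = 0
    result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # continuation bit clear: last byte of the varint
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos  # undo zigzag


def read_string(buf: bytes, pos: int):
    """Read one Avro string: length prefix, then that many UTF-8 bytes."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


# Output captured from Logstash in the question
logstash_output = "FHNvbWVzdHJpbmcaYW5vdGhlcnN0cmluZw=="
raw = base64.b64decode(logstash_output)

# Fields are read in schema order: testfield1, then testfield2
field1, pos = read_string(raw, 0)
field2, pos = read_string(raw, pos)
record = {"testfield1": field1, "testfield2": field2}
print(record)
```

If the decoded bytes parse cleanly like this, the payload underneath the base64 layer is an ordinary schema-less Avro fragment, the same kind of data that jsontofrag writes, so a consumer only needs to strip the base64 layer before handing the bytes to an Avro decoder.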


Solution

  • As mentioned in the update, the standard Logstash Avro codec applies a non-optional base64 encoding to its Avro output. We found this undesirable, so we forked the codec and made the encoding configurable. We tested the fork and it worked out of the box on several of our systems.

    The fork is available on GitHub: https://github.com/Rubyan/logstash-codec-avro

    To enable or disable the base64 encoding, set base64_encoding in the avro codec block of your Logstash config file:

    output {
        stdout {
            codec => avro {
                schema_uri => "schema.avsc"
                base64_encoding => false
            }
        }
    }