Brutally slow Apache Avro performance in Python, different results when encoding to messages vs. files


So, following the answer here: Encode an object with Avro to a byte array in Python, I am able to send messages through ZeroMQ - but the performance is brutally slow.
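For reference, that answer's approach boils down to something like the following sketch, using the reference avro package's DatumWriter and BinaryEncoder (the schema and record here are placeholders, not my actual data):

    import io

    import avro.io
    import avro.schema

    # Placeholder schema standing in for the real one.
    schema = avro.schema.parse('''
        {"name": "Person", "type": "record",
         "fields": [{"name": "name", "type": "string"},
                    {"name": "age", "type": "int"}]}
    ''')

    buf = io.BytesIO()
    encoder = avro.io.BinaryEncoder(buf)  # Pure-Python binary encoder.
    writer = avro.io.DatumWriter(schema)
    writer.write({'name': 'Ann', 'age': 23}, encoder)  # Datum -> Avro binary.
    message = buf.getvalue()  # Bytes handed to the ZeroMQ socket.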

This is to be expected, since the Avro Python implementation is pure Python, and we see similar performance comments from the author(s) of FastAvro. AFAIK, FastAvro cannot be used to generate a message for use with a message queue; it is geared toward writing to files.

So, back to the link above: I'm curious to know whether this method is more complicated than it actually needs to be - it seems strange that the Avro DatumWriter cannot natively be used to create something suitable for messaging.

This leads me to my final point (and the reason for my suspicion). When I use the standard example from Getting Started with Avro (Python), I can stream one of my binary files to the .avro file and it comes in at around 5.8MB. When I use the message method to encode it as a byte array, the total message size ends up at 11MB. Why is there such a huge discrepancy between these methods? Presumably they would be quite similar...

Please note that I've removed the deflate codec from the writer example to ensure it's an apples-to-apples comparison. When deflate is enabled, the size is just 2.8MB.
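For context, the only difference between the two file-writer runs is the codec argument on DataFileWriter; a minimal sketch of the setup (the file name and schema are placeholders):

    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # Placeholder schema; the real one describes my binary records.
    schema = avro.schema.parse('''
        {"name": "Person", "type": "record",
         "fields": [{"name": "name", "type": "string"},
                    {"name": "age", "type": "int"}]}
    ''')

    # codec='null' (the default) for the apples-to-apples comparison;
    # codec='deflate' is what produced the 2.8MB file.
    writer = DataFileWriter(open('data.avro', 'wb'), DatumWriter(), schema, codec='null')
    writer.append({'name': 'Ann', 'age': 23})
    writer.close()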


Solution

  • I'm not sure how you are emitting messages, but you should be able to get fastavro to work. For example, since it can serialize a single record to any file-like object (via its schemaless writer, which skips the container-file header), you can retrieve the bytes directly:

    from io import BytesIO
    
    from fastavro import parse_schema, schemaless_writer
    
    # A sample schema.
    schema = parse_schema({
      'name': 'Person',
      'type': 'record',
      'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'}
      ]
    })
    
    record = {'name': 'Ann', 'age': 23}  # Corresponding record.
    buf = BytesIO()  # Target buffer (any file-like object would work here).
    schemaless_writer(buf, schema, record)  # Serialize the record into the buffer.
    message = buf.getvalue()  # The raw bytes of your message.
    

    If you'd like to check that it worked:

    from fastavro import schemaless_reader
    
    buf.seek(0)
    print(schemaless_reader(buf, schema))  # {'name': 'Ann', 'age': 23}
    

    If your messages have headers, footers, etc., you would just write them to buf as appropriate.
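    For instance, a common framing convention is a fixed-size length prefix in front of the Avro body; a minimal sketch (the 4-byte big-endian prefix is just one possible choice, not something Avro mandates):

    import struct

    payload = buf.getvalue()
    framed = struct.pack('>I', len(payload)) + payload  # 4-byte length header + Avro body.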

    Finally, about the size discrepancy, my suspicion would be that a bunch of redundant information gets included (maybe the schema?).
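    One way to test that suspicion is to serialize the same records both ways and compare sizes; here is a sketch (with a made-up schema and records) contrasting fastavro's container writer, which stores the schema once in the file header, with per-record schemaless messages:

    from io import BytesIO

    from fastavro import parse_schema, schemaless_writer, writer

    schema = parse_schema({
      'name': 'Person',
      'type': 'record',
      'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'}
      ]
    })
    records = [{'name': 'Ann', 'age': 23}] * 1000

    container = BytesIO()
    writer(container, schema, records)  # Schema written once, in the header.

    messages = BytesIO()
    for record in records:
        schemaless_writer(messages, schema, record)  # No header, no embedded schema.

    print(len(container.getvalue()), len(messages.getvalue()))

    If each message instead embeds the full schema (or a whole container-file header), the per-message total balloons quickly, which would be consistent with the kind of gap described above.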