Tags: python, json, numpy, apache-storm

How can I serialize a numpy array while preserving matrix dimensions?


numpy.ndarray.tostring (since renamed tobytes) doesn't seem to preserve information about matrix dimensions (see this question), requiring the user to issue a call to numpy.ndarray.reshape.
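
To make the problem concrete, here is a minimal sketch. The array `a` is an arbitrary example, and `tobytes` is used as the current name for `tostring`. The raw bytes carry neither shape nor dtype, so the reader has to supply both by hand:

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

raw = a.tobytes()  # raw element bytes only: no shape, no dtype
flat = np.frombuffer(raw, dtype=np.int64)  # dtype must be supplied by hand

print(flat.shape)  # (6,)
restored = flat.reshape(2, 3)  # shape must also be known out of band
print(np.array_equal(restored, a))  # True
```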

Is there a way to serialize a numpy array to JSON format while preserving this information?

Note: The arrays may contain ints, floats or bools. It's reasonable to expect a transposed array.

Note 2: this is being done with the intent of passing the numpy array through a Storm topology using streamparse, in case such information ends up being relevant.


Solution

  • pickle.dumps or numpy.save encode all the information needed to reconstruct an arbitrary NumPy array, even in the presence of endianness issues, non-contiguous arrays, or weird structured dtypes. Endianness issues are probably the most important; you don't want array([1]) to suddenly become array([16777216]) because you loaded your array on a big-endian machine. pickle is probably the more convenient option, though save has its own benefits, given in the npy format rationale.

    I'm giving options for serializing to JSON or a bytestring, because the original questioner needed JSON-serializable output, but most people coming here probably don't.

    The pickle way:

    import json
    import pickle
    
    import numpy
    
    a = numpy.arange(6).reshape(2, 3)  # stand-in for any NumPy array
    
    # Bytestring option
    serialized = pickle.dumps(a)
    deserialized_a = pickle.loads(serialized)
    
    # JSON option
    # latin-1 maps byte n to unicode code point n
    serialized_as_json = json.dumps(pickle.dumps(a).decode('latin-1'))
    deserialized_from_json = pickle.loads(json.loads(serialized_as_json).encode('latin-1'))
    

    numpy.save uses a binary format, and it needs to write to a file, but you can get around that with io.BytesIO:

    import io
    import json
    
    import numpy
    
    a = numpy.arange(6).reshape(2, 3)  # stand-in for any NumPy array
    memfile = io.BytesIO()
    numpy.save(memfile, a)
    
    serialized = memfile.getvalue()
    # latin-1 maps byte n to unicode code point n
    serialized_as_json = json.dumps(serialized.decode('latin-1'))
    

    And to deserialize:

    memfile = io.BytesIO()
    
    # If you're deserializing from a bytestring:
    memfile.write(serialized)
    # Or if you're deserializing from JSON:
    # memfile.write(json.loads(serialized_as_json).encode('latin-1'))
    memfile.seek(0)
    a = numpy.load(memfile)
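
As a quick check of the numpy.save route, the sketch below wraps the steps above in two helpers and round-trips a transposed array, a big-endian array and a bool array. The names `array_to_json` and `array_from_json` are illustrative only and are not part of NumPy; the test arrays are arbitrary examples.

```python
import io
import json

import numpy as np


def array_to_json(arr):
    """Serialize an array to a JSON string via the .npy format (hypothetical helper)."""
    memfile = io.BytesIO()
    np.save(memfile, arr)
    # latin-1 maps byte n to unicode code point n, so no byte is lost
    return json.dumps(memfile.getvalue().decode('latin-1'))


def array_from_json(text):
    """Inverse of array_to_json (hypothetical helper)."""
    memfile = io.BytesIO(json.loads(text).encode('latin-1'))
    return np.load(memfile)


transposed = np.arange(6, dtype=np.float64).reshape(2, 3).T  # view that is not C-contiguous
big_endian = np.array([1], dtype='>i4')                      # explicit big-endian byte order
flags = np.array([True, False])                              # bool array

for original in (transposed, big_endian, flags):
    restored = array_from_json(array_to_json(original))
    # Shape, dtype (including byte order) and values all survive the round trip
    assert restored.shape == original.shape
    assert restored.dtype == original.dtype
    assert np.array_equal(restored, original)
```

The same pattern should work for the pickle route by swapping the bodies of the two helpers for `pickle.dumps` and `pickle.loads`.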