java serialization enumset

EnumSet serialization


I just lost a couple of hours debugging my app, and I think I may have stumbled upon (yet another o_O) Java bug... I hope it is not one, because that would be sad :(

I'm doing the following:

  1. Creating an EnumSet mask with some flags
  2. Serializing it (with ObjectOutputStream.writeObject(mask))
  3. Clearing and setting some other flags in the mask
  4. Serializing it again

Expected result: the second serialized object is different from the first one (reflects the changes in the instance)

Obtained result: the second serialized object is the exact copy of the first one

The code:

enum MyEnum {
    ONE, TWO
}

@Test
public void testEnumSetSerialize() throws Exception {           
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream stream = new ObjectOutputStream(bos);

    EnumSet<MyEnum> mask = EnumSet.noneOf(MyEnum.class);
    mask.add(MyEnum.ONE);
    mask.add(MyEnum.TWO);
    System.out.println("First serialization: " + mask);
    stream.writeObject(mask);

    mask.clear();
    System.out.println("Second serialization: " + mask);
    stream.writeObject(mask);
    stream.close();

    ObjectInputStream istream = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));

    System.out.println("First deserialized " + istream.readObject());
    System.out.println("Second deserialized " + istream.readObject());
}

It prints:

First serialization: [ONE, TWO]
Second serialization: []
First deserialized [ONE, TWO]
Second deserialized [ONE, TWO]  <<<<<< Expecting [] here!!!!

Am I using EnumSet incorrectly? Do I have to create a new instance each time instead of clearing it?

Thanks for your input!

**** UPDATE ****

My initial idea was to use an EnumSet as a mask indicating which fields are present or absent in the message that follows, as a sort of bandwidth and CPU optimization. It was very wrong!!! An EnumSet takes ages to serialize, and each instance takes about 30 (!!!) bytes in the stream. So much for the space economy :)

In a nutshell, while ObjectOutputStream is very fast for primitive types (as I figured out already in a small test here: https://stackoverflow.com/a/33753694), it is painfully slow and inefficient with objects, especially small ones.

So I worked around it by making my own EnumSet backed by an int, and serializing/deserializing the int directly (not the object).

static class MyEnumSet<T extends Enum<T>> {
    private int mask = 0;

    @Override
    public boolean equals(Object o) {
        if (o == null || getClass() != o.getClass()) return false;
        return mask == ((MyEnumSet<?>) o).mask;
    }

    @Override
    public int hashCode() {
        return mask;
    }

    private MyEnumSet(int mask) {
        this.mask = mask;
    }

    public static <T extends Enum<T>> MyEnumSet<T> noneOf(Class<T> clz) {
        return new MyEnumSet<T>(0);
    }

    public static <T extends Enum<T>> MyEnumSet<T> fromMask(Class<T> clz, int mask) {
        return new MyEnumSet<T>(mask);
    }

    public int mask() {
        return mask;
    }

    public MyEnumSet<T> add(T flag) {
        mask = mask | (1 << flag.ordinal());
        return this;
    }

    public void clear() {
        mask = 0;
    }
}

private final int N = 1000000;

@Test
public void testSerializeMyEnumSet() throws Exception {

    ByteArrayOutputStream bos = new ByteArrayOutputStream(N * 100);
    ObjectOutputStream out = new ObjectOutputStream(bos);

    List<MyEnumSet<TestEnum>> masks = Lists.newArrayList();

    Random r = new Random(132477584521L);
    for (int i = 0; i < N; i++) {
        MyEnumSet<TestEnum> mask = MyEnumSet.noneOf(TestEnum.class);
        for (TestEnum f : TestEnum.values()) {
            if (r.nextBoolean()) {
                mask.add(f);
            }
        }
        masks.add(mask);
    }

    logger.info("Serializing " + N + " myEnumSets");
    long tic = TicToc.tic();
    for (MyEnumSet<TestEnum> mask : masks) {
        out.writeInt(mask.mask());
    }
    TicToc.toc(tic);
    out.close();
    logger.info("Size: " + bos.size() + " (" + (bos.size() / N) + "b per object)");

    logger.info("Deserializing " + N + " myEnumSets");
    MyEnumSet<TestEnum>[] deserialized = new MyEnumSet[masks.size()];

    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
    tic = TicToc.tic();
    for (int i = 0; i < deserialized.length; i++) {
        deserialized[i] = MyEnumSet.fromMask(TestEnum.class, in.readInt());
    }
    TicToc.toc(tic);

    Assert.assertArrayEquals(masks.toArray(), deserialized);

}

Going by the timings below, it's about 130x faster to serialize (2.506 s vs 0.019 s) and roughly 23x faster to deserialize (0.489 s vs 0.021 s):

MyEnumSets:

17/12/15 11:59:31 INFO - Serializing 1000000 myEnumSets
17/12/15 11:59:31 INFO - Elapsed time is 0.019 s
17/12/15 11:59:31 INFO - Size: 4019539 (4b per object)
17/12/15 11:59:31 INFO - Deserializing 1000000 myEnumSets
17/12/15 11:59:31 INFO - Elapsed time is 0.021 s

Regular EnumSets:

17/12/15 11:59:48 INFO - Serializing 1000000 enumSets
17/12/15 11:59:51 INFO - Elapsed time is 2.506 s
17/12/15 11:59:51 INFO - Size: 30691553 (30b per object)
17/12/15 11:59:51 INFO - Deserializing 1000000 enumSets
17/12/15 11:59:51 INFO - Elapsed time is 0.489 s

It's not as safe though. For example, it will not work for enums with more than 32 entries: the mask is a 32-bit int, and 1 << ordinal silently wraps around for ordinals of 32 and above.

How can I ensure that the enum has at most 32 values when a MyEnumSet is created?


Solution

  • The first time an object is written, ObjectOutputStream serializes the object itself; every later write of the same instance only emits a reference back to it. If you modify an object and send it again, all ObjectOutputStream does is send the reference to that object again, not its new state.

    This has a few consequences:

    • if you modify an object after writing it, the receiver won't see those modifications
    • the stream has to retain a reference to every object ever sent, on both ends. This can be a subtle memory leak.
    • the reason this is done is so you can serialize graphs of objects instead of just trees, e.g. A points to B which points back to A. You only want to send A once.

    The way to resolve this and get some memory back is to call reset() after writing each complete object, e.g. just before calling flush(). The Javadoc for ObjectOutputStream.reset() says:

    Reset will disregard the state of any objects already written to the stream. The state is reset to be the same as a new ObjectOutputStream. The current point in the stream is marked as reset so the corresponding ObjectInputStream will be reset at the same point. Objects previously written to the stream will not be referred to as already being in the stream. They will be written to the stream again.
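As a rough illustration of the reset() approach, here is a minimal sketch that replays the test from the question with and without a reset() between the two writes. The class name ResetDemo and the helper roundTrip are made up for this example; only the java.io and java.util calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.EnumSet;

// Hypothetical demo class, not part of the original question.
class ResetDemo {
    enum MyEnum { ONE, TWO }

    // Writes the same EnumSet twice (full, then cleared) and returns
    // the two objects read back from the stream, in order.
    static Object[] roundTrip(boolean resetBetweenWrites)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            EnumSet<MyEnum> mask = EnumSet.allOf(MyEnum.class);
            out.writeObject(mask);

            mask.clear();
            if (resetBetweenWrites) {
                // Make the stream forget everything written so far, so the
                // next writeObject serializes the current state in full.
                out.reset();
            }
            out.writeObject(mask);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return new Object[] { in.readObject(), in.readObject() };
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without reset: " + roundTrip(false)[1]); // [ONE, TWO]
        System.out.println("with reset:    " + roundTrip(true)[1]);  // []
    }
}
```

Note that after reset() the class descriptors are written again as well, so resetting after every small object costs some extra bytes per write.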

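    Another approach is to use writeUnshared(), however this only applies a shallow unshared-ness to the top-level object; whatever that object references is still written with back-references. In the case of EnumSet the set itself will be a different object on the other end, however the Enum[] it wraps is still shared o_O. The Javadoc for writeUnshared() says: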

    Writes an "unshared" object to the ObjectOutputStream. This method is identical to writeObject, except that it always writes the given object as a new, unique object in the stream (as opposed to a back-reference pointing to a previously serialized instance).

    In short: no, this is not a bug, it is the expected behaviour.
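As for the follow-up question about making sure the enum fits into the int mask, one possible sketch (certainly not the only way) is to count the enum constants in the factory methods and refuse types with more than Integer.SIZE of them. The names IntEnumMask, checkFits, SmallFlags and BigFlags are invented for this example; the class only loosely mirrors the MyEnumSet from the question.

```java
// Hypothetical enums used only to exercise the check.
enum SmallFlags { A, B, C }

enum BigFlags { // 33 constants, one too many for an int mask
    V0, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10,
    V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21,
    V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32
}

// Hypothetical int-backed flag set, a sketch rather than a drop-in class.
final class IntEnumMask<T extends Enum<T>> {
    private int mask;

    private IntEnumMask(int mask) {
        this.mask = mask;
    }

    // Rejects enum types whose constants cannot all fit into 32 bits.
    private static void checkFits(Class<? extends Enum<?>> type) {
        int count = type.getEnumConstants().length;
        if (count > Integer.SIZE) {
            throw new IllegalArgumentException(type.getName() + " has " + count
                    + " constants, but an int mask holds at most " + Integer.SIZE);
        }
    }

    static <T extends Enum<T>> IntEnumMask<T> noneOf(Class<T> type) {
        checkFits(type);
        return new IntEnumMask<>(0);
    }

    static <T extends Enum<T>> IntEnumMask<T> fromMask(Class<T> type, int mask) {
        checkFits(type);
        return new IntEnumMask<>(mask);
    }

    int mask() {
        return mask;
    }

    IntEnumMask<T> add(T flag) {
        mask |= 1 << flag.ordinal();
        return this;
    }

    boolean contains(T flag) {
        return (mask & (1 << flag.ordinal())) != 0;
    }
}
```

With this, IntEnumMask.noneOf(SmallFlags.class) works as before, while IntEnumMask.noneOf(BigFlags.class) throws an IllegalArgumentException. The check runs on every construction, which may matter in a hot deserialization loop; switching the field to a long would raise the limit to 64 constants.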