java, garbage-collection, jvm, file-processing

What is the overhead of creating Java objects from lines of a CSV file?


The code reads the lines of a CSV file like this:

Stream<String> strings = Files.lines(Paths.get(filePath));

then it maps each line in the mapper:

List<String> tokens = Arrays.asList(line.split(",")); return new UserModel(tokens.get(0), tokens.get(1), tokens.get(2), tokens.get(3));

and finally collects it:

Set<UserModel> current = currentStream.collect(toSet());

The file size is ~500MB. I connected to the server using JConsole and saw that the heap size grew from 200MB to 1.8GB while processing.

I can't understand where this ~3x memory usage came from - I expected a spike of around 500MB or so.

My first impression was that it's because there is no throttling and the garbage collector simply doesn't have enough time for cleanup. But I've tried using a Guava RateLimiter to give the garbage collector time to do its job, and the result is the same.
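
For reference, a minimal sketch of what such a throttling attempt could look like, assuming Guava's RateLimiter is acquired inside the mapper (the UserModel record, the field names and the rate value here are placeholders, not the actual code):

import com.google.common.util.concurrent.RateLimiter;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ThrottledCsvRead {

    // Hypothetical model with four String fields, as implied by the question.
    record UserModel(String a, String b, String c, String d) {}

    public static void main(String[] args) throws IOException {
        String filePath = args[0];
        RateLimiter limiter = RateLimiter.create(50_000); // at most ~50k lines per second

        Set<UserModel> current;
        try (Stream<String> lines = Files.lines(Paths.get(filePath))) {
            current = lines.map(line -> {
                limiter.acquire(); // block briefly to give the GC some breathing room
                List<String> tokens = Arrays.asList(line.split(","));
                return new UserModel(tokens.get(0), tokens.get(1), tokens.get(2), tokens.get(3));
            }).collect(Collectors.toSet());
        }
        System.out.println("Parsed " + current.size() + " rows");
    }
}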


Solution

  • Tom Hawtin made good points - I just want to expand on them and provide a bit more detail.

    Java Strings take at least 40 bytes of memory (that's for an empty string) due to the Java object header overhead (see below) and the internal byte array. That means the minimal size for a non-empty string (1 or more characters) is 48 bytes.
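
    These numbers can be checked directly with the JOL (Java Object Layout) tool. A minimal sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath (exact sizes depend on the JVM version and flags such as compressed oops):

    import org.openjdk.jol.info.ClassLayout;
    import org.openjdk.jol.info.GraphLayout;

    public class StringFootprint {
        public static void main(String[] args) {
            // Shallow layout of the String instance itself: header + fields.
            System.out.println(ClassLayout.parseInstance("").toPrintable());
            // Deep size: the String instance plus its internal byte[] (~40 bytes for "").
            System.out.println(GraphLayout.parseInstance("").totalSize());
            // A one-character ASCII string typically comes out at ~48 bytes.
            System.out.println(GraphLayout.parseInstance("a").totalSize());
        }
    }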

    Nowadays, the JVM uses Compact Strings, which means that strings containing only Latin-1 characters (which includes ASCII) occupy 1 byte per character - before that, it was a minimum of 2 bytes per char. That means if your file contains characters outside that range, memory usage can grow significantly.
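
    The effect of Compact Strings is easy to observe with the same JOL approach - the sizes hinted at in the comments are what one would typically see on a 64-bit JVM with compressed oops, not guaranteed values:

    import org.openjdk.jol.info.GraphLayout;

    public class CompactStringsDemo {
        public static void main(String[] args) {
            String latin1 = "aaaaaaaa";        // 8 Latin-1 chars -> byte[8], ~1 byte per char
            String utf16 = "aaaaaaa\u20AC";    // contains the euro sign -> stored as UTF-16, ~2 bytes per char
            System.out.println(GraphLayout.parseInstance(latin1).totalSize());
            System.out.println(GraphLayout.parseInstance(utf16).totalSize());
            // Compact Strings can be disabled entirely with -XX:-CompactStrings (JDK 9+).
        }
    }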

    Streams also have more overhead than plain iteration over arrays/lists (see: Java 8 stream objects significant memory usage).
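
    If you want to rule out the stream machinery itself, the baseline to compare against is a plain loop over a BufferedReader - a sketch, assuming UserModel is the class from the question:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class PlainLoopBaseline {
        static Set<UserModel> read(String filePath) throws IOException {
            Set<UserModel> result = new HashSet<>();
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(filePath))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] tokens = line.split(",");
                    result.add(new UserModel(tokens[0], tokens[1], tokens[2], tokens[3]));
                }
            }
            return result;
        }
    }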

    I guess your UserModel object adds at least 32 bytes of overhead on top of each line (see the sketch after this list), because:

    • the minimum size of a Java object is 16 bytes, where the first 12 bytes are the JVM "overhead": the object's class reference (4 bytes when compressed oops are used) plus the mark word (used for the identity hash code, biased locking and the garbage collectors)
    • the next 4 bytes are used by the reference to the first "token"
    • the next 12 bytes are used by the 3 references to the second, third and fourth "token"
    • the last 4 bytes are required due to Java object alignment at 8-byte boundaries (on 64-bit architectures)
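
    To make the arithmetic concrete, here is what a minimal UserModel with four String fields would look like, with the expected shallow layout in the comments (assuming a 64-bit JVM with compressed oops; JOL's ClassLayout would print the exact numbers for your setup):

    // Hypothetical UserModel matching the question's constructor: four String fields.
    public class UserModel {
        private final String a;   // field names are placeholders
        private final String b;
        private final String c;
        private final String d;

        public UserModel(String a, String b, String c, String d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }
    }

    // Expected shallow size per instance:
    //   12 bytes  object header (8-byte mark word + 4-byte class pointer)
    // + 16 bytes  4 object references x 4 bytes each
    // = 28 bytes, padded to 32 bytes (8-byte alignment)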

    That being said, it's not clear whether you even use all the data that you read from the file - you parse 4 tokens from each line, but maybe there are more? Moreover, you didn't mention how exactly the heap size "grew" - whether it was the committed size or the used size of the heap. The used portion is what is actually occupied by objects (live ones plus garbage that hasn't been collected yet); the committed portion is what the JVM has reserved from the OS at some point and is typically larger than what the live objects actually need, so used <= committed.
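
    Both values can be read programmatically (the same numbers JConsole displays) via the standard MemoryMXBean:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapNumbers {
        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.println("used      = " + heap.getUsed());      // live objects + not-yet-collected garbage
            System.out.println("committed = " + heap.getCommitted()); // memory currently reserved from the OS
            System.out.println("max       = " + heap.getMax());       // -Xmx limit, or -1 if undefined
        }
    }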

    You'd have to take a heap snapshot to find out how much memory the resulting set of UserModel objects actually occupies, and it would be interesting to compare that to the size of the file.
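
    One way to do that comparison without a full heap dump is JOL's GraphLayout, which walks the object graph and sums up its footprint - a sketch, assuming jol-core is on the classpath and userModels is the set produced by the collector:

    import org.openjdk.jol.info.GraphLayout;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Set;

    public class FootprintVsFile {
        static void compare(Set<?> userModels, String filePath) throws IOException {
            // Deep size: the Set, its internal table, the UserModel instances, the Strings and their byte[]s.
            long deepSize = GraphLayout.parseInstance(userModels).totalSize();
            long fileSize = Files.size(Paths.get(filePath));
            System.out.printf("in memory: %d bytes, on disk: %d bytes (x%.1f)%n",
                    deepSize, fileSize, (double) deepSize / fileSize);
        }
    }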