Tags: java, hashmap, java-6

Is it better to specify HashMap's initial capacity when it is not a power of 2 than to not specify at all?


Suppose I know the exact number of key-value pairs that will go into the HashMap, and I know that it is not a power of 2. In that case, should I specify the initial capacity or not? I could round up to the nearest power of 2 and specify that, but I'd still like to know which is the better thing to do when I don't want to calculate the nearest power of 2.

Thanks!


Solution

  • You should consider the initial capacity a hint to the HashMap of approximately how much data to expect. By providing a reasonable initial capacity, you minimize the number of times the map has to be rebuilt as it grows. If, for instance, you know you are going to insert a million records, constructing the map with an initial capacity of 1,000,000 ensures at construction time that enough buckets are allocated to handle that many inserts. Without that headroom, some future map.put() call will have to perform a large O(n) rebuild of the table in order to resize it.

    Thinking of this initial capacity as a hint, rather than an instruction you expect HashMap to follow, may help you see that the optimization you're describing is unnecessary. HashMap is designed to behave well in all normal circumstances, and so while providing an initial capacity can help marginally, it's generally not going to have huge impacts on your code unless you are building many new large maps all the time. In such a case specifying the capacity would avoid the intermediate table resizing, but that's all.
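    To make the sizing concrete, here is a short Java sketch. The helper capacityFor is hypothetical (it is not part of the JDK, though Guava's Maps.newHashMapWithExpectedSize does something similar): since a HashMap resizes once its size exceeds capacity × load factor (0.75 by default), you have to request somewhat more capacity than the expected entry count to avoid any rebuild:

```java
import java.util.HashMap;
import java.util.Map;

public class Sizing {
    // Hypothetical helper (not part of the JDK; Guava's
    // Maps.newHashMapWithExpectedSize is similar): pick an initial
    // capacity large enough that `expected` entries never exceed the
    // resize threshold of capacity * 0.75 (the default load factor).
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    public static void main(String[] args) {
        int expected = 1_000_000;
        // One up-front allocation; no O(n) rebuilds during the puts below.
        Map<Integer, Integer> map = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            map.put(i, i * 2);
        }
        System.out.println(map.size()); // 1000000
    }
}
```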

    As documented, you could introduce some unnecessary slowdowns if you specified too large an initial capacity:

    Iteration over collection views requires time proportional to the "capacity" of the HashMap instance

    However, in practice, the wasted memory of allocating such a large map would likely cause problems for you sooner than the slightly slower iteration would.

    Be sure you read Why does HashMap require that the initial capacity be a power of two? as well.
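    For reference, the power-of-two rounding that the linked question discusses looks roughly like this in the Java 8 sources (Java 7 reaches the same result with a simple doubling loop); this is a paraphrase for illustration, not the verbatim JDK code:

```java
public class TableSize {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Sketch of the rounding HashMap performs internally (Java 8 calls
    // this tableSizeFor): smear the highest set bit of cap-1 into every
    // lower bit, then add one to get the next power of two >= cap.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(7));  // 8
        System.out.println(tableSizeFor(9));  // 16
    }
}
```

    This matches the REPL observations below: asking for capacity 7 yields an 8-slot table, and asking for 9 yields 16.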


    One thing you might consider is switching to Guava's ImmutableMap implementation; if you know in advance the contents of your map, and you don't expect to change them, immutable collections will prove easier to work with and use less memory than their mutable counterparts.
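    If pulling in Guava is not an option, the standard library's Collections.unmodifiableMap gives you the same "can't accidentally mutate" guarantee, though unlike ImmutableMap it is only a view and does not make a compact defensive copy. A minimal stdlib-only sketch (the config keys here are made-up sample data):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ReadOnlyConfig {
    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("host", "example.com"); // hypothetical sample data
        m.put("port", "8080");

        // Wrap the map in an unmodifiable view. Guava's
        // ImmutableMap.copyOf(m) goes further: it makes a defensive,
        // memory-compact copy rather than a live view over `m`.
        Map<String, String> config = Collections.unmodifiableMap(m);

        try {
            config.put("user", "admin"); // any write attempt fails
        } catch (UnsupportedOperationException e) {
            System.out.println("config is read-only");
        }
    }
}
```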


    Here are some quick inspections I did using Scala's REPL (and some personal utility functions) to inspect what's going on inside HashMap (Java 1.7):

    // Initialize with capacity=7
    scala> new HashMap[String,String](7)
    res0: java.util.HashMap[String,String] = {}
    
    scala> getPrivate(res0, "table").length
    res1: Int = 8
    
    scala> ... put 7 values
    
    // Still internally using the same array
    scala> getPrivate(res0, "table").length
    res9: Int = 8
    
    // Specifying capacity 9 allocates a 16-length array
    scala> getPrivate(new HashMap[String,String](9), "table").length
    res10: Int = 16
    
    // Copying our first map into a new map interestingly
    // also allocates the default 16 slots, rather than 8
    scala> getPrivate(new HashMap[String,String](res0), "table").length
    res11: Int = 16
    
    scala> ... put 10 more values in our map
    
    scala> getPrivate(res0,"table").length
    res22: Int = 32
    
    // Copying again immediately jumps to 32 capacity
    scala> getPrivate(new HashMap[String,String](res0),"table").length
    res23: Int = 32