
Why is it that, the more '1' bits in my Key, the longer it takes to place in the HashMap?


I'm doing a project for a class which focuses on storing a huge matrix with mostly 0 values in memory and performing some matrix math on it. My first thought was to use a HashMap to store the matrix elements, and only store the elements which are non-zero, in order to avoid using huge quantities of memory.

I wanted to make a key for the HashMap which would represent both the row and column number of the element in a way that, when I accessed that entry in the map, I could re-extract both values. I don't know Java as well as C#. In C# I would make a struct with Row and Column members, but in Java I quickly realized there are no user-defined value types. With a deadline looming I went with a safe bet and made the key a long. I stored the row (a 32-bit int) in the upper 32 bits and the column in the lower 32 bits using some very simple bit shifting. [EDIT: I'd also like to note that my HashMap is initialized with a specific initial size which exactly represents the number of values I store in it, which is never exceeded.]

[Side note: the reason I want to be able to extract the row/column data again is to greatly increase the efficiency of matrix multiplication, from O(n^2) to O(n), and a smaller n to boot]

What I noticed after implementing this structure is that it takes a whopping 7 seconds to read a 23426 x 23426 matrix from a text file in which only non-zero elements are given, but it only takes 2 seconds to calculate the eigenvalues we are required to give! After selectively commenting out methods, I have concluded that the bulk of this 7-second timespan is spent storing my values in the HashMap.

public void Set(double value, int row, int column) {
    //assemble the long key, placing row and column in adjacent sets of bits
    long key = (long)row << SIZE_BIT_MAX; //(SIZE_BIT_MAX is 32)
    key += column;
    elements.put(key, value);
}

That is the code for setting a value. If I use this method instead:

public void Set(double value, int row, int column) {
    //create a distinct but smaller key (around 32 bits max)
    long key = (long)(row * matrixSize) + column;
    elements.put(key, value);
}

The reading only takes 2 seconds. Both of these versions of the key are distinct for every element, both are long type, and the actual code to create either of them is minimal in complexity. It's the elements.put(key, value) which makes the difference between 7 seconds and 2.

My question is, why? The difference I see between these key versions is that the first one has bits set to 1 throughout and more frequently, while the second has all of its highest 32 bits set to 0. Am I chasing a red herring, or is this fairly dramatic difference in performance the result of something internal in the HashMap.put method?


Solution

  • Take a look at how Long implements the hashCode() method (at least in OpenJDK 7):

    public int hashCode() {
        return (int)(value ^ (value >>> 32));
    }
    

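    This means that your 64-bit key gets folded back into 32 bits by XOR-ing its upper and lower halves. With the row in the upper half and the column in the lower half, the hash code is simply row ^ column. Both values are below 23426, which is less than 2^15 = 32768, so every hash code falls in the range 0 to 32767, and pairs such as (1, 2) and (2, 1) hash identically. The result is a lot of collisions, which forces the HashMap to spend extra time walking through the many entries that share a bucket. Your second method avoids that problem because every key’s hash code is unique: there are only 23426 x 23426 = 548777476 possible keys, which fits well into 32 bits, so the upper half of the long is always zero and the hash code equals the key itself.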

    So, the reason is your key selection, not the number of set bits.
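To make the collision effect concrete, here is a small sketch that counts how many distinct hash codes each key scheme from the question produces over a full n x n grid. The class and method names (HashCollisionDemo, packedKey, linearKey, distinctHashes) are made up for this illustration, and the grid is kept small so it runs quickly; it is not a benchmark of HashMap itself.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of the asker's code.
final class HashCollisionDemo {

    // Row in the upper 32 bits, column in the lower 32 bits (the slow scheme).
    static long packedKey(int row, int column) {
        return ((long) row << 32) + column;
    }

    // Row-major index (the fast scheme); the multiplication is done in long.
    static long linearKey(int row, int column, int matrixSize) {
        return (long) row * matrixSize + column;
    }

    // Counts the distinct Long hash codes produced for every cell of an n x n grid.
    static int distinctHashes(int n, boolean packed) {
        Set<Integer> hashes = new HashSet<Integer>();
        for (int row = 0; row < n; row++) {
            for (int column = 0; column < n; column++) {
                long key = packed ? packedKey(row, column) : linearKey(row, column, n);
                hashes.add(Long.valueOf(key).hashCode());
            }
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        int n = 256; // 65536 cells in total
        System.out.println("packed keys: " + distinctHashes(n, true) + " distinct hash codes");
        System.out.println("linear keys: " + distinctHashes(n, false) + " distinct hash codes");
    }
}
```

With n = 256 the packed scheme can only produce the values of row ^ column, which all lie in 0..255, while the linear scheme gives every one of the 65536 cells its own hash code.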

    However, what exactly do you mean by “user value types”?

    public class MatrixKey {
        private final int row;
        private final int column;
        public MatrixKey(int row, int column) {
            this.row = row;
            this.column = column;
        }
        public int getRow() { return row; }
        public int getColumn() { return column; }
    }
    

    This class can make a perfectly good key for a Map in Java once you implement hashCode() and equals(). Just make sure that you don’t implement its hashCode method the way Long does. :)
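For reference, one possible way to finish that class is sketched below. The equals() method is the standard field-by-field comparison; the hashCode() multiplier (0x9E3779B9, a large odd constant) is only an assumption chosen for illustration so that rows and columns are not simply XOR-ed together, and other mixing functions would work too. MatrixKeyDemo is a hypothetical usage class.

```java
import java.util.HashMap;
import java.util.Map;

// Immutable (row, column) pair usable as a HashMap key.
final class MatrixKey {
    private final int row;
    private final int column;

    MatrixKey(int row, int column) {
        this.row = row;
        this.column = column;
    }

    int getRow() { return row; }
    int getColumn() { return column; }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MatrixKey)) return false;
        MatrixKey that = (MatrixKey) other;
        return row == that.row && column == that.column;
    }

    @Override
    public int hashCode() {
        // Assumed mixing step: multiply the row by a large odd constant so that
        // (row, column) and (column, row) do not end up with the same hash,
        // unlike row ^ column. Integer overflow simply wraps, which is fine here.
        return row * 0x9E3779B9 + column;
    }
}

// Hypothetical usage example.
final class MatrixKeyDemo {
    public static void main(String[] args) {
        Map<MatrixKey, Double> elements = new HashMap<MatrixKey, Double>();
        elements.put(new MatrixKey(3, 7), 2.5);
        // A new key with the same row and column finds the stored value.
        System.out.println(elements.get(new MatrixKey(3, 7))); // prints 2.5
    }
}
```

Note that every put and get with such a key allocates a small object, so whether this beats a well-chosen long key is something to measure rather than assume.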