Tags: java, collections, guava, space-efficiency

List of maps: efficient implementations


I have code that creates and uses a collection such as:

List<Map<String, Object>> tableData;

This list of maps gets populated with n maps, each representing one row in a database. Each row is represented as a map from field name to the corresponding value (the value type is irrelevant in this case). Some of the fields might be missing. The number of fields, m, is always much smaller than the number of rows (n ≈ 10000 × m). I need to reuse the same collection a few times to read through all the rows, so I can't just use some kind of lazy iterator.
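
For concreteness, a single row might be populated like this (the field names here are made up for illustration):

Map<String, Object> row = new HashMap<>();
row.put("id", 42L);          // field name -> value
row.put("name", "Alice");
// the "email" field is simply absent for this row
tableData.add(row);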

Is there an efficient data structure to store this? Guava provides a Table collection but that doesn't seem to fit the bill. I am thinking about creating an interface such as:

interface TableData {
  int size();
  Map<String, Object> get(int i);
  // ... (iterators, etc.)
}

I would then create an implementation backed by a single Map<String, List<Object>>, so that I only instantiate m lists instead of n maps and build the row maps on the fly only when needed. But I was wondering whether there is a more general-purpose data structure out there.
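
Something along these lines is what I have in mind; the class name and the addRow method are only illustrative, and the row maps are materialised as plain HashMaps rather than as lazy views:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Column-oriented storage: one list per field instead of one map per row.
class ColumnarTableData implements TableData {
  private final Map<String, List<Object>> columns = new HashMap<>();
  private int size = 0;

  void addRow(Map<String, Object> row) {
    for (Map.Entry<String, Object> e : row.entrySet()) {
      columns.computeIfAbsent(e.getKey(), key -> {
        // pad a newly seen column so its indices line up with earlier rows
        List<Object> column = new ArrayList<>();
        for (int r = 0; r < size; r++) {
          column.add(null);
        }
        return column;
      }).add(e.getValue());
    }
    // pad the columns this row does not populate
    for (List<Object> column : columns.values()) {
      if (column.size() == size) {
        column.add(null);
      }
    }
    size++;
  }

  @Override
  public int size() {
    return size;
  }

  @Override
  public Map<String, Object> get(int i) {
    // build the row map on the fly, skipping missing (null) fields
    Map<String, Object> row = new HashMap<>();
    for (Map.Entry<String, List<Object>> e : columns.entrySet()) {
      Object value = e.getValue().get(i);
      if (value != null) {
        row.put(e.getKey(), value);
      }
    }
    return row;
  }
}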

Thanks


Solution

  • I ran some tests (not conclusive by any means, but quite indicative) to establish the memory footprint of different List<Map<String, Object>> implementations. The baseline is a plain ArrayList whose elements are instances of Guava's ImmutableMap.

    The implementations I compared to are the following:

    1. Implementation based on a Map<String,List<Object>> using a HashMap and ArrayLists;
    2. Implementation based on a List<Object[]> using an ArrayList (see the sketch after this list);
    3. Guava's HashBasedTable<Integer,String,Object>;
    4. Guava's ArrayTable<Integer,String,Object>;
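
    To make implementation 2 concrete, a minimal sketch could look like the following; the shared column-index map and the null convention for missing fields are illustrative choices rather than the exact code behind the numbers below:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One Object[] per row, plus a single shared column -> position index.
    class ArrayRowTableData {
      private final Map<String, Integer> columnIndex = new HashMap<>();
      private final List<Object[]> rows = new ArrayList<>();

      // assumes all column names are known up front
      ArrayRowTableData(List<String> columnNames) {
        for (String name : columnNames) {
          columnIndex.put(name, columnIndex.size());
        }
      }

      void addRow(Map<String, Object> row) {
        Object[] values = new Object[columnIndex.size()]; // missing fields stay null
        for (Map.Entry<String, Object> e : row.entrySet()) {
          values[columnIndex.get(e.getKey())] = e.getValue();
        }
        rows.add(values);
      }

      Map<String, Object> get(int i) {
        // rebuild a row map on demand, skipping missing (null) fields
        Map<String, Object> row = new HashMap<>();
        for (Map.Entry<String, Integer> e : columnIndex.entrySet()) {
          Object value = rows.get(i)[e.getValue()];
          if (value != null) {
            row.put(e.getKey(), value);
          }
        }
        return row;
      }

      int size() {
        return rows.size();
      }
    }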

    My test consisted of generating n random rows, each having m columns and a "fill factor" of k, where the fill factor is the probability that a row contains values for all the columns. For simplicity, the values are random strings of length l generated using Apache Commons RandomStringUtils.
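
    The rows for such a test can be generated roughly as follows; this helper is only a sketch, and it treats the fill factor as the probability that any given field of a row is present:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    import org.apache.commons.lang3.RandomStringUtils;

    class RowGenerator {
      private static final Random RANDOM = new Random();

      // n rows, m columns, values of length l, each value present with probability k
      static List<Map<String, Object>> randomRows(int n, int m, int l, double k) {
        List<Map<String, Object>> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
          Map<String, Object> row = new HashMap<>();
          for (int j = 0; j < m; j++) {
            if (RANDOM.nextDouble() < k) {
              row.put("col" + j, RandomStringUtils.randomAlphanumeric(l));
            }
          }
          rows.add(row);
        }
        return rows;
      }
    }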

    But let's get to the results. With n = 200000, m = 50, l = 10 and k in {1.0, 0.75, 0.5}, I got the following memory footprints as a percentage of the baseline:

    Impl. | k = 1.0  | k = 0.75 | k = 0.5  |
    ------|----------|----------|----------|
    1.    |     71 % |     71 % |     71 % |
    2.    |     71 % |     72 % |     73 % |
    3.    |    111 % |    107 % |    109 % |
    4.    |     71 % |     73 % |     76 % |

    I tried reducing n to 20000 with about the same results.

    I found the results above quite interesting. First of all, it looks like there isn't much room for improvement beyond roughly 70 % of the baseline. Second, I was pleasantly surprised to find that Guava's ArrayTable is as memory-efficient as the two implementations proposed in the question. I'll keep digging, but for now I'm leaning towards solution 1.

    Thanks