Search code examples
javaconcurrenthashmapsocial-graph

ConcurrentHashMap implementation and limitations


I have quite a large project to accomplish and I'm running into some dead ends. I wanted to see if the great community here had any suggestions.

I have a large data set and I'm attempting to build a social graph. The data contains over 9.5 million mappings of coordinates to a Short value. For the key values in the ConcurrentHashMap I am using a String, that is the coordinates concatenated with a ',' in between.

Essentially, I'm finding the number of groups in common between users. I have an initial hashmap that is built quite easily that maps a GroupID to a Vector of AvatarID's. This part runs fine. Then, I have 12 threads which are responsible for their own set of GroupIDs and processing (adding + 1 to the count between users in each groupID), all the accessing done from the ConcurrentHashMap.

After about 8000 groups have been processed, a problem with the accessing occurs. Only one thread at a time seems active, and I'm unsure if this is because of the massive size or another factor. This is a problem as I have 300,000 groups that need to be processed total (and in a timely manner).

Is there any advice as to how I'm implementing this, and any shortcuts I can use? The read and write I believe is equally important, as I have to read a coordinate if the value exists (if not create it) and then add one to the value and write back.

I am willing to provide code as needed, I just don't know which portions will be relevant to the discussion yet.

Thanks for your time, -mojavestorm

Further explanation:

Two Implementations and their limits:

1) I have a HashMap(Integer, Vector(Integer)) preMap that contains a GroupID as key and a Vector of userIDs. The threads split up the GroupIDs between each other and using each Vector(Integer) returned, each thread stores a short value according to a coordinate (saying UserID x and UserID y belong in (short) n groups together) into a TLongShortHashMap threadMap, and each thread owns its own threadMap. The coordinates are mapped to long values. After each thread is completed, the values of corresponding keys in each of the threadMaps are added to the same key in a combinedMap, which will show how many groups UserID x and UserID y belong to together in the whole system.

The problem with this implementation is that there is high overlap between threads, so excessive short values are created. For example User 1 and User 2 belong to various groups together. Thread A and Thread B are responsible for a their own range of groups, including ones User 1 and User 2 belong to, so both Thread A and Thread B store in their copy of the threadMap a long value for coordinate (1, 2) and a short value. If excessive overlap occurs then the memory requirement can be outstanding. In my case, all 46GB of ram I allocate to Java get used up, and quite quickly too.

2) Using the same preMap in this implementation, each thread is given a range of user coordinates they are responsible for. Each thread runs, and takes each coordinate it has and iterates through preMap, checking each groupID and seeing if UserID x and UserID y belong to the vector returned from the preMap. This implementation eliminates overlap that will occur between threadMaps.

The problem with this is time. Right now the program is running at a stunning rate of 1400 years to complete. The memory used wavers around 4GB to 15GB but seems to stay 'low'. Not completely sure it will stay within the limit, however, I imagine it will. There are no improvements that are apparent to me.

Hopefully these descriptions are clear and will help give insight to my problem. Thanks.


Solution

  • I would have each thread process its own Map. This means each thread can work interdependently. Once the threads have finished you can combine all the results. (Or possibly combine the results as they complete, but this may add complexity with not much advantage)

    If you are using a short I would use at a collection like TObjectIntHashMap which is more efficient for handling primitives.


    In the simple case you have short co-ordinates public static void main(String... args) throws IOException { int length = 10 * 1000 * 1000; int[] x = new int[length]; int[] y = new int[length];

      Random rand = new Random();
      for (int i = 0; i < length; i++) {
        x[i] = rand.nextInt(10000) - rand.nextInt(10000);
        y[i] = rand.nextInt(10000) - rand.nextInt(10000);
      }
    
      countPointsWithLongIntMap(x, y);
      countPointsWithMap(x, y);
    
    }
    
    private static Map<String, Short> countPointsWithMap(int[] x, int[] y) {
      long start = System.nanoTime();
      Map<String, Short> counts = new LinkedHashMap<String, Short>();
      for (int i = 0; i < x.length; i++) {
        String key = x[i] + "," + y[i];
        Short s = counts.get(key);
        if (s == null)
          counts.put(key, (short) 1);
        else
          counts.put(key, (short) (s + 1));
      }
      long time = System.nanoTime() - start;
      System.out.printf("Took %.3f seconds to use Map<String, Short>%n", time/1e9);
    
      return counts;
    }
    
    private static TIntIntHashMap countPointsWithLongIntMap(int[] x, int[] y) {
      long start = System.nanoTime();
      TIntIntHashMap counts = new TIntIntHashMap();
      for (int i = 0; i < x.length; i++) {
        int key =  (x[i] << 16) | (y[i] & 0xFFFF);
        counts.adjustOrPutValue(key, 1, 1);
      }
      long time = System.nanoTime() - start;
      System.out.printf("Took %.3f seconds to use TIntIntHashMap%n", time/1e9);
      return counts;
    }
    

    prints

    Took 1.592 seconds to use TIntIntHashMap
    Took 4.889 seconds to use Map<String, Short>
    

    If you have double co-ordinates, you need to use a two tier map.

    public static void main(String... args) throws IOException {
      int length = 10 * 1000 * 1000;
      double[] x = new double[length];
      double[] y = new double[length];
    
      Random rand = new Random();
      for (int i = 0; i < length; i++) {
        x[i] = (rand.nextInt(10000) - rand.nextInt(10000)) / 1e4;
        y[i] = (rand.nextInt(10000) - rand.nextInt(10000)) / 1e4;
      }
    
      countPointsWithLongIntMap(x, y);
      countPointsWithMap(x, y);
    
    }
    
    private static Map<String, Short> countPointsWithMap(double[] x, double[] y) {
      long start = System.nanoTime();
      Map<String, Short> counts = new LinkedHashMap<String, Short>();
      for (int i = 0; i < x.length; i++) {
        String key = x[i] + "," + y[i];
        Short s = counts.get(key);
        if (s == null)
          counts.put(key, (short) 1);
        else
          counts.put(key, (short) (s + 1));
      }
      long time = System.nanoTime() - start;
      System.out.printf("Took %.3f seconds to use Map<String, Short>%n", time / 1e9);
    
      return counts;
    }
    
    private static TDoubleObjectHashMap<TDoubleIntHashMap> countPointsWithLongIntMap(double[] x, double[] y) {
      long start = System.nanoTime();
      TDoubleObjectHashMap<TDoubleIntHashMap> counts = new TDoubleObjectHashMap<TDoubleIntHashMap>();
      for (int i = 0; i < x.length; i++) {
        TDoubleIntHashMap map = counts.get(x[i]);
        if (map == null)
          counts.put(x[i], map = new TDoubleIntHashMap());
        map.adjustOrPutValue(y[i], 1, 1);
      }
      long time = System.nanoTime() - start;
      System.out.printf("Took %.3f seconds to use TDoubleObjectHashMap<TDoubleIntHashMap>%n", time / 1e9);
      return counts;
    }
    

    prints

    Took 3.023 seconds to use TDoubleObjectHashMap<TDoubleIntHashMap>
    Took 7.970 seconds to use Map<String, Short>