Search code examples
javaconcurrencyjava.util.concurrent

Thread safe data structure to check for existence and write if not


I want to parse a long list of strings with duplicates and save each unique string to an array exactly once. In a multi threaded approach, threads will check a shared data structure for existence and write if it does not exist.

I forget what data structure is appropriate for this. Anything from Java.util is fine and so are high performance third party libraries.


Solution

  • The collection classes in the java.util package are not thread-safe in order to provide maximum performance in single-threaded applications. (Vector and Hashtable being exceptions)

    There are a few ways to achieve the thread safety you are looking for.

    Sychronized Wrapper Set<String> safeSet = Collections.synchronizedSet(new HashSet<>());

    This will wrap all the calls to the underlying set in in a synchronized block, locking on the object. However, That means when a thread is iterating over elements in a collection, all other collection’s methods block, causing other threads having to wait.

    java.util.concurrent Package

    Java 5 introduced concurrent collections that provide much better performance than synchronized wrappers.

    There are different flavors: copy-on-write, Compare-And-Swap, and concurrent collections.

    The concurrent collections use a special Lock that is more flexible than synchronization.

    So for what you are doing, a HashSet is probably a good match, if it was single threaded. In the concurrent package you could use ConcurrentHashMap.

    It would look like this:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    
    ...
    
     private static final Object PRESENT = new Object();
     Map<String, Object> seenStrings = new ConcurrentHashMap<>();
    
    
    
    for ( String aString : stringList ) {
        if ( seenStrings.containsKey(aString) ) {
            // Already there
        } else {
            // Not seen yet
            seenStrings.put(aString, PRESENT);
        }
    }
    

    Update Andy's comment is a good one, I wasn't sure what you wanted to do if you had already seen an item or if you haven't.

    You could do this to ensure the check and insert are executed atomically

    if (seenStrings.put(aString, PRESENT) == null) {
           // Not seen yet
    } 
    

    Update In Java 8+, you can create a set backed by the specified map. Effectively a ConcurrentHashSet.

    Set<String> seenStrings = Collections.newSetFromMap(new ConcurrentHashMap<>());
    for (String aString : stringList) {
        if (seenStrings.add(aString)) {               
                // Not seen yet
        }
    }