Search code examples
javaperformancearraylistduplicatesdetection

Identifying duplicate elements in a list containing 300k+ Strings


I have a list containing 305899 Strings (which is the username for a website). After I remove all the duplicates, the number goes down to 172123 Strings.

I want to find how many times a particular String (the username) is repeated in that ArrayList. I wrote a simple bubble sort type logic but it was too slow.

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
    int duplicate = 0;
    int size = userNameList.size();
    for (int i = 0; i < size - 1; i++) {
        duplicate = 0;
        for (int j = i + 1; j < size; j++) {
            if (userNameList.get(i).equals(userNameList.get(j))) {
                duplicate++;
                userNameList.remove(j);
                j--;
                size--;

            }
        }
        numberOfPosts.put(userNameList.get(i), duplicate);
    }

    return numberOfPosts;
}

Then I changed it to this:

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    Set<String> unique = new HashSet<String>(userNameList);

    for (String key : unique) {
        numberOfPosts.put(key, Collections.frequency(userNameList, key));
    }

    return numberOfPosts;
}

This was really slow as well. When I mean slow, it would take like 30+ minutes to through the list.

Is there any other efficient way to handle this problem? Just reduce the time it takes to find and count duplicate elements?


Solution

  • Your findNumberOfPosts method is on the right track, but your implementation is doing loads of unnecessary work.
    Try this:

    private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
        Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
    
        for (String userName : userNameList) {
            Integer count = numberOfPosts.get(userName);
            numberOfPosts.put(userName, count == null ? 1 : ++count);
        }
        return numberOfPosts;
    }
    

    This should execute in a couple of seconds on most machines.