Tags: java, collections, duplicates

Efficiently remove duplicate Strings from List by applying a custom comparison (not implementing equals)


I can only give a simple (and hopefully not too abstract) example of a much more complex use case. I'm working with a third-party library. I have a List<SomeLibraryClass> that I want to remove duplicates from. SomeLibraryClass generates a String property. Duplicates, in my case, are all instances of that class whose properties contain the same characters, e.g. "abc" == "bca" == "cab". The class doesn't implement equals. There is a method to compare the instances, which conforms to the Comparable interface. It returns -1, 0 or 1, and it's the only way to identify duplicates (it can't be done by means of string manipulation and comparison).

So I know how to do the comparison, but not where to put it. I don't want to override SomeLibraryClass#equals. Therefore, anything like new HashSet<>(myList) won't help. I looked at Collection#removeIf, which I believe would at best be inefficient (please correct me if I'm wrong). I also tried the merge function of Collectors#toMap, but didn't get very far.

A source code example:

public static void main(String[] args) {
    List<SomeLibraryClass> myObjs = List.of(
            new SomeLibraryClass(), // generates "abc"
            new SomeLibraryClass(), // generates "bca"
            new SomeLibraryClass(), // generates "def"
            new SomeLibraryClass(), // generates "def"
            new SomeLibraryClass(), // generates "fed"
            new SomeLibraryClass()  // generates "ghi"
    );

    // remove duplicates by comparing Strings ignoring the order of their characters

    assert(myObjs.size() == 3); // one representative each of the "abc", "def" and "ghi" groups
}

class SomeLibraryClass {
    private String someProperty;

    SomeLibraryClass() {
        // something
    }

    public String getSomeProperty() {
        return someProperty;
    }

    // no equals
}

I would prefer a solution using the Stream API, because it's usually considered most efficient.


Solution

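  • You can use a TreeSet with your own Comparator. The set treats two elements as duplicates whenever the comparator returns 0, and it keeps only the first one added.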

    Set<SomeLibraryClass> distinct = myObjs.stream()
        .collect(Collectors.toCollection(
            () -> new TreeSet<SomeLibraryClass>(
                // the comparator decides what counts as a duplicate; insert your own here
                Comparator.comparing(SomeLibraryClass::getSomeProperty)
            )
        ));
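To make the TreeSet idea concrete, here is a minimal, self-contained sketch. Everything named in it is an assumption for illustration: Item is a hypothetical stand-in for SomeLibraryClass (it takes its property in the constructor so the example can run), and SAME_CHARS compares the sorted characters of the property only to mimic the "abc" == "bca" rule from the question. In the real use case you would pass the library's own comparison instead. The second method is an alternative that keeps the elements in their original list order, since a TreeSet returns them sorted by the comparator.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class DedupSketch {

    // Hypothetical stand-in for SomeLibraryClass; the real class generates its property itself.
    static final class Item {
        private final String someProperty;

        Item(String someProperty) {
            this.someProperty = someProperty;
        }

        String getSomeProperty() {
            return someProperty;
        }
        // no equals, no hashCode
    }

    // Assumed duplicate rule, for illustration only: same characters in any order.
    static String sortedChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Returns 0 exactly when two items count as duplicates under the assumed rule.
    static final Comparator<Item> SAME_CHARS =
            Comparator.comparing((Item item) -> sortedChars(item.getSomeProperty()));

    // Variant 1: collect into a TreeSet; the result is ordered by the comparator.
    static Set<Item> distinctSorted(List<Item> items) {
        return items.stream()
                .collect(Collectors.toCollection(() -> new TreeSet<Item>(SAME_CHARS)));
    }

    // Variant 2: keep the original list order by using the TreeSet only as a "seen" filter.
    // The predicate is stateful, so this is meant for sequential streams only.
    static List<Item> distinctInEncounterOrder(List<Item> items) {
        Set<Item> seen = new TreeSet<>(SAME_CHARS);
        return items.stream()
                .filter(seen::add) // add returns false if an equivalent item is already present
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Item> myObjs = Arrays.asList(
                new Item("abc"), new Item("bca"), new Item("def"),
                new Item("def"), new Item("fed"), new Item("ghi"));

        System.out.println(distinctSorted(myObjs).size());           // prints 3
        System.out.println(distinctInEncounterOrder(myObjs).size()); // prints 3
    }
}
```

Both variants cost O(n log n) comparisons, because each insertion into the TreeSet is a logarithmic lookup. In both cases the first element of each duplicate group is the one that survives.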