Tags: java, arraylist, hashmap, performance, removeall

Which is more efficient: removeAll() or the following HashMap technique for retaining only changed records in an ArrayList?


I have two ArrayLists, A and B, holding the same data structure C (with hashCode() and equals() overridden). C represents a student's record. The two lists are the same size and hold the new and old student records respectively (the students are the same in both lists, though the ordering might differ). I want to keep only those records in A that have changed. So I do:

 A.removeAll(B);

As per the Javadocs, this takes each record of A and compares it with each record of B; if the two are equal, the record is removed from A. If a record of A is not equal to any record in B, then, since every student in A is also in B, that record must have changed. The problem is that this is easily O(n^2).
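
To make that cost concrete, here is a rough sketch of what the call amounts to when B is an ArrayList. It is not the JDK's actual code (OpenJDK compacts the list in a single pass rather than removing through an iterator); it only illustrates that each record of A triggers a linear contains() scan of B. The names RemoveAllSketch and removeAllNaive are hypothetical, and strings stand in for C records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class RemoveAllSketch {
    // Illustration only: removes from a every element that equals some element of b.
    // When b is an ArrayList, b.contains() is a linear scan, so the total number
    // of equals() calls grows like |a| * |b|.
    static <T> void removeAllNaive(List<T> a, List<T> b) {
        for (Iterator<T> it = a.iterator(); it.hasNext(); ) {
            T record = it.next();
            if (b.contains(record)) {   // linear scan of b for every record of a
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        // Strings stand in for student records ("id:grade"); student 2 has changed.
        List<String> a = new ArrayList<>(Arrays.asList("1:A", "2:B", "3:C"));
        List<String> b = Arrays.asList("3:C", "1:A", "2:X");
        removeAllNaive(a, b);
        System.out.println(a);   // prints [2:B]
    }
}
```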

Another approach can be :

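// Needs java.util.ArrayList, HashMap, List and Map.
// Index the old records by student ID (the Integer key type is an assumption;
// use whatever type getStudentId() actually returns).
Map<Integer, C> map = new HashMap<Integer, C>();
for (C record : B) {
    map.put(record.getStudentId(), record);
}
List<C> changedRecords = new ArrayList<C>();
for (C record : A) {
    // Keep the record only if it differs from the old one with the same ID.
    if (!record.equals(map.get(record.getStudentId()))) {
        changedRecords.add(record);
    }
}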

I think this might have lower complexity than the solution above. Is that correct?


Solution

  • Yes, the latter algorithm is better than O(n^2). You have two loops, one ranging over B and another over A, and each iteration does (amortized) constant work, so your new solution runs in O(|A| + |B|).

    I suspect that you don't have any duplicate entries though. If this is the case, you could also go via a HashSet (change to LinkedHashSet if you want to preserve the order in A):

    HashSet<C> tmp = new HashSet<C>(A);
    tmp.removeAll(B);                     // Intended as a linear operation (but see the caveat below)
    A = new ArrayList<C>(tmp);
    

    (Or if order doesn't matter to you, you could use HashSets all the way through.)


    As pointed out by @Daud in the comments below, HashSet.removeAll(Collection c) actually calls c.contains repeatedly when the hash set is not larger than the collection, which affects the complexity (at least in OpenJDK): if c is an ArrayList, each contains call is a linear scan. This is because the implementation chooses to iterate over the smaller collection. Since tmp is never larger than B in your case, this is the branch that gets taken.
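
One way to sidestep that caveat, as a minimal sketch rather than a definitive implementation: wrap B in a HashSet before calling removeAll, so the repeated contains calls are expected constant time. This relies on ArrayList.removeAll testing each of its elements against the argument's contains (as OpenJDK does) and on hashCode() and equals() of C being consistent with each other. The names ChangedRecordsSketch and retainChanged are made up for illustration, and strings stand in for C records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class ChangedRecordsSketch {
    // Keeps in a only the elements that have no equal element in b.
    // Building the HashSet costs O(|b|); each lookup is then expected O(1),
    // so the whole call is expected O(|a| + |b|).
    static <T> void retainChanged(List<T> a, List<T> b) {
        a.removeAll(new HashSet<>(b));
    }

    public static void main(String[] args) {
        // Strings stand in for student records ("id:grade"); student 2 has changed.
        List<String> a = new ArrayList<>(Arrays.asList("1:A", "2:B", "3:C"));
        List<String> b = Arrays.asList("3:C", "1:A", "2:X");
        retainChanged(a, b);
        System.out.println(a);   // prints [2:B]
    }
}
```

Called as ChangedRecordsSketch.retainChanged(A, B), this modifies A in place, keeps the order of A, and also keeps any duplicate changed records, which the HashSet round trip would drop.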