Search code examples
algorithmkotlinaws-lambdacollections

Efficient way to compare and partition two collections in Kotlin


I'm currently working on a Kotlin project to automate the upkeep of emergency location data. As part of that project I need to compare two collections of Json objects to determine what data needs to be added/updated/deleted. Collection A is the collection of existing data. Collection B is the incoming data from the job. While the objects in the collections are not identical, they have a unique id that can be used to link them.

For each item that exists in A but not in B I need to make a delete call. For each item that is in B but not in A I need to do a create. For each item that exists in A and B I need to do an update.

I need to figure out a way to determine the needed operations, each of which will involve an HTTP request to a 3rd party API, and execute the needed operations as efficiently as possible. I know I could simply brute force the problem by iterating over all the items in each collection. However since this is going to be part of an AWS Lambda, I don't think that is going to cut it.

What is the most efficient approach to a problem like this using Kotlin?

Update: I ended up going with the solution from @broot however it required an extra step. Since the objects in each list are from two different classes, I needed to create an interface to define their common properties. When I was done my classes looked like

interface HasUid{
    val uid:String
}
data class OldData(val id:String, val name:String, val mac:String):HasUid {
    override val uid: String
        get() = name
}

data class NewData(val identifier:String, val display_name:String, val mac:String):HasUid {
    override val uid: String
        get() = display_name
}

and I called the function with

val(toDelete,toAdd,toUpdate) = diffSetsBy(oldDataList, newDataList, HasUid::uid)
        println("to update: " + toUpdate.size)
        println("to add: " + toAdd.size)
        println("to delete: " + toDelete.size)

Solution

  • If I understand the question correctly, it is simply about comparing two sets of objects that are stored already in the memory, by some kind of a key. There isn't really too much rocket science or performance optimizations we can involve here. The only optimization is to not compare each item with each item (that would be O(N*M)), but instead, create a hash set/map and search in it (O(N+M)):

    fun <T : Any> diffSetsBy(setA: Iterable<T>, setB: Iterable<T>, keySelector: (T) -> Any?): SetsDiff<T> {
        val onlyA = setA.associateByTo(mutableMapOf(), keySelector)
        val onlyB = mutableListOf<T>()
        val both = mutableListOf<T>()
    
        for (b in setB) {
            if (onlyA.remove(keySelector(b)) != null) {
                both += b
            } else {
                onlyB += b
            }
        }
        return SetsDiff(onlyA.values, onlyB, both)
    }
    
    data class SetsDiff<T>(val onlyA: Iterable<T>, val onlyB: Iterable<T>, val both: Iterable<T>)
    

    Algorithm requires that keys are not duplicated in a single set. I used a traditional imperative code as I don't see a way to make the code more functional or idiomatic and keep the readability. Also, as I created a generic diffSetsBy function, it would probably make sense to return pairs of objects for both, but we don't need it for our case and I was too lazy.

    We can use it like this:

    fun main() {
        val users = listOf(User("1", "James"), User("2", "Jonh"))
        val newUsers = listOf(User("2", "John"), User("3", "Dave"))
    
        val (toDelete, toAdd, toUpdate) = diffSetsBy(users, newUsers, User::login)
        println(toDelete) // [User(login=1, name=James)]
        println(toAdd) // [User(login=3, name=Dave)]
        println(toUpdate) // [User(login=2, name=John)]
    }
    
    data class User(val login: String, val name: String)