
How can I detect duplicate slices of ints in Go?


In Go, I am using ints to represent elements of a set, a sorted []int to represent a subset, and [][]int to represent a set of subsets. Two identical []int slices (subsets) should not be allowed. Thus, given some var setOfSubsets [][]int, I would like to detect duplicate []int slices in setOfSubsets. What is an efficient, idiomatic way to detect these duplicates?

We can assume that the elements of the subset are sorted.

For my program, I will return an error when the first duplicate is detected.

As an example, a user could create setOfSubsets

setOfSubsets := make([][]int, 0)

subSetA := []int{0, 1}  // 0 and 1 are elements
subSetB := []int{0, 2}  // 0 and 2 are elements
subSetADuplicate := []int{0, 1} // 0 and 1 are elements

setOfSubsets = append(setOfSubsets, subSetA)
setOfSubsets = append(setOfSubsets, subSetB)
setOfSubsets = append(setOfSubsets, subSetADuplicate)

// many more subsets added before the client calls
check(setOfSubsets)

then my code check(setOfSubsets [][]int) error should detect that {0, 1} is duplicated. In the real program, there could be thousands of elements and perhaps millions of subsets. To be clear, my question is how best to write check(setOfSubsets [][]int).

I thought of using a map with []int as the key, but slices are not comparable in Go and therefore cannot be used as map keys. I also considered sorting the set of subsets and then checking adjacent []int slices for equality, but I am not sure that is idiomatic or efficient.
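For example, the map idea could be salvaged by encoding each subset as a string key (a sketch of what I had in mind; subsetKey is a name I made up):

```go
package main

import "fmt"

// subsetKey encodes a sorted subset as a string so it can be used as a
// map key (the slice itself cannot, since slices are not comparable).
func subsetKey(subset []int) string {
	return fmt.Sprint(subset)
}

func main() {
	seen := make(map[string]bool)
	for _, subset := range [][]int{{0, 1}, {0, 2}, {0, 1}} {
		k := subsetKey(subset)
		if seen[k] {
			fmt.Printf("duplicate subset: %v\n", subset)
			return
		}
		seen[k] = true
	}
}
```

This allocates a string per subset, which is why I am unsure it is the efficient choice at the scale described above.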


Solution

  • One simple approach is to sort subsets [][]int, which will make duplicate subsets adjacent to each other (there can be several groups of duplicates). Then duplicates can be found by checking whether adjacent subsets are equal. The code example below returns nil if there are no duplicates (all subsets are unique), or an error about the first duplicate found (there could be more).

    // Requires Go 1.21+ for the standard library "slices" package;
    // imports needed: "fmt" and "slices".
    func checkSubsetsForDuplicates(subsets [][]int) error {
        if len(subsets) < 2 {
            return nil // zero or one subset cannot contain a duplicate
        }
        subsetsCopy := make([][]int, len(subsets))
        // This copy could be omitted if it is ok to change the order of subsets.
        copy(subsetsCopy, subsets)
        // Sort lexicographically so that duplicate subsets become adjacent.
        slices.SortFunc(subsetsCopy, slices.Compare)
        prevSubset := subsetsCopy[0]
        for _, subset := range subsetsCopy[1:] {
            if slices.Equal(prevSubset, subset) {
                return fmt.Errorf(
                    "duplicate subsets are not allowed: the subset %v was found twice",
                    subset)
            }
            prevSubset = subset
        }
        return nil
    }
    
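    A minimal usage sketch with the data from the question (the package wrapper and main are illustrative; the function body is repeated so the example is self-contained):

```go
package main

import (
	"fmt"
	"slices"
)

func checkSubsetsForDuplicates(subsets [][]int) error {
	if len(subsets) < 2 {
		return nil
	}
	subsetsCopy := make([][]int, len(subsets))
	copy(subsetsCopy, subsets)
	slices.SortFunc(subsetsCopy, slices.Compare)
	prevSubset := subsetsCopy[0]
	for _, subset := range subsetsCopy[1:] {
		if slices.Equal(prevSubset, subset) {
			return fmt.Errorf(
				"duplicate subsets are not allowed: the subset %v was found twice",
				subset)
		}
		prevSubset = subset
	}
	return nil
}

func main() {
	setOfSubsets := [][]int{{0, 1}, {0, 2}, {0, 1}}
	if err := checkSubsetsForDuplicates(setOfSubsets); err != nil {
		fmt.Println(err) // the error mentions the subset [0 1]
	}
}
```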
    

    I've benchmarked this approach using

    func BenchmarkCheckSubsetsForDuplicates(b *testing.B) {
        // Use combinations to get unique subsets.
        subsets := combin.Combinations(25, 10)    // "gonum.org/v1/gonum/stat/combin"
        assert.Assert(b, len(subsets) == 3268760) // expected number of combinations

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            b.StopTimer()
            // Shuffle for an unbiased order of the subsets.
            rand.Shuffle(len(subsets), func(i, j int) { subsets[i], subsets[j] = subsets[j], subsets[i] })
            b.StartTimer()
            err := checkSubsetsForDuplicates(subsets)
            assert.NilError(b, err)
        }
    }
    

    getting results

    cpu: Intel(R) Processor 5Y10 CPU @ 0.80GHz
    BenchmarkCheckSubsetsForDuplicates-4                  1        4126782886 ns/op
    BenchmarkCheckSubsetsForDuplicates-4                  1        3947514637 ns/op
    BenchmarkCheckSubsetsForDuplicates-4                  1        3824476894 ns/op
    BenchmarkCheckSubsetsForDuplicates-4                  1        3805860089 ns/op
    PASS
    

    on my 8-year-old laptop. For this data, checkSubsetsForDuplicates takes around 4 seconds. The timing results will depend on the specific data, but this covers the "millions of subsets" scale stated in the question.