In Go, I am using ints to represent elements of a set, a sorted []int to represent a subset, and [][]int to represent a set of subsets. Two identical []int slices (subsets) should not be allowed. Thus, given some var setOfSubsets [][]int, I would like to detect duplicate []int slices in setOfSubsets. What is an efficient, idiomatic way to detect these duplicates?

We can assume that the elements of each subset are sorted. For my program, I will return an error when the first duplicate is detected.
As an example, a user could create setOfSubsets:
setOfSubsets := make([][]int, 0)
subSetA := []int{0, 1} // 0 and 1 are elements
subSetB := []int{0, 2} // 0 and 2 are elements
subSetADuplicate := []int{0, 1} // 0 and 1 are elements
setOfSubsets = append(setOfSubsets, subSetA)
setOfSubsets = append(setOfSubsets, subSetB)
setOfSubsets = append(setOfSubsets, subSetADuplicate)
// many more subsets added before the client calls
check(setOfSubsets)
then my code check(setOfSubsets [][]int) error should detect that {0, 1} is duplicated. In the real program, there could be thousands of elements and perhaps millions of subsets. To be clear, my question is how best to write check(setOfSubsets [][]int).

I thought of using a map with []int as the key, but slices are not allowed as map keys in Go. I also considered sorting the set of subsets and then checking adjacent []int slices for equality, but I am not sure that is idiomatic or efficient.
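For reference, the map idea could be made to work by encoding each subset as a string and using that as the key. Here is a rough sketch of what I had in mind (the name checkWithMap and the fmt.Sprint encoding are just illustrative; I have not measured this):

import "fmt"

// checkWithMap is a sketch of the map-based idea: encode each sorted
// subset as a string and use that as the map key.
func checkWithMap(subsets [][]int) error {
	seen := make(map[string]struct{}, len(subsets))
	for _, subset := range subsets {
		// fmt.Sprint renders []int{0, 1} as "[0 1]", which is unambiguous
		// for int elements, but it allocates a string per subset.
		key := fmt.Sprint(subset)
		if _, ok := seen[key]; ok {
			return fmt.Errorf(
				"duplicate subsets are not allowed. The subset %v was found twice",
				subset)
		}
		seen[key] = struct{}{}
	}
	return nil
}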
One simple approach is to sort the subsets [][]int, which makes duplicate subsets adjacent to each other (there can be several groups of duplicates). Then duplicates can be found by checking whether adjacent subsets are equal. Sorting costs O(n log n) slice comparisons for n subsets, each comparison being O(k) for subsets of length k, and the final scan is a single linear pass. The code example below returns nil if there are no duplicates (all subsets are unique) or an error about one duplicate found (there could be more).
import (
	"fmt"
	"slices"
)

func checkSubsetsForDuplicates(subsets [][]int) error {
	if len(subsets) < 2 {
		return nil // fewer than two subsets cannot contain a duplicate
	}
	subsetsCopy := make([][]int, len(subsets))
	// This copy could be omitted if it is ok to change the order of subsets.
	copy(subsetsCopy, subsets)
	// Sort lexicographically so that equal subsets become adjacent.
	slices.SortFunc(subsetsCopy, slices.Compare)
	prevSubset := subsetsCopy[0]
	for _, subset := range subsetsCopy[1:] {
		if slices.Equal(prevSubset, subset) {
			return fmt.Errorf(
				"duplicate subsets are not allowed. The subset %v was found twice",
				subset)
		}
		prevSubset = subset
	}
	return nil
}
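For instance, applied to the setOfSubsets example from the question:

err := checkSubsetsForDuplicates(setOfSubsets)
// err: duplicate subsets are not allowed. The subset [0 1] was found twice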
I've benchmarked this approach using
func BenchmarkCheckSubsetsForDuplicates(b *testing.B) {
	// Use combinations to get unique subsets.
	subsets := combin.Combinations(25, 10) // "gonum.org/v1/gonum/stat/combin"
	assert.Assert(b, len(subsets) == 3268760) // expected number of combinations
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		// Shuffle for an unbiased order of the subsets.
		rand.Shuffle(len(subsets), func(i, j int) { subsets[i], subsets[j] = subsets[j], subsets[i] })
		b.StartTimer()
		err := checkSubsetsForDuplicates(subsets)
		assert.NilError(b, err)
	}
}
getting the following results:
cpu: Intel(R) Processor 5Y10 CPU @ 0.80GHz
BenchmarkCheckSubsetsForDuplicates-4 1 4126782886 ns/op
BenchmarkCheckSubsetsForDuplicates-4 1 3947514637 ns/op
BenchmarkCheckSubsetsForDuplicates-4 1 3824476894 ns/op
BenchmarkCheckSubsetsForDuplicates-4 1 3805860089 ns/op
PASS
on my 8-year-old laptop. For this data, checkSubsetsForDuplicates takes around 4 seconds. The timing results will depend on the specific data, but this satisfies the "millions of subsets" scale stated in the question.