Search code examples
sortinggoduplicates

Fastest way to remove duplicates of a sorted Go slice


    mySlice := make([]uint32, 0, 4294967290)

// ...

        // Sort the slice
        sort.Slice(mySlice, func(i, j int) bool {
            x := mySlice[i]
            y := mySlice[j]
            return x < y
        })

What's the fastest way to remove slice duplicates?

How can I take advantage of the fact that slice is already sorted?

Update

I came up with this:

func RemoveDuplicates(s []uint32) {
    if len(s) < 2 {
        return
    }
    tmp := make([]uint32, 0, len(s))

    for i := uint32(0); i < uint32(len(s)); i++ {
        // If current is not equal to next then store the current
        if s[i] != s[i+1] {
            tmp = append(tmp, s[i])
        }
    }

    // The last must be stored
    // Note that if it was repeated, the duplicates are NOT stored before
    tmp = append(tmp, s[len(s)-1])

    // Modify original slice
    s = nil
    s = append(s, tmp...)
}

Any mistake? Any bug? Any way to improve?

Update

As noted by @mh-cbon the correct loop max should be i < uint32(len(s)) - 1:

for i := uint32(0); i < uint32(len(s)) - 1; i++ {

Solution

  • not an answer as to the fastest, rather a step by step about the methodology to apply using the Go language to optimize a piece of code.

    For a very formal insight of what is the fastest, see https://stackoverflow.com/a/6072100/4466350

    Your code is buggy. Always write a test.

    First, let use write a main

    package main
    
    import (
        "math/rand"
        "sort"
    )
    
    func main() {
    }
    
    func randSlice(max int) (ret []uint32) {
        // we should check that max does not exceed maxUINT32
        ret = make([]uint32, 0, max)
        r := rand.New(rand.NewSource(99))
        for i := 0; i < max; i++ {
            ret = append(ret, uint32(r.Intn(max)))
        }
        sort.Slice(ret, func(i, j int) bool {
            return ret[i] < ret[j]
        })
        return
    }
    
    func dedup1(s []uint32) []uint32 {
        if len(s) < 2 {
            return s
        }
        tmp := make([]uint32, 0, len(s))
    
        for i := uint32(0); i < uint32(len(s)); i++ {
            // If current is not equal to next then store the current
            if s[i] != s[i+1] {
                tmp = append(tmp, s[i])
            }
        }
    
        // The last must be stored
        // Note that if it was repeated, the duplicates are NOT stored before
        tmp = append(tmp, s[len(s)-1])
    
        // Modify original slice
        s = nil
        s = append(s, tmp...)
        return s
    }
    
    

    And the accompanying test

    package main
    
    import "testing"
    
    func TestDedup1(t *testing.T) {
        s := randSlice(10)
        res := dedup1(s)
        uniq := map[uint32]bool{}
        for _, r := range res {
            _, ok := uniq[r]
            if ok {
                t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
            }
            uniq[r] = true
        }
    }
    

    running this we get

    $ go test -v . 
    === RUN   TestDedup1
    --- FAIL: TestDedup1 (0.00s)
    panic: runtime error: index out of range [10] with length 10 [recovered]
        panic: runtime error: index out of range [10] with length 10
    
    goroutine 18 [running]:
    testing.tRunner.func1.1(0x536680, 0xc0000da040)
        /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1076 +0x30d
    testing.tRunner.func1(0xc000082600)
        /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1079 +0x41a
    panic(0x536680, 0xc0000da040)
        /home/mh-cbon/.gvm/gos/go1.15.2/src/runtime/panic.go:969 +0x175
    test/d/dup.dedup1(0xc000094060, 0xa, 0xa, 0xa, 0x6124a0, 0xc00003c770)
        /home/mh-cbon/gow/src/test/d/dup/main.go:32 +0x248
    test/d/dup.TestDedup1(0xc000082600)
        /home/mh-cbon/gow/src/test/d/dup/main_test.go:7 +0x70
    testing.tRunner(0xc000082600, 0x54fbf0)
        /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1127 +0xef
    created by testing.(*T).Run
        /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1178 +0x386
    FAIL    test/d/dup  0.006s
    FAIL
    

    we fix this by appropriately checking for the slice bounds.

    In dedup1, change this condition if s[i] != s[i+1] { to if i+1 < uint32(len(s)) && s[i] != s[i+1] {, or even better, reduce iteration max value by one for i := uint32(0); i < uint32(len(s))-1; i++ {

    Next, write a function to generate a slice with random duplicates.

    func randSliceWithDups(max int) (ret []uint32) {
        ret = randSlice(max / 2)
        ret = append(ret, ret...)
        sort.Slice(ret, func(i, j int) bool {
            return ret[i] < ret[j]
        })
        return
    }
    

    writing randSliceWithDups(50) get slices such as [0 0 1 1 2 2 3 3 3 3 4 4 5 5 5 5 6 6 6 6 7 7 8 8 9 9 9 9 12 12 13 13 18 18 18 18 18 18 19 19 20 20 20 20 21 21 22 22 24 24]

    update the tests again

    func TestDedup1_with_dups(t *testing.T) {
        s := randSliceWithDups(10)
        res := dedup1(s)
        uniq := map[uint32]bool{}
        for _, r := range res {
            _, ok := uniq[r]
            if ok {
                t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
            }
            uniq[r] = true
        }
    }
    

    Next, add a benchmark; It will help us spot performance issue and maintain performance over time.

    func BenchmarkDedup1_1000(b *testing.B) {
        s := randSliceWithDups(100)
        b.ResetTimer()
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _ = dedup1(s)
        }
    }
    

    running this we get :

    $ go test -v . -bench=.
    === RUN   TestDedup1
    --- PASS: TestDedup1 (0.00s)
    === RUN   TestDedup1_with_dups
    --- PASS: TestDedup1_with_dups (0.00s)
    goos: linux
    goarch: amd64
    pkg: test/d/dup
    BenchmarkDedup1_1000
    BenchmarkDedup1_1000-4        172087          6353 ns/op        5504 B/op          2 allocs/op
    PASS
    ok      test/d/dup  1.174s
    

    Let us state the obvious every one has spotted reading your initial code without even writing a benchmark, you are allocating.

    That raises the question as to figure out if you are allowed to modify the input slice in place or not. If you can change it in place, we might take advantage of this to prevent that allocations and speed up your program.

    One solution, wrote from scratch (consider search on geekforgeeks like website for a generally accepted solution), is to iterate over the slice and maintain an index of the next position to write. When a non duplicate is found, save the non duplicate to this last position, then increment that index by one. Finally, return the slice up to the value of the incremented indice.

    func dedup2(s []uint32) []uint32 {
        if len(s) < 2 {
            return s
        }
    
        var e int
        for i := 1; i < len(s); i++ {
            if s[i] == s[i-1] {
                continue
            }
            s[e] = s[i]
            e++
        }
    
        return s[:e]
    }
    

    Again, add tests and benchs, and check for the result.

    func TestDedup2(t *testing.T) {
        s := randSlice(10)
        res := dedup2(s)
        uniq := map[uint32]bool{}
        for _, r := range res {
            _, ok := uniq[r]
            if ok {
                t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
            }
            uniq[r] = true
        }
    }
    
    func TestDedup2_with_dups(t *testing.T) {
        s := randSliceWithDups(10)
        res := dedup2(s)
        uniq := map[uint32]bool{}
        for _, r := range res {
            _, ok := uniq[r]
            if ok {
                t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
            }
            uniq[r] = true
        }
    }
    
    func BenchmarkDedup2_1000(b *testing.B) {
        s := randSliceWithDups(100)
        b.ResetTimer()
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _ = dedup2(s)
        }
    }
    

    Which yields:

    $ go test -v . -bench=.
    === RUN   TestDedup1
    --- PASS: TestDedup1 (0.00s)
    === RUN   TestDedup1_with_dups
    --- PASS: TestDedup1_with_dups (0.00s)
    === RUN   TestDedup2
    --- PASS: TestDedup2 (0.00s)
    === RUN   TestDedup2_with_dups
    --- PASS: TestDedup2_with_dups (0.00s)
    goos: linux
    goarch: amd64
    pkg: test/d/dup
    BenchmarkDedup1_1000
    BenchmarkDedup1_1000-4       1764574           673 ns/op         544 B/op          2 allocs/op
    BenchmarkDedup2_1000
    BenchmarkDedup2_1000-4       7758907           152 ns/op           0 B/op          0 allocs/op
    PASS
    ok      test/d/dup  3.224s
    

    a 4x improvement ! cool ! What s next ? Next is to find an even better algorithm which produces less executions, less lookup and so on.

    Though, the last version contains a bug ! Have you spot it ?

    See this test, which is better than the other because it does not rely on random numbers, but on static values with strong equality checks. Using those kind of tests you can tailor made your input to check for fine grained situation.

    func TestDedup2_static(t *testing.T) {
        type expectation struct {
            input []uint32
            want  []uint32
        }
    
        expectations := []expectation{
            expectation{
                input: []uint32{0, 0, 1, 2, 3, 3, 3, 4, 4, 5},
                want:  []uint32{0, 1, 2, 3, 4, 5},
            },
            expectation{
                input: []uint32{0, 1, 2, 3, 3, 3, 4, 4, 5},
                want:  []uint32{0, 1, 2, 3, 4, 5},
            },
        }
    
        for _, e := range expectations {
            res := dedup2(e.input)
            if !reflect.DeepEqual(res, e.want) {
                t.Fatalf("invlaid result, wanted=%#v\ngot=%#v\n", e.want, res)
            }
        }
    }
    

    It uses table drive testing as described at https://dave.cheney.net/2013/06/09/writing-table-driven-tests-in-go

    Lets fix this:

    func dedup2(s []uint32) []uint32 {
        if len(s) < 2 {
            return s
        }
    
        var e int = 1
        for i := 1; i < len(s); i++ {
            if s[i] == s[i-1] {
                continue
            }
            s[e] = s[i]
            e++
        }
    
        return s[:e]
    }