
go - `__GI___pthread_mutex_unlock` takes most of the execution time when using cgo


I'm using cgo to call a C function from Go. Inside the C function there is a callback to a Go function; in other words, the call chain is Go -> C -> Go.

After running pprof, I noticed that __GI___pthread_mutex_unlock took half of the execution time. AFAIK, cgo has an overhead, especially when calling back from C into Go, but it's strange that half of the execution time is spent on locking. Is there something wrong in my code?

main.go

package main

/*
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);

void get_value(uint8_t* key32, uint8_t *value32) {
    uint8_t value[32];
    go_get_value(key32, value);
    memcpy(value32, value, 32);
}
*/
import "C"
import (
    "fmt"
    "runtime"
    "sync"
    "time"
    "unsafe"

    "github.com/pkg/profile"

    _ "github.com/ianlancetaylor/cgosymbolizer"
)

func getValue(key [32]byte) []byte {
    key32 := (*C.uint8_t)(C.CBytes(key[:]))
    value32 := (*C.uint8_t)(C.malloc(32))
    C.get_value(key32, value32)
    ret := C.GoBytes(unsafe.Pointer(value32), 32)
    C.free(unsafe.Pointer(key32))
    C.free(unsafe.Pointer(value32))
    return ret
}

func main() {
    defer profile.Start().Stop()

    numWorkers := runtime.NumCPU()
    fmt.Printf("numWorkers = %v\n", numWorkers)
    numTasks := 10_000_000
    tasks := make(chan struct{}, numTasks)
    for i := 0; i < numTasks; i++ {
        tasks <- struct{}{}
    }
    close(tasks)
    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            for range tasks {
                value := getValue([32]byte{})
                _ = value
            }
            wg.Done()
        }()
    }
    wg.Wait()
    fmt.Printf("took %vms\n", time.Since(start).Milliseconds())
}

callback.go

package main

/*
#include <stdint.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);
*/
import "C"
import (
    "unsafe"
)

func copyToCbytes(src []byte, dst *C.uint8_t) {
    n := len(src)
    for i := 0; i < n; i++ {
        *(*C.uint8_t)(unsafe.Pointer(uintptr(unsafe.Pointer(dst)) + uintptr(i))) = (C.uint8_t)(src[i])
    }
}

//export go_get_value
func go_get_value(key32 *C.uint8_t, value32 *C.uint8_t) {
    key := C.GoBytes(unsafe.Pointer(key32), 32)
    _ = key
    value := make([]byte, 32)
    copyToCbytes(value, value32)
}
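
(As an aside, the byte-by-byte pointer arithmetic in copyToCbytes can be written more idiomatically with unsafe.Slice, available since Go 1.17, which turns a raw pointer plus a length into an ordinary slice. A minimal sketch using a plain *byte — in the cgo code above, dst would be (*byte)(unsafe.Pointer(value32)):)

```go
package main

import (
	"fmt"
	"unsafe"
)

// copyToPtr copies src into the buffer dst points to, which is
// assumed to hold at least len(src) bytes. unsafe.Slice wraps the
// raw pointer in a []byte so the builtin copy can be used instead
// of manual pointer arithmetic.
func copyToPtr(src []byte, dst *byte) {
	copy(unsafe.Slice(dst, len(src)), src)
}

func main() {
	buf := make([]byte, 4)
	copyToPtr([]byte{1, 2, 3, 4}, &buf[0])
	fmt.Println(buf)
}
```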

Running environment:

lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                        0
CPU MHz:                         2200.152
BogoMIPS:                        4400.30
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0-31
...
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities

Golang version

go version go1.20.5 linux/amd64

Here is the pprof result: (screenshot of the pprof graph omitted)

EDIT: added running environment


Solution

  • Though I could not reproduce it with your program above:

     ± go1.20.5 tool pprof /tmp/profile3378726905/cpu.pprof
    File: callback
    Build ID: e295a7c26f8d6b18641985f09c9fa3872b3ae569
    Type: cpu
    Time: Jun 20, 2023 at 12:41pm (+07)
    Duration: 3.08s, Total samples = 19.49s (632.73%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top
    Showing nodes accounting for 11520ms, 59.11% of 19490ms total
    Dropped 120 nodes (cum <= 97.45ms)
    Showing top 10 nodes out of 88
          flat  flat%   sum%        cum   cum%
        2320ms 11.90% 11.90%     2320ms 11.90%  __GI___lll_lock_wake
        2170ms 11.13% 23.04%     2170ms 11.13%  futex_wait (inline)
        1750ms  8.98% 32.02%     1750ms  8.98%  runtime.procyield
        1010ms  5.18% 37.20%     1240ms  6.36%  runtime.casgstatus
         950ms  4.87% 42.07%     1060ms  5.44%  runtime.reentersyscall
         870ms  4.46% 46.54%     3480ms 17.86%  lll_mutex_lock_optimized (inline)
         740ms  3.80% 50.33%     1900ms  9.75%  runtime.mallocgc
         610ms  3.13% 53.46%     4090ms 20.99%  ___pthread_mutex_lock
         570ms  2.92% 56.39%      810ms  4.16%  runtime.exitsyscallfast
         530ms  2.72% 59.11%     2220ms 11.39%  runtime.lock2
    

    But there is a global mutex guarding every C-to-Go callback, so parallel callbacks will serialize on it, and that would kill the performance.
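
    The serializing effect can be sketched in pure Go (a model of the contention, not the actual runtime lock): if every simulated callback must take one process-wide mutex, the workers spend their time queuing on it rather than running in parallel, which matches the lock/futex functions dominating the profile.

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    // mu stands in for the single runtime-internal lock that every
    // C-to-Go callback entry is claimed to take: with N workers, all
    // N goroutines serialize here on every task.
    var (
    	mu      sync.Mutex
    	counter int
    )

    func simulatedCallback() {
    	mu.Lock()
    	counter++ // the per-callback work happens under the global lock
    	mu.Unlock()
    }

    func main() {
    	const workers, tasksPerWorker = 8, 10_000
    	var wg sync.WaitGroup
    	for w := 0; w < workers; w++ {
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			for i := 0; i < tasksPerWorker; i++ {
    				simulatedCallback()
    			}
    		}()
    	}
    	wg.Wait()
    	fmt.Println(counter)
    }
    ```

    Profiling this model shows the same shape as the cgo program: mutex lock/unlock paths dominate, and adding workers does not improve throughput.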