I want to write controller logic and handle JSON and databases in Go, while keeping my math processing model in C. In my opinion, the overhead of calling a C function should be as low as setting the registers rcx, rdx, rsi, rdi, doing a fastcall, and reading the result from rax. But I've heard that cgo has a big overhead.
Say I have a common fastcall x64 C function int64_t f(int64_t a, int64_t b, int64_t c, int64_t d) { return a + b + c + d; }
How can I call it from Go to get the highest possible score in a Go testing.B benchmark?
PS: no pointer passing, no tricks; I'm just interested in how to access a C interface in the most robust way.
In my opinion, the overhead of calling a C function should be as low as setting the registers rcx, rdx, rsi, rdi, doing a fastcall, and reading the result from rax. But I've heard that cgo has a big overhead <…>
Your opinion is unfounded.
Calling from Go to C has noticeable overhead for the following reasons.
While not in any way required by the language, a typical C program compiled by a typical compiler and running on a typical OS as a regular process, heavily relies on the OS to carry out certain aspects of its runtime environment.
Probably the most visible and important aspect is the stack: the kernel is responsible for setting it up after loading and initializing the program's image and before transferring execution to the entry point of the code of the newborn process.
Another crucial point is that, again, while not strictly required, most C programs rely on OS-native threads to implement multiple concurrently executing flows through the program's code.
The function calls performed in the C code are typically compiled using the ABI that the target combination of the OS and hardware implements (unless, of course, the programmer has explicitly told the compiler to do otherwise, for instance by marking a specific function as having a different calling convention).
C has no automatic means of managing non-stack memory ("the heap").
Such management is typically done via the C standard library functions of the malloc(3) family.
These functions manage the heap and consider any memory allocated through them as "theirs" (which is quite logical).
C does not provide automatic garbage collection.
Let's recap: a typical program compiled from C: uses the OS-supplied threads and uses OS-supplied stacks in those threads; function calls most of the time follow the platform's ABI; heap memory is managed by a special library code; no GC.
The points above naturally lead to goroutines having their own stacks which are completely independent of those provided by the OS for its threads.
The heap memory is managed by the Go runtime, automatically, and it's done directly: no C stdlib is used for this.
Go has a GC, and this GC is concurrent in that it runs concurrently with the goroutines executing the program's code.
The stacks used by goroutines are allocated on the heap using the memory manager provided by the Go runtime. Unlike C, these stacks are reallocatable².
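The growable stacks can be observed from plain Go. The following is only a minimal sketch: the function name depth and the recursion depth are made up for illustration, and the exact initial stack size depends on the Go version. The recursion needs far more stack than a fresh goroutine starts with, so it only completes because the runtime reallocates the goroutine's stack as it grows.

```go
package main

import "fmt"

// depth (a hypothetical helper) recurses n levels deep. Each level keeps
// its own frame because the addition happens after the recursive call
// returns.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

func main() {
	done := make(chan int)
	go func() {
		// A new goroutine starts with a small stack; the runtime
		// grows it on demand while the recursion deepens.
		done <- depth(1_000_000)
	}()
	fmt.Println(<-done)
}
```

A C thread doing the same with a fixed-size stack of a few megabytes would be likely to overflow it, which is one reason C code called via cgo cannot simply run on a goroutine's stack.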
Let's recap: goroutines have their own stacks, use a calling convention¹ compatible with neither the platform's ABI nor that of C, and may be executing on different OS threads at different points of their execution.
The Go runtime manages the heap memory directly (this includes the stacks of the goroutines) and has a fully-concurrent GC.
As you can probably see by now, the "worlds" of runtime environments in which Go and C code run are different enough to have a big "impedance mismatch", which requires certain gatewaying when doing FFI, with non-zero cost.
In particular, when the Go code is about to call into C, the following must be done:
The cgo machinery must verify that any memory about to be passed to the target C call does not contain pointers to other memory blocks managed by Go, recursively; this is to allow Go's GC to continue working concurrently.
As you can probably see, there are unavoidable costs, and placing values in some CPU registers is the most negligible of those costs.
Generally, there are two vectors to attack the problem:
Make the calls to C infrequent.
That is, if each call to C carries out lengthy CPU-intensive calculations, the overhead of making the call is likely to be dwarfed by the gains from performing the computation faster.
Write critical functions in assembly.
Go allows writing code directly in the assembly of the target H/W platform.
One "trick" which may allow you to get the best of both worlds is employing the ability of most industrial compilers to output the assembly language form of the functions they compile. So you may employ hard-core facilities provided by a C compiler, such as auto-vectorisation (for SSE) and aggressive optimisation, and then grab whatever it generated and wrap it in a thin layer of assembly which basically adapts the generated code to Go's native ABI.
There's a host of 3rd-party Go packages which do this (say, this and that) and obviously the Go runtime does this as well.
¹ Since 1.17, Go has been progressively switching to a register-based calling convention.
I have no information on whether this makes Go code compiled for particular GOOS/GOARCH combos follow their native ABIs or not.
Go 1.18 implements the register-based calling convention on all supported OSes when compiled for 64-bit CPUs (or CPU modes).
² Before 1.4 goroutine stacks had even more interesting design: they could consist of multiple segments forming a linked list; when a stack wanted to grow beyond its current size, a new segment was allocated and linked to the last one. This was called "split stacks".