I am looking into the CPython implementation and learned how Python handles operator overloading (for example, comparison operators) using the richcmpfunc tp_richcompare; field in the _typeobject struct. The type is defined as: typedef PyObject *(*richcmpfunc) (PyObject *, PyObject *, int); Whenever a PyObject is an operand of one of these operators, the interpreter tries to call its tp_richcompare function.
My doubt is that in Python we use magic methods like __gt__ etc. to override these operators. So how does the Python code get converted into C code as a tp_richcompare function that is used everywhere a comparison operator is interpreted for a PyObject?
My second doubt is a more general version of this: when code written in one language (here Python) overrides things (operators, hash, etc.) that are interpreted in another language (C in the case of CPython), how does the second language call the function defined in the first? As far as I know, the generated bytecode is a low-level, instruction-based representation (essentially an array of uint8_t).
Another example of this is __hash__, which would be defined in Python but is needed in the C-based implementation of the dictionary during lookdict. Again, a C function of type typedef Py_hash_t (*hashfunc)(PyObject *); is used everywhere a hash is needed for a PyObject, but the translation of __hash__ to this C function is mysterious.
Python code is not transformed into C code. It is interpreted by C code (in CPython), but that's a completely different concept.
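One way to see this for yourself is to ask CPython what it compiled a function into. The sketch below uses the standard dis module; the Version class and the is_newer function are hypothetical names made up for the illustration. The exact instruction names depend on the CPython version, but on the versions I am aware of the comparison shows up as a single COMPARE_OP instruction that never mentions __gt__ or any C function.

```python
import dis


class Version:
    """Hypothetical class that overrides the > operator."""

    def __init__(self, parts):
        self.parts = parts

    def __gt__(self, other):
        return self.parts > other.parts


def is_newer(a, b):
    # The same bytecode runs whether a and b are ints or Version objects.
    return a > b


# Collect the names of the virtual machine instructions for is_newer.
opnames = [ins.opname for ins in dis.get_instructions(is_newer)]

if __name__ == "__main__":
    dis.dis(is_newer)  # human-readable listing; details vary by version
    print(is_newer(3, 2), is_newer(Version((1, 10)), Version((1, 2))))
```

The instruction only says "compare the two values on top of the stack". Which function actually does the comparing is decided at run time, from the type of the operand.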
There are many ways to interpret a Python program, and the language reference does not specify any particular mechanism. CPython does it by transforming each Python function into a list of virtual machine instructions, which can then be interpreted with a virtual machine emulator. That's one approach. Another one would be to just build the AST and then define a (recursive) evaluate method on each AST node.
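To make the two approaches concrete, here is a minimal sketch for a toy expression language (not Python itself, and not how CPython is written). The node classes, the opcode names, and the run function are all invented for the illustration. Each node has a recursive evaluate method, and also a compile method that flattens the same tree into a list of instructions for a small stack machine.

```python
import operator
from dataclasses import dataclass

# Operations shared by both interpreters (opcode names are made up).
OPS = {"ADD": operator.add, "MUL": operator.mul, "GT": operator.gt}


@dataclass
class Num:
    value: int

    def evaluate(self, env):
        return self.value

    def compile(self, code):
        code.append(("PUSH", self.value))


@dataclass
class Var:
    name: str

    def evaluate(self, env):
        return env[self.name]

    def compile(self, code):
        code.append(("LOAD", self.name))


@dataclass
class BinOp:
    op: str
    left: object
    right: object

    def evaluate(self, env):
        # Approach 1: walk the tree recursively.
        return OPS[self.op](self.left.evaluate(env), self.right.evaluate(env))

    def compile(self, code):
        # Approach 2: emit instructions in postfix order for a stack machine.
        self.left.compile(code)
        self.right.compile(code)
        code.append((self.op, None))


def run(code, env):
    """Virtual machine emulator: one loop, one branch per instruction."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "LOAD":
            stack.append(env[arg])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[opcode](left, right))
    return stack.pop()


# (x + 2) > (3 * 4)
tree = BinOp("GT", BinOp("ADD", Var("x"), Num(2)), BinOp("MUL", Num(3), Num(4)))
env = {"x": 11}
program = []
tree.compile(program)

if __name__ == "__main__":
    print(tree.evaluate(env), run(program, env))
```

Both interpreters end up calling the same entries in OPS. Neither one produces code in the host language; they only decide in which order the host-language functions get called.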
Of course, it would also be possible to transform the program into C code and compile the C code for future execution. (Here, "C" is not important. It could be any compiled language which seems convenient.) However, there's not much benefit to doing that, and lots of disadvantages. One problem, which I guess is the one behind your question, is that Python types don't correspond to any C primitive type. The only way to represent a Python object in C is to use a structure, such as CPython's PyObject, which is effectively a low-level mechanism for defining classes (a concept foreign to C): it includes a pointer to a type object, which contains a virtual method table, which in turn contains pointers to the functions used to implement the various operations on objects of that type. In effect, the compiled code will end up calling the same functions as the interpreter would call to implement each operation; the only purpose of the compiled C code is to sequence the calls without having to walk through an interpretable structure (VM list or AST or whatever). That might be slightly faster, since it avoids a switch statement on each AST node or VM operation, but it's also a lot bulkier, because a function call occupies a lot more space in memory than a single opcode byte.
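The "pointer to a type object with a table of function pointers" idea can be modelled in a few lines. The following is a toy model written in Python for readability, not CPython's actual code; TypeObject, Obj, slot_richcompare, call_user_function and the other names are hypothetical. A built-in type puts a native function directly in its slot. A user-defined class gets a generic slot function that looks up the method by name (__gt__, __hash__) in the class's namespace and asks the interpreter to run it. As far as I understand, CPython does something of this shape for classes written in Python, which is why the C code can always call the slot without knowing where the behaviour was defined.

```python
GT = 4  # stand-in operation code for "greater than"


class TypeObject:
    """Stand-in for a C type struct: a table of 'function pointers'."""

    def __init__(self, name, richcompare, hash_slot, namespace=None):
        self.name = name
        self.richcompare = richcompare    # plays the role of tp_richcompare
        self.hash = hash_slot             # plays the role of the hashfunc slot
        self.namespace = namespace or {}  # methods defined by the user


class Obj:
    """Stand-in for PyObject: a pointer to the type plus some payload."""

    def __init__(self, ob_type, **fields):
        self.ob_type = ob_type
        self.fields = fields


# "Native" slot functions for a built-in integer-like type.
def int_richcompare(a, b, op):
    if op == GT:
        return a.fields["value"] > b.fields["value"]
    raise NotImplementedError(op)


def int_hash(a):
    return a.fields["value"]


int_type = TypeObject("int", int_richcompare, int_hash)


def call_user_function(func, *args):
    # A real interpreter would execute func's bytecode (or AST) here.
    return func(*args)


# Generic slot functions installed on every user-defined type.
DUNDER = {GT: "__gt__"}


def slot_richcompare(a, b, op):
    method = a.ob_type.namespace[DUNDER[op]]
    return call_user_function(method, a, b)


def slot_hash(a):
    return call_user_function(a.ob_type.namespace["__hash__"], a)


def make_class(name, namespace):
    return TypeObject(name, slot_richcompare, slot_hash, namespace)


# The interpreter core only ever goes through the slots.
def compare(a, b, op):
    return a.ob_type.richcompare(a, b, op)


def hash_object(a):
    return a.ob_type.hash(a)


version_type = make_class("Version", {
    "__gt__": lambda self, other: self.fields["parts"] > other.fields["parts"],
    "__hash__": lambda self: sum(self.fields["parts"]),
})

if __name__ == "__main__":
    v1 = Obj(version_type, parts=(1, 2))
    v2 = Obj(version_type, parts=(1, 10))
    print(compare(v2, v1, GT), hash_object(v1))
    print(compare(Obj(int_type, value=3), Obj(int_type, value=2), GT))
```

Note that compare and hash_object contain no special case for user-defined classes. The indirection lives entirely in the slot function that was stored in the type object when the class was created.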
An intermediate possibility, in common use these days, is to dynamically compile descriptions of programs (ASTs or VM lists or whatever) into actual machine code at runtime, taking into account what can be discovered about the actual dynamic types and values of the referenced variables and functions. That's called "just-in-time (JIT) compilation", and it can produce huge speedups at runtime, if it's implemented well. On the other hand, it's very hard to get it right, and discussing how to do it is well beyond the scope of a SO answer.
As a postscript, I understand from a different question that you are reading Robert Nystrom's book, Crafting Interpreters. That's probably a good way of learning these concepts, although I'm personally partial to a much older but still very current textbook, also freely available on the internet, Structure and Interpretation of Computer Programs, by Harold Abelson and Gerald Jay Sussman with Julie Sussman. The books are not really comparable, but both attempt to explain what it means to "interpret a program", and that's an extremely important concept, which probably cannot be communicated in four paragraphs (the size of this answer).
Whichever textbook you use, it's important to not just read the words. You must do the exercises, which is the only way to actually understand the underlying concepts. That's a lot more time-consuming, but it's also a lot more rewarding. One of the weaknesses of Nystrom's book (although I would still recommend it) is that it lays out a complete implementation for you. That's great if you understand the concepts and are looking for something which you can tweak into a rapid prototype, but it leaves open the temptation of skipping over the didactic material, which is the most important part for someone interested in learning how computer languages work.