I have a python script.
def hello(self):
    return 6
print hello()
Disassembling after compiling in CPython I get
>>> c = compile(open('hello.py').read(), 'hello.py', 'exec')
>>> import dis
>>> dis.dis(c)
  1           0 LOAD_CONST               0 (<code object hello at 0x1006c9230, file "hello.py", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (hello)

  3           9 LOAD_NAME                0 (hello)
             12 CALL_FUNCTION            0
             15 PRINT_ITEM
             16 PRINT_NEWLINE
             17 LOAD_CONST               1 (None)
             20 RETURN_VALUE
I'm curious how the <code object hello at 0x1006c9230 ...>
is stored inside the CPython code object. There is the co_code
attribute, but that only contains the bytecode instructions. If I serialize the CPython code object I get
>>> import marshal
>>> marshal.dumps(c)
'c\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00@\x00\x00\x00s\x15\x00\x00\x00d\x00\x00\x84\x00\x00Z\x00\x00e\x00\x00\x83\x00\x00GHd\x01\x00S(\x02\x00\x00\x00c\x01\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00s\n\x00\x00\x00d\x01\x00}\x01\x00|\x01\x00S(\x02\x00\x00\x00Ni\x06\x00\x00\x00(\x00\x00\x00\x00(\x02\x00\x00\x00t\x04\x00\x00\x00selft\x01\x00\x00\x00x(\x00\x00\x00\x00(\x00\x00\x00\x00s\x08\x00\x00\x00hello.pyt\x05\x00\x00\x00hello\x01\x00\x00\x00s\x04\x00\x00\x00\x00\x01\x06\x01N(\x01\x00\x00\x00R\x02\x00\x00\x00(\x00\x00\x00\x00(\x00\x00\x00\x00(\x00\x00\x00\x00s\x08\x00\x00\x00hello.pyt\x08\x00\x00\x00<module>\x01\x00\x00\x00s\x02\x00\x00\x00\t\x03'
I know that
def hello(self):
    return 6
is stored somewhere in the dump because if I change it to return 5
, one byte in the dump switches from 6 to 5.
1) Is there a way I can access the function body from the CPython code object? The closest I can get is c.names
but that only prints out a string. I'm assuming that behind the scenes it is a PyObject that is being serialized as a string. I would also like confirmation that the function body is indeed stored in c.names
.
2) Does marshal dump store the function as bytecode instructions or as an uncompiled literal? I'm leaning toward uncompiled literal, as I searched for the opcode \x83 (RETURN_VALUE) and it only appears once in the dump. I believe this implies that there is only one return statement when there should be two: one to exit the function hello and one to return None when exiting the script.
Version
Python 2.7.13+ (heads/2.7:96f5020597, May 26 2017, 15:26:13)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Let's break this down.
First, let me clarify how CPython stores functions. When a function definition is compiled, CPython stores the function's data in a code object. CPython uses code objects to represent functions, class bodies, and modules. A code object holds the compiled bytecode together with the metadata needed to run it.
A function's code object is stored in its __code__
attribute:
>>> def foo():
        pass
>>>
>>> foo.__code__
<code object foo at 0x7f8bd86ce5d0, file "<pyshell#14>", line 1>
>>>
These code objects contain various data related to the function, such as the function's arguments, the constants it references (such as 1
or "Hello"
), and its name. The bytecode of the function is stored in the .co_code
attribute. This is what is actually executed when CPython runs your function:
>>> def foo():
        pass
>>> foo.__code__.co_code
b'd\x00\x00S' # bytecode for foo
>>>
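As a minimal sketch of the attributes mentioned above, the snippet below inspects a function's code object. The function foo and its body are made up for illustration; the attribute names are the standard ones on CPython code objects.

```python
# Hypothetical function used only to have something to inspect.
def foo(a, b=2):
    greeting = "Hello"
    return greeting, a + b

code = foo.__code__

print(code.co_name)        # name of the function
print(code.co_argcount)    # number of positional arguments
print(code.co_varnames)    # argument names followed by local variable names
print(code.co_consts)      # constants referenced by the body
print(type(code.co_code))  # the raw bytecode is a byte string
```

The exact contents of co_consts and co_code differ between CPython versions, so treat the printed values as version-specific.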
Now that you understand the basics of what CPython does, we can address your specific questions.
Is there a way I can access the function body from the CPython code object? The closest I can get is c.names but that only prints out a string. I'm assuming that behind the scenes it is a PyObject that is being serialized as a string. I would also like confirmation that the function body is indeed stored in c.names.
The function body is not stored in the co_names
attribute of code objects; that attribute only holds the names referenced by the bytecode. The body is stored in the .co_code
attribute as described above. You are also a little bit off in your other assumption. Technically, since all objects in Python "inherit" from PyObject
, it would be correct to say the function body is a PyObject
serialized as a string. However, it'd be more precise to say that it is stored as a PyStringObject
, which is the specific type that represents strings.
Does marshal dump store the function as bytecode instructions or as an uncompiled literal? I'm leaning toward uncompiled literal, as I searched for the opcode \x83 (RETURN_VALUE) and it only appears once in the dump. I believe this implies that there is only one return statement when there should be two: one to exit the function hello and one to return None when exiting the script.
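Strictly speaking, neither. marshal.dumps()
takes a code object, serializes the entire code object, including any nested code objects and their bytecode, into a CPython-specific format, and returns a bytes object representing the serialized code object. Note that \x83 is not RETURN_VALUE: in Python 2.7 it is CALL_FUNCTION (opcode 131), while RETURN_VALUE is opcode 83 (0x53, the character S), which does appear twice in your dump. Your second statement is correct, however. At the end of every Python script, an implicit None
is returned. This can be observed by passing an empty string to dis.dis()
(on Python 3, where dis.dis() compiles source strings):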
>>> import dis
>>> dis.dis("")
  1           0 LOAD_CONST               0 (None)
              3 RETURN_VALUE
>>>
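To check that marshal.dumps() carries the nested function's bytecode along with the module, here is a rough sketch that round-trips a module code object through marshal. It assumes Python 3 syntax and compiles from an inline string (SOURCE is a stand-in for the contents of hello.py); the helper nested_code is hypothetical.

```python
import marshal
import types

# Stand-in for hello.py, written in Python 3 syntax (an assumption).
SOURCE = "def hello():\n    return 6\n\nhello()\n"

module_code = compile(SOURCE, "hello.py", "exec")
dump = marshal.dumps(module_code)
restored = marshal.loads(dump)


def nested_code(code):
    """Return the code objects stored among a code object's constants."""
    return [c for c in code.co_consts if isinstance(c, types.CodeType)]


original_hello = nested_code(module_code)[0]
restored_hello = nested_code(restored)[0]

# The function survives the round trip with its bytecode intact...
print(restored_hello.co_name)
print(restored_hello.co_code == original_hello.co_code)
# ...and its raw bytecode appears inside the serialized bytes.
print(original_hello.co_code in dump)
```

The dump itself is not portable across Python versions, so the bytes will not match the Python 2.7 dump shown in the question.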
I know for a fact that
<code object hello at 0x1006c9230 ...>
is not stored in the co_code attribute of the original c. This is because no matter how I change the inside of def hello() the same disassembler output is given. To be clear this is a function inside a function/script not just a function as you gave in your example.
In the case of your specific example, the variable c
is a code object which represents the module "hello.py", not the function. And you're right, the code object for the function hello()
is not in co_code
. It is stored in the module code object's co_consts
attribute:
>>> co = compile(open('hello.py').read(), 'hello.py', 'exec')
>>> co.co_consts
(<code object hello at 0x7fedcbd3dc00, file "hello.py", line 1>, 'hello', None)
>>>
This is because of how Python executes your code. Constants are not stored directly in a code object's bytecode. Rather, they are stored in their own separate tuple. Whenever a constant is referenced in a function's code, the actual constant is stored in co_consts
and an index which corresponds to the position of said constant in co_consts
is put in the byte code.
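One way to reach the function body, sketched under the assumption of Python 3 and an inline stand-in for hello.py, is to pull the nested code object out of co_consts, disassemble it directly, and wrap it in a function to call it. The names SOURCE, hello_code, and result are illustrative only.

```python
import dis
import types

# Stand-in for hello.py, written in Python 3 syntax (an assumption).
SOURCE = "def hello():\n    return 6\n"

module_code = compile(SOURCE, "hello.py", "exec")

# The function's code object is one of the module's constants.
hello_code = next(
    const for const in module_code.co_consts
    if isinstance(const, types.CodeType)
)

# Disassemble the function body itself rather than the module.
dis.dis(hello_code)

# Build a callable from the code object and an empty globals dict.
hello = types.FunctionType(hello_code, {})
result = hello()
print(result)
```

Building the function with types.FunctionType works here because hello() has no free variables and needs no globals.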
The reason why your disassembler output for hello()
's code object never changes is that all dis.dis()
does is display the string representation of the hello()
code object. The code object for hello()
does change when you change the code, but that change is not shown by dis
. It does not display the actual changed attributes of hello()
's code object.
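The effect can be reproduced with a small experiment. The sketch below assumes Python 3 and uses a hypothetical helper, compile_hello, to compile two versions of the script that differ only in the returned value.

```python
import types


def compile_hello(value):
    """Compile a module whose hello() returns `value` (hypothetical helper)."""
    source = "def hello():\n    return {}\n\nhello()\n".format(value)
    module_code = compile(source, "hello.py", "exec")
    hello_code = next(
        const for const in module_code.co_consts
        if isinstance(const, types.CodeType)
    )
    return module_code, hello_code


module_5, hello_5 = compile_hello(5)
module_6, hello_6 = compile_hello(6)

# The module-level bytecode is identical for both versions...
same_module_bytecode = module_5.co_code == module_6.co_code
# ...while the nested code objects for hello() differ.
same_hello_code = hello_5 == hello_6

print(same_module_bytecode, same_hello_code)
```

Whether the difference shows up in hello()'s co_code or in its co_consts depends on the CPython version, which is why the comparison is made on the whole code object.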