Search code examples
language-agnosticprogramming-languages

How are implemented classes in dynamic languages?


How are implemented classes in dynamic languages ?

I know that Javascript is using a prototype pattern (there is 'somewhere' a container of unbound JS functions, which are bind when calling them through an object), but I have no idea of how it works in other languages.

I'm curious about this, because I can't think of an efficient way to have native bound methods without wasting memory and/or cpu by copying members for each instance.

(By bound method, I mean that the following code should work :)

class Foo { function bar() : return 42; };

var test = new Foo();
var method = test.bar;
method() == 42;

Solution

  • This highly depends on the language and the implementation. I'll tell you what I know about CPython and PyPy.

    The general idea, which is also what CPython does for the most part, goes like this:

    1. Every object has a class, specifically a reference to that class object.
    2. Apart from instance members, which are obviously stored in the individual object, the class also has members. This includes methods, so methods don't have a per-object cost.
    3. A class has a method resolution order (MRO) determined by the inheritance relationships, wherein each base class occurs exactly once. If we didn't have multiple inheritance, this would simply be a reference to the base class, but this way the MRO is hard to figure out on the fly (you'd have to start from the most derived class every time).
    4. (Classes are also objects and have classes themselves, but we'll gloss over that for now.)
    5. If attribute lookup on an object fails, the same attribute is looked up on the classes in the MRO, in the order specified by the MRO. (This is the default behavior, which can be changed by defining magic methods like __getattr__ and __getattribute__.)

    So far so simple, and not really an explanation for bound methods. I just wanted to make sure we're talking about the same thing. The missing piece is descriptors. The descriptor protocol is defined in the "deep magic" section of the language reference, but the short and simple story is that lookup on a class can be hijacked by the object it results in via a __get__ method. More importantly, this __get__ method is told whether the lookup started on an instance or on the "owner" (the class).

    In Python 2, we have an ugly and unnecessary UnboundMethod descriptor which (apart from the __get__ method) simply wraps the function to throw errors on Class.method(self) if self is not of an acceptable type. In Python 3, the __get__ is simply part of all function objects, and unbound methods are gone. In both cases, the __get__ method returns itself when you look it up on a class (so you can use Class.method, which is useful in a few cases) and a "bound method" object when you look it up on an object. This bound method object does nothing more than storing the raw function and the instance, and passing the latter as first argument to the former in its __call__ (special method overriding the function call syntax).

    So, for CPython: While there is a cost to bound methods, it's smaller than you might think. Only two references are needed space-wise, and the CPU cost is limited to a small memory allocation, and an extra indirection when calling. Note though that this cost applies to all method calls, not just those which actually make use of bound method features. a.f() has to call the descriptor and use its return value, because in a dynamic language we don't know if it's monkey-patched to do something different.

    In PyPy, things are more interesting. As it's an implementation which doesn't compromise on correctness, the above model is still correct for reasoning about semantics. However, it's actually faster. Apart from the fact that the JIT compiler inlines and then eliminates the entire mess described above in most cases, they also tackle the problem on bytecode level. There are two new bytecode instructions, which preserve the semantics but omit the allocation of the bound method object in the case of a.f(). There is also a method cache which can simplify the lookup process, but requires some additional bookkeeping (though some of that bookkeeping is already done for the JIT).