Search code examples
r

Confusion about modify-in-place in R


One of the exercises in Advanced R section 2.5 on modify-in-place asks why the following code doesn't create a circular list:

x <- list()
x[[1]] <- x

Advanced R Solutions says that it isn't circular due to triggering copy-on-modify. However, at the beginning of the section Wickham said that "If an object has a single name bound to it, R will modify it in place." Why does modify-in-place not then apply here? My thinking was that perhaps when you set x[[1]] <- x you are binding a second name to the object, but it seems like that would also simultaneously unbind x from the empty list because it redefines x to be the list containing the empty list as its first element. Is it that R somehow doesn't register the unbinding of x until after the binding to x[[1]], or is there something else I'm missing here?


Solution

  • I somewhat disagree with the “correct” answer given by “Advanced R Solutions”. There are two facets to the answer: the defined behaviour, and the low-level implementation details.

    The fundamental reason why your code does not create a circular reference is because R has value semantics, not reference semantics.1 In R, if you bind a value (via assignment, or by passing it to a function), that value is copied. And that’s what’s happening here: the assignment x[[1]] <- x creates a copy of x that is assigned to x[[1]].

    And this is the answer. Everything else is implementation detail. But let’s dive in:

    The issue with a “naïve” implementation of value semantics is that it’s very inefficient. In fact, a lot of code makes copies all the time, and making copies of large objects is still very inefficient, even on modern CPU architectures. We generally want to avoid copying memory as much as possible.

    So R implements an optimisation: instead of copying a value as soon as it’s assigned to something else, R just creates an additional reference to the existing value and increments its reference count.

    The reference count is literally just an integer that tells R how many places a values is referred from. When attempting to modify a value, R checks its reference count. And if the reference count is 1, R performs the modification in-place. But if the reference count is >1, R first creates a copy of the value and then modifies that copy instead. This optimisation is known as copy-on-write (CoW).

    Note how the above doesn’t talk about “names”: because whether or not these references are bound to names doesn’t matter. What matters is the number of references to a value, not the number of names it’s bound to.

    To illustrate, here’s the situation before the assignment:

    +---+      [C]-[T]----.
    | x | ---> | 1 | list |
    +---+      ·----------·
    

    x is a variable that refers to a value. Values in R carry some additional information, including the reference count (C) and a flag that specifies the type (T) — this is drastically simplified but this is all that matters here.

    Now, as soon as we assign the value to another name, we increase the reference count; e.g.: y <- x would cause this situation:

    +---+
    | x | --.
    +---+   |   [C]-[T]----.
            +-> | 2 | list |
    +---+   |   ·----------·
    | y | --·
    +---+
    

    Now the reference count is 2, and any modification to the value would necessitate a copy, e.g. x[[1]] = 5:

    +---+      [C]-[T]----.---.
    | x | ---> | 1 | list | 5 |
    +---+      ·----------·---·
    
    +---+      [C]-[T]----.
    | y | ---> | 1 | list |
    +---+      ·----------·
    

    (Note how, after the copy, the reference counts go back to 1.)

    But the situation we are talking about here is more complicated. First, the assignment triggers an increase in the reference count before anything else is done:2

    +---+      [C]-[T]----.
    | x | ---> | 2 | list |
    +---+      ·----------·
    

    At this point the reference count is increased but the reference isn’t assigned yet, because in order to assign it to something (x[[1]]), the value first needs to be modified. This happens next, and it triggers a copy, since the reference count is >1:

               [C]-[T]----.
               | 1 | list |
               ·----------·
    
    +---+      [C]-[T]----.
    | x | ---> | 1 | list |
    +---+      ·----------·
    

    The newly created copy is the one that will be modified and assigned to x, as indicated by the arrow above. The original version of the value remains unmodified, and isn’t referred to by anything in the R code just yet; but the R interpreter has an internal variable that holds this reference.3

    Now the modification of the copied value happens. First, the list is extended to make place for the new value:

               [C]-[T]----.
               | 1 | list |
               ·----------·
    
    +---+      [C]-[T]----.---.
    | x | ---> | 1 | list |   |
    +---+      ·----------·---·
    

    And then, finally, the actual assignment happens, and the reference to the original value of x is put into the newly created element:

               [C]-[T]----.
               | 1 | list |-\
               ·----------· |
                            v
    +---+      [C]-[T]----.---.
    | x | ---> | 1 | list |   |
    +---+      ·----------·---·
    

    1 Except for environments.

    2 This happens deep inside the R interpreter, and it would happen regardless of how assignment is implemented in R. That is to say, the intermediate assignment to `*tmp*` — which does happen, as shown in Roland’s answer — is not required to trigger this. Where and how exactly this happens is a bit tricky, because depending on the exact situation there are multiple places for it. For this particular situation, I believe that it happens in applydefine, which is called when the LHS of an assignment is a complex expression (as is the case here). And interestingly it happens before `*tmp*` is defined, and then R disables reference counting for `*tmp*`, so that the presence of `*tmp*` does not actually affect whether or not we are creating a copy!

    3 In fact, after the copy is created, it’s assigned to `*tmp*`. I am skipping this step here to avoid having to draw another three diagrams, and it doesn’t change the outcome.