
What do cons cells store in R?


According to the Memory {base} help page in the R 4.1.0 documentation, R keeps two separate memory areas for "fixed" and "variable" sized objects. As I understand it, variable-sized objects are those the user can create in the work environment: vectors, lists, data frames, etc. However, the documentation is rather obscure when it comes to fixed-sized objects:

[Fixed-sized objects are] allocated as an array of cons cells (Lisp programmers will know what they are, others may think of them as the building blocks of the language itself, parse trees, etc.)[.]

Could someone provide an example of a fixed-sized object that is stored in a cons cell? For further reference, I know the function memory.profile() gives a profile of the usage of cons cells. For example, in my session the output looks like this:

> memory.profile()
       NULL      symbol    pairlist     closure environment     promise    language 
          1       23363      623630        9875        2619       13410      200666 
    special     builtin        char     logical     integer      double     complex 
         47         696       96915       16105      107138       10930          22 
  character         ...         any        list  expression    bytecode externalptr 
     130101           2           0       50180           1       42219        3661 
    weakref         raw          S4 
       1131        1148        1132 

What do these counts stand for, both numerically and conceptually? For instance, does logical: 16105 refer to 16,105 logical objects (bytes? cells?) that are stored in the source code or binaries of R?

My purpose is to better understand how R manages memory in a given session. I think I understand what a cons cell is, both in Lisp and in R, but if an answer needs to address that concept first, it won't hurt to start from there.


Solution

  • Background

    At C level, an R object is just a pointer to a block of memory called a "node". Each node is a C struct, either a SEXPREC or a VECTOR_SEXPREC. VECTOR_SEXPREC is for vector-like objects, including strings, atomic vectors, expression vectors, and lists. SEXPREC is for every other type of object.
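
    As a rough illustration from the R side, typeof reports an object's internal type, and these type names are the same ones memory.profile() uses as labels. Which C struct backs each type is not visible from R, so the grouping below simply follows the description above.

```r
# Types that are not vector-like (SEXPREC nodes, per the description above)
non_vector <- list(quote(a), quote(a + b), function(x) x, globalenv(), pairlist(1))

# Vector-like types (VECTOR_SEXPREC nodes followed by a data block)
vector_like <- list(TRUE, 1L, 1, "a", list(), expression(a))

non_vector_types  <- vapply(non_vector, typeof, character(1))
vector_like_types <- vapply(vector_like, typeof, character(1))

print(non_vector_types)
print(vector_like_types)

# Every one of these names also appears as a label in memory.profile()
all(c(non_vector_types, vector_like_types) %in% names(memory.profile()))
```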

    The SEXPREC struct has three contiguous segments:

    1. A header spanning 8 bytes, specifying the object's type and other metadata.
    2. Three pointers to other nodes, spanning (in total) 12 bytes on 32-bit systems and 24 bytes on 64-bit systems. The first points to a pairlist of the object's attributes. The second and third point to the previous and next node in a doubly linked list traversed by the garbage collector in order to free unused memory.
    3. Three more pointers to other nodes, again spanning 12 or 24 bytes, though what these point to varies by object type.
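
    For a concrete case of segment (3): in a pairlist node the three type-specific pointers are, per the R Internals manual, the CAR (the value), the CDR (the rest of the list), and the TAG (the name). Language objects are built from the same kind of node. A small sketch of how this surfaces at R level:

```r
# A pairlist: one node per element
pl <- pairlist(a = 1, b = "two")
typeof(pl)     # "pairlist"
names(pl)      # the tags of the two nodes: "a" "b"
pl[[1]]        # the value held by the first node
pl[-1]         # the rest of the list after the first node

# A call is stored the same way: function symbol first, then the arguments
cl <- quote(f(x, y))
typeof(cl)     # "language"
as.list(cl)    # the symbols f, x and y
```

    This is the sense in which parse trees count as fixed-sized objects: each node has the same size regardless of what it points to.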

    The VECTOR_SEXPREC struct has segments (1) and (2) above, followed by:

    1. Two integers spanning (in total) 8 bytes on 32-bit systems and 16 bytes on 64-bit systems. These specify the number of elements of the vector, conceptually and in memory.

    The VECTOR_SEXPREC struct is followed by a block of memory spanning at least 8+n*sizeof(<type>) bytes, where n is the length of the corresponding vector. The block consists of an 8-byte leading buffer, the vector "data" (i.e., the vector's elements), and sometimes a trailing buffer.
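
    The growth of the data block with n can be observed with object.size, which the help page describes as an estimate. A minimal sketch for double vectors, assuming sizeof(double) is 8 bytes; the exact figures depend on the build and on how R rounds small allocations, so none are quoted here:

```r
n <- c(0, 10, 100, 1000, 10000)

# Estimated size in bytes of a double vector of each length
bytes <- vapply(n, function(k) as.numeric(object.size(numeric(k))), numeric(1))

# 'overhead' is whatever is left after subtracting the raw data (8 bytes per double)
sizes <- data.frame(n = n, bytes = bytes, overhead = bytes - 8 * n)
print(sizes)
```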

    In summary, non-vectors are stored as a node spanning 32 or 56 bytes, while vectors are stored as a node spanning 28 or 48 bytes followed by a block of data of size roughly proportional to the number of elements. Hence nodes are of roughly fixed size, while vector data require a variable amount of memory.
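
    The node sizes follow from adding up the segments described above. A small sketch of that arithmetic, where node_bytes is a hypothetical helper (not part of R) and the two length fields are assumed to be pointer-sized, in line with the 8 and 16 byte totals given earlier:

```r
node_bytes <- function(pointer_bytes = .Machine$sizeof.pointer) {
  header        <- 8                  # segment (1): type and metadata
  attrib_and_gc <- 3 * pointer_bytes  # segment (2): attributes, GC list links
  type_specific <- 3 * pointer_bytes  # segment (3) of SEXPREC
  lengths       <- 2 * pointer_bytes  # the two length fields of VECTOR_SEXPREC
  c(SEXPREC        = header + attrib_and_gc + type_specific,
    VECTOR_SEXPREC = header + attrib_and_gc + lengths)
}

node_bytes(4)  # 32-bit pointers
node_bytes(8)  # 64-bit pointers
node_bytes()   # pointer size of the running build
```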

    Answer

    R allocates memory for nodes in blocks called Ncells (or cons cells) and memory for vector data in blocks called Vcells. According to ?Memory, each Ncell is 28 bytes on 32-bit systems and 56 bytes on 64-bit systems, and each Vcell is 8 bytes. Thus, this line in ?Memory:

    R maintains separate areas for fixed and variable sized objects.

    is actually referring to nodes and vector data, not R objects per se.

    memory.profile gives the number of cons cells used by all R objects in memory, stratified by object type. Hence sum(memory.profile()) will be roughly equal to gc(FALSE)[1L, "used"], which gives the total number of cons cells in use after a garbage collection.

    gc(FALSE)
    ##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
    ## Ncells 273996 14.7     667017 35.7         NA   414424 22.2
    ## Vcells 549777  4.2    8388608 64.0      16384  1824002 14.0
    
    sum(memory.profile())
    ## [1] 273934
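
    The Mb columns in the gc output above can be reproduced from the cell sizes quoted from ?Memory. A minimal sketch, assuming a 64-bit build (56-byte Ncells, 8-byte Vcells) and assuming gc rounds up to the next 0.1 Mb, which is consistent with the figures shown; cells_to_mb is a hypothetical helper, not part of R:

```r
cells_to_mb <- function(cells, bytes_per_cell) {
  mb <- cells * bytes_per_cell / 1024^2
  ceiling(10 * mb) / 10   # assumed rounding: up to one decimal place
}

cells_to_mb(273996, 56)   # Ncells in use
cells_to_mb(549777, 8)    # Vcells in use
cells_to_mb(667017, 56)   # Ncells gc trigger
```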
    

    When you assign a new R object, the number of Ncells and Vcells in use as reported by gc will increase. For example:

    gc(FALSE)[, "used"]
    ## Ncells Vcells 
    ## 273933 549662
    
    x <- Reduce(function(x, y) call("+", x, y), lapply(letters, as.name))
    x
    ## a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + 
    ##     q + r + s + t + u + v + w + x + y + z
    
    gc(FALSE)[, "used"]
    ## Ncells Vcells 
    ## 330337 676631
    

    You might be wondering why the number of Vcells in use increased, given that x is a language object, not a vector. The reason is that nodes are recursive: they contain pointers to other nodes, which may very well be vector nodes. Here, Vcells were allocated in part because each symbol in x points to a string (+ to "+", a to "a", and so on), and each of those strings is a vector of characters. (That said, it is surprising that ~125000 Vcells were required in this case. That may be an artifact of the Reduce and lapply calls, but I'm not really sure at the moment.)
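
    One way to make such before-and-after comparisons less noisy is to wrap them in a function that collects garbage first and keeps the new object alive while measuring. This is only a sketch: cells_used_by is a hypothetical helper, the deltas still include any temporaries that survive collection, and it does not by itself explain the large Vcell increase seen above.

```r
cells_used_by <- function(make) {
  invisible(gc(FALSE))                  # settle the heap before measuring
  before <- gc(FALSE)[, "used"]
  obj <- make()                         # keep the result referenced while we measure
  after <- gc(FALSE)[, "used"]
  after - before                        # named vector: Ncells, Vcells
}

# A long double vector should mostly cost Vcells (8 bytes per element)
cells_used_by(function() numeric(1e6))

# The language object from the example above
cells_used_by(function() Reduce(function(x, y) call("+", x, y), lapply(letters, as.name)))
```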

    References

    Everything is a bit scattered:

    • ?Memory, ?`Memory-limits`, ?gc, ?memory.profile, ?object.size.
    • The Writing R Extensions manual, for more about Ncells and Vcells.
    • The R Internals manual, for a complete description of the internal structure of R objects.