Tags: r, data.table, immutability, mutable

Functions that return mutable or immutable variables in R


I think this question is related to the concept of mutable vs. immutable objects in R, and it might be a "beginner question". I ran into this problem with the names() function and the setnames() function of the data.table package. I am sure this is the expected behaviour, but it surprised me, and I suspect it is not limited to names().

Imagine I have a data.table called dt with two columns a and b:

library(data.table)
dt <- data.table(a = 1:5, b = 1:5)
oldNames <- names(dt)

If you print oldNames it obviously shows:

oldNames
[1] "a" "b"

But if you change the names of dt with setnames():

setnames(dt,oldNames,c("aNew","bNew"))

The content of the variable oldNames has changed.

oldNames
[1] "aNew" "bNew"

I know that in Python this is the expected behaviour for some data types (the mutable ones) and not for others (the immutable ones). Does R have the same kind of dichotomy?

For me, the "expected" behaviour would be that the variable oldNames stores the names of the columns and it doesn't depend on the future changes of the data.table. For example, with the length() function this doesn't happen:

L <- length(dt)
L
[1] 2
dt[,c:=1:5]
L
[1] 2

Any explanation of this behaviour, or a link to good information about it, would be much appreciated. I would also like to know how to write the code so that oldNames does not change after dt is modified.


Solution

  • I believe this is due to how the data.table package is implemented. In R, multiple symbols can point to the same object, but this usually causes no problems because the object is copied when it is modified. For example:

    a <- c(1,2)
    b <- a
    # Check the memory addresses:
    # `a` and `b` point to the same object because neither has been modified yet.
    pryr::address(a)
    [1] "0x1a735620"
    pryr::address(b)
    [1] "0x1a735620"
    # Now assign a new value to `a`
    a <- c(1,3)
    # The address of `a` has changed,
    # but the address of `b` has not.
    pryr::address(a)
    [1] "0x1a72f168"
    pryr::address(b)
    [1] "0x1a735620"
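
    The snippet above rebinds `a` to a brand-new vector. As a minimal sketch in base R (the variable names are only illustrative), changing a single element through one name also leaves the other name untouched, which is what copy-on-modify promises:

    ```r
    a <- c(1, 2)
    b <- a          # `a` and `b` refer to the same vector for now

    a[2] <- 3       # modifying through `a` makes R copy the vector first

    print(a)        # 1 3
    print(b)        # 1 2, `b` still sees the original values
    ```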
    

    As far as I know, the data.table package is a special case, because some of its operations modify objects in place. See:

    library(data.table)
    dt <- data.table(a = 1:5, b = 1:5)
    n1 <- names(dt)
    
    pryr::address(n1)
    [1] "0x18aeffe0"
    
    setnames(dt, c("a","b"), c("aa","bb"))
    
    n2 <- names(dt)
    pryr::address(n2) # identical to the address of `n1`
    [1] "0x18aeffe0"
    
    n1
    [1] "aa" "bb"
    

    I think data.table does not recognize that another variable points to its names attribute, which causes the problem. I would consider it a bug, and you may want to tag the question with "data.table".

    For now, you can use n <- c(names(dt)) to store the names. The call to c() returns a new vector, so the stored names live at a different memory address.

    By the way, R does have mutable objects: see Reference Classes and R6 objects ;-)
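
    As a rough illustration of that remark, here is a small sketch using Reference Classes from the base `methods` package. The class `Counter`, its field `n`, and the method `increment` are hypothetical names made up for this example:

    ```r
    library(methods)

    # Hypothetical class used only for illustration
    Counter <- setRefClass(
      "Counter",
      fields = list(n = "numeric"),
      methods = list(
        increment = function() {
          n <<- n + 1      # modifies the object in place
          invisible(.self)
        }
      )
    )

    c1 <- Counter$new(n = 0)
    c2 <- c1               # no copy: both names refer to the same object
    c1$increment()
    print(c2$n)            # 1, the change is visible through `c2` as well
    ```

    If an independent object is needed, Reference Class objects provide a `$copy()` method, for example `c3 <- c1$copy()`.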

    Regards;

    Update:

    See ?data.table::copy and ?data.table::setnames

    To quote from ?data.table::copy:

    A ‘copy()’ may be required when doing ‘dt_names = names(DT)’. Due to R's copy-on-modify, ‘dt_names’ still points to the same location in memory as ‘names(DT)’. Therefore modifying ‘DT’ by reference now, say by adding a new column, ‘dt_names’ will also get updated. To avoid this, one has to explicitly copy: ‘dt_names <- copy(names(DT))’.
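
    A minimal sketch of that advice, assuming the data.table package is installed; the variable names `aliased` and `saved` are made up for illustration:

    ```r
    library(data.table)

    dt <- data.table(a = 1:5, b = 1:5)

    aliased <- names(dt)        # may share memory with the names of `dt`
    saved   <- copy(names(dt))  # explicit copy, independent of `dt`

    setnames(dt, c("a", "b"), c("aNew", "bNew"))

    print(aliased)    # may show the new names, as the quoted help page warns
    print(saved)      # "a" "b"
    print(names(dt))  # "aNew" "bNew"
    ```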

    Such in-place modifications are, of course, not common in R; data.table can do this because it uses R's C interface.

    Session Info:

    > sessionInfo()
    R version 3.3.2 (2016-10-31)
    Platform: x86_64-suse-linux-gnu (64-bit)
    Running under: openSUSE Tumbleweed
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] pryr_0.1.2          data.table_1.9.6    magrittr_1.5        personalutils_0.1.0
    
    loaded via a namespace (and not attached):
    [1] tools_3.3.2      Rcpp_0.12.9      stringi_1.1.2    codetools_0.2-15
    [5] stringr_1.1.0    chron_2.3-47