r, pass-by-reference, pass-by-value

Copy-on-modify semantics on a vector do not happen in a loop. Why?


This question seems to be partially answered here, but that answer is not specific enough for me. I would like to better understand when an object is updated by reference and when it is copied.

The simplest example is growing a vector. The following code is extremely inefficient in R because the memory is not allocated before the loop, so a copy is made at each iteration.

  x = runif(10)
  y = c() 

  for(i in 2:length(x))
    y = c(y, x[i] - x[i-1])

Preallocating the vector reserves the memory once, instead of reallocating it at each iteration. This code is therefore drastically faster, especially with long vectors.

  x = runif(10)
  y = numeric(length(x))

  for(i in 2:length(x))
    y[i] = x[i] - x[i-1]
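
To make the difference between the two loops above measurable, here is a minimal sketch that wraps each one in a function and times it with system.time(). The function names grow_diff and prealloc_diff and the vector length are my own choices for illustration, not part of the question, and the timings depend on the machine and R version.

```r
# Hypothetical helpers wrapping the two loops from the question.
grow_diff <- function(x) {
  y <- c()
  for (i in 2:length(x))
    y <- c(y, x[i] - x[i - 1])   # builds a new, longer vector every iteration
  y
}

prealloc_diff <- function(x) {
  y <- numeric(length(x))        # memory reserved once, before the loop
  for (i in 2:length(x))
    y[i] <- x[i] - x[i - 1]      # writes into an existing slot
  y
}

set.seed(1)
x <- runif(10000)

t_grow     <- system.time(y_grow <- grow_diff(x))
t_prealloc <- system.time(y_pre  <- prealloc_diff(x))

print(t_grow["elapsed"])
print(t_prealloc["elapsed"])

# Both versions compute the same differences; the preallocated one
# simply keeps an unused 0 in its first slot.
stopifnot(identical(y_grow, y_pre[-1]))
```

Note that the two results do not have the same length: the grown vector has length(x) - 1 elements, while the preallocated one has length(x) elements with y[1] left at 0.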

And here comes my question. When a vector is updated, it actually does move: a copy is made, as shown below.

a = 1:10
tracemem(a)
[1] "<0xf34a268>"
a[1] <- 0L
tracemem[0xf34a268 -> 0x4ab0c3f8]:
a[3] <- 0L
tracemem[0x4ab0c3f8 -> 0xf2b0a48]:  

But in a loop this copy does not occur:

y = numeric(length(x))
for(i in 2:length(x))
{
   y[i] = x[i] - x[i-1]
   print(pryr::address(y))
}

This gives:

[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0" 

I understand why code is slow or fast depending on its memory allocations, but I don't understand R's logic. Why and how, for the same statement, is the update made by reference in one case and by copy in the other? In the general case, how can we know what will happen?


Solution

  • I complement @MikeH.'s answer with this code:

    library(pryr)
    
    x = runif(10)
    y = numeric(length(x))
    print(c(address(y), refs(y)))
    
    for(i in 2:length(x))
    {
      y[i] = x[i] - x[i-1]
      print(c(address(y), refs(y)))
    }
    
    print(c(address(y), refs(y)))
    

    The output clearly shows what happened:

    [1] "0x7872180" "2"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1"        
    [1] "0x765b860" "1" 
    [1] "0x765b860" "2"  
    

    There is a copy at the first iteration because, under RStudio, y has 2 refs at that point. After this first copy, y is only used inside the running loop, so RStudio does not create any additional ref and no copy is made during the following updates: y is updated by reference. On loop exit, y becomes visible to RStudio again, which creates an extra ref, but this obviously does not change the address.
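
The role of the reference count described above can be reproduced without RStudio or pryr. The following is a minimal sketch in base R, assuming it is run as a plain script; the variable b is introduced only to create the second reference that RStudio creates in the answer's setup. tracemem() needs an R build with memory profiling, hence the capabilities("profmem") guard.

```r
# Copy-on-modify sketch: a second reference forces a copy on the first write.
a <- c(1, 2, 3)
b <- a            # 'a' and 'b' now point to the same vector (two references)

# Report copies of the traced vector, if this R build supports it.
if (capabilities("profmem")) tracemem(a)

a[1] <- 0         # the vector is shared, so R copies it before writing
a[2] <- 0         # 'a' now has its own vector; no second reference forces a copy

if (capabilities("profmem")) untracemem(a)

print(a)  # [1] 0 0 3
print(b)  # [1] 1 2 3  (untouched by the writes to 'a')
```

With tracing available, the first assignment is expected to print a tracemem[... -> ...] line. Whether the second assignment is silent depends on the R version and on how the code is run, which is exactly the effect the RStudio refs above illustrate.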