How to create a matrix with different repeats of values in a vector


I have a really large data set, so I am trying to summarize my question with a small example below.

Let's say I have a 3x3 matrix named X, with column names a, b, and c.

X = (1, 10, 0.1,
     2, 20, 0.2,
     3, 30, 0.3)

where a = c(1, 2, 3) gives the number of times to repeat, b = c(10, 20, 30) gives the actual values to repeat, and c = c(0.1, 0.2, 0.3) gives the values used to fill out the rest of the row when the count in a is less than 4 (the number of columns of matrix Y).

My goal is to generate a 3x4 matrix Y, which should look like this:

Y = (10, 0.1, 0.1, 0.1,
     20,  20, 0.2, 0.2,
     30,  30,  30, 0.3)

I understand that there might be many ways to do this example, but as my real data is really large (X has a million rows, and Y has 480 columns), I really have to do this without loops (like 480 iterations). I have tried using the function rep, but still could not do this.
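
For reference, the small example above can be written as runnable R. This is only a sketch of the input and the hand-written target; the name Y_expected is introduced here for illustration and is not part of the original data.

```r
# Example input from the question; columns a, b, c as described above.
X <- matrix(c(1, 10, 0.1,
              2, 20, 0.2,
              3, 30, 0.3),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

# Target output, typed out by hand so that a solution can be checked against it.
Y_expected <- matrix(c(10, 0.1, 0.1, 0.1,
                       20,  20, 0.2, 0.2,
                       30,  30,  30, 0.3),
                     nrow = 3, byrow = TRUE)
```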


Solution

    It wasn't easy, but I figured out a way to accomplish this task using a single vectorized call to rep(), plus some scaffolding code:

    XR <- 3;
    YC <- 4;
    X <- matrix(c(1:XR%%(YC+1),seq(10,by=10,length.out=XR),seq(0.1,by=0.1,length.out=XR)),XR,dimnames=list(NULL,c('rep','val','fill')));
    X;
    ##      rep val fill
    ## [1,]   1  10  0.1
    ## [2,]   2  20  0.2
    ## [3,]   3  30  0.3
    Y <- matrix(rep(t(X[,c('val','fill')]),times=c(rbind(X[,'rep'],YC-X[,'rep']))),XR,byrow=T);
    Y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,]   10  0.1  0.1  0.1
    ## [2,]   20 20.0  0.2  0.2
    ## [3,]   30 30.0 30.0  0.3
    

    (Minor point: I opted to assign column names rep val fill to X, rather than a b c as specified in the question, and I used those column names in my solution when indexing X (rather than using numeric indexes), for the reason that I generally prefer maximizing human-readability wherever possible, but this detail is negligible with respect to correctness and performance of the solution.)

    Performance

    This actually has a significant performance benefit over @josilber's solution, because he uses apply() which internally loops over the rows of the matrix (traditionally called a "hidden loop" in R-speak), whereas the core of my solution is a single vectorized call to rep(). I don't say this to knock @josilber's solution, which is a good one (and I even gave him an upvote!); it's just not the best possible solution for this problem.
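
    To make the "hidden loop" concrete, here is a rough sketch of what the apply() approach amounts to: one R-level call to rep() per row. The names Y_loop and Y_apply are made up for this illustration, and the input is deliberately tiny.

```r
# Small input so the sketch runs instantly; column names follow the X used above.
YC <- 4
X <- cbind(rep = c(1, 2, 3), val = c(10, 20, 30), fill = c(0.1, 0.2, 0.3))

# Explicit version of the per-row work: one rep() call for every row of X.
Y_loop <- matrix(NA_real_, nrow(X), YC)
for (i in seq_len(nrow(X))) {
  Y_loop[i, ] <- rep(X[i, c("val", "fill")],
                     times = c(X[i, "rep"], YC - X[i, "rep"]))
}

# The apply() form performs the same per-row calls, just without a visible for().
Y_apply <- t(apply(X, 1, function(x) rep(x[2:3], c(x[1], YC - x[1]))))
```

    With a million rows, either form makes a million separate rep() calls, which is where the time goes.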

    Here's a demo of the performance benefit using the hefty parameters you indicated in your question:

    XR <- 1e6;
    YC <- 480;
    X <- matrix(c(1:XR%%(YC+1),seq(10,by=10,length.out=XR),seq(0.1,by=0.1,length.out=XR)),XR,dimnames=list(NULL,c('rep','val','fill')));
    X;
    ##        rep  val fill
    ##   [1,]   1   10  0.1
    ##   [2,]   2   20  0.2
    ##   [3,]   3   30  0.3
    ##   [4,]   4   40  0.4
    ##   [5,]   5   50  0.5
    ##   [6,]   6   60  0.6
    ##   [7,]   7   70  0.7
    ##   [8,]   8   80  0.8
    ##   [9,]   9   90  0.9
    ##  [10,]  10  100  1.0
    ##  [11,]  11  110  1.1
    ##  [12,]  12  120  1.2
    ##  [13,]  13  130  1.3
    ##
    ## ... (snip) ...
    ##
    ## [477,] 477 4770 47.7
    ## [478,] 478 4780 47.8
    ## [479,] 479 4790 47.9
    ## [480,] 480 4800 48.0
    ## [481,]   0 4810 48.1
    ## [482,]   1 4820 48.2
    ## [483,]   2 4830 48.3
    ## [484,]   3 4840 48.4
    ## [485,]   4 4850 48.5
    ## [486,]   5 4860 48.6
    ## [487,]   6 4870 48.7
    ## [488,]   7 4880 48.8
    ## [489,]   8 4890 48.9
    ## [490,]   9 4900 49.0
    ## [491,]  10 4910 49.1
    ## [492,]  11 4920 49.2
    ##
    ## ... (snip) ...
    ##
    ## [999986,] 468  9999860  99998.6
    ## [999987,] 469  9999870  99998.7
    ## [999988,] 470  9999880  99998.8
    ## [999989,] 471  9999890  99998.9
    ## [999990,] 472  9999900  99999.0
    ## [999991,] 473  9999910  99999.1
    ## [999992,] 474  9999920  99999.2
    ## [999993,] 475  9999930  99999.3
    ## [999994,] 476  9999940  99999.4
    ## [999995,] 477  9999950  99999.5
    ## [999996,] 478  9999960  99999.6
    ## [999997,] 479  9999970  99999.7
    ## [999998,] 480  9999980  99999.8
    ## [999999,]   0  9999990  99999.9
    ## [1e+06,]    1 10000000 100000.0
    josilber <- function() t(apply(X,1,function(x) rep(x[2:3],c(x[1],YC-x[1]))));
    bgoldst <- function() matrix(rep(t(X[,c('val','fill')]),times=c(rbind(X[,'rep'],YC-X[,'rep']))),XR,byrow=T);
    system.time({ josilber(); });
    ##    user  system elapsed
    ##  65.719   3.828  71.623
    system.time({ josilber(); });
    ##    user  system elapsed
    ##  60.375   2.609  66.724
    system.time({ bgoldst(); });
    ##    user  system elapsed
    ##   5.422   0.593   6.033
    system.time({ bgoldst(); });
    ##    user  system elapsed
    ##   5.203   0.797   6.002
    

    And just to prove that @josilber and I are getting the exact same result, even for this large input:

    identical(bgoldst(),josilber());
    ## [1] TRUE
    

    Explanation

    Now I shall attempt to explain how the solution works. For the explanation, I'll use the following input:

    XR <- 6;
    YC <- 4;
    X <- matrix(c(1:XR%%(YC+1),seq(10,by=10,length.out=XR),seq(0.1,by=0.1,length.out=XR)),XR,dimnames=list(NULL,c('rep','val','fill')));
    X;
    ##      rep val fill
    ## [1,]   1  10  0.1
    ## [2,]   2  20  0.2
    ## [3,]   3  30  0.3
    ## [4,]   4  40  0.4
    ## [5,]   0  50  0.5
    ## [6,]   1  60  0.6
    

    for which the solution is:

    Y <- matrix(rep(t(X[,c('val','fill')]),times=c(rbind(X[,'rep'],YC-X[,'rep']))),XR,byrow=T);
    Y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,] 10.0  0.1  0.1  0.1
    ## [2,] 20.0 20.0  0.2  0.2
    ## [3,] 30.0 30.0 30.0  0.3
    ## [4,] 40.0 40.0 40.0 40.0
    ## [5,]  0.5  0.5  0.5  0.5
    ## [6,] 60.0  0.6  0.6  0.6
    

    At a high level, the solution is built around forming a single vector which combines the val and fill vectors, then repeats that combined vector in a certain way, and then builds a new matrix out of the result.

    The repetition step can be done using a single call of rep() because it supports vectorized repetition counts. In other words, for a given vector input x, it can take a vector input for times which specifies how many times to repeat each element of x. Thus, the challenge just becomes constructing the appropriate x and times arguments.
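
    As a minimal illustration of the vectorized times argument (the letters here are arbitrary example data):

```r
x <- c("a", "b", "c")

# One count per element of x: "a" twice, "b" zero times, "c" three times.
per_element <- rep(x, times = c(2, 0, 3))
per_element
## [1] "a" "a" "c" "c" "c"

# A single count repeats the whole vector instead.
whole <- rep(x, times = 2)
whole
## [1] "a" "b" "c" "a" "b" "c"
```

    Note that a count of zero simply drops the element, which is what lets the solution handle rows where rep is 0 or equal to YC.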

    So, the solution begins by extracting the val and fill columns of X:

    X[,c('val','fill')];
    ##      val fill
    ## [1,]  10  0.1
    ## [2,]  20  0.2
    ## [3,]  30  0.3
    ## [4,]  40  0.4
    ## [5,]  50  0.5
    ## [6,]  60  0.6
    

    As you can see, since we've indexed two columns, we still have a matrix, even though we didn't specify drop=F to the index operation (see R: Extract or Replace Parts of an Object). This is convenient, as will be seen.

    In R, underneath the "matrix persona" of a matrix is really just a plain old atomic vector, and the "vector persona" of the matrix can be leveraged for vectorized operations. This is how we can pass the val and fill data to rep() and have those elements repeated appropriately.
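
    A quick sketch of the two "personas", using a throwaway matrix m:

```r
m <- 1:6
dim(m) <- c(2, 3)      # attaching a dim attribute is what makes it a matrix
is.matrix(m)
## [1] TRUE
length(m)              # still six elements underneath
## [1] 6

flat <- rep(m, times = 2)   # rep() works on the underlying vector and drops dim
flat
##  [1] 1 2 3 4 5 6 1 2 3 4 5 6
```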

    However, when doing this, it is important to understand exactly how the matrix is treated as a vector. R stores matrices in column-major order: the vector is formed by running down the rows of the first column, then the rows of the second column, and so on. (For higher-dimensional arrays the same rule extends: the first index varies fastest, then the second, then the third, etc.)
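
    The flattening order is easy to check on a small matrix (m is just example data):

```r
m <- matrix(1:6, nrow = 2)   # matrix() fills column by column by default
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

down_columns <- c(m)         # flattening walks down each column in turn
down_columns
## [1] 1 2 3 4 5 6

along_rows <- c(t(m))        # transposing first yields the row-by-row order
along_rows
## [1] 1 3 5 2 4 6
```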

    If you look carefully at the above matrix, you'll see that it cannot be used directly as our x argument to rep(), because flattening it yields all of the val values first, followed by all of the fill values. We could fairly easily construct a times argument to repeat each element the correct number of times, but the resulting vector would be out of order, and no simple reshape would turn it into the desired matrix Y.

    Actually, why don't I demonstrate this quickly before moving on with the explanation:

    rep(X[,c('val','fill')],times=c(X[,'rep'],YC-X[,'rep']))
    ##  [1] 10.0 20.0 20.0 30.0 30.0 30.0 40.0 40.0 40.0 40.0 60.0  0.1  0.1  0.1  0.2  0.2  0.3  0.5  0.5  0.5  0.5  0.6  0.6  0.6
    

    Although the above vector has all the right elements in all the right repetitions, the order is such that it cannot form the desired output matrix Y.
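
    To see the problem concretely, here is a sketch that tries to reshape the out-of-order vector anyway. The names wrong and bad are introduced only for this illustration; X is rebuilt with cbind() so that the snippet stands alone.

```r
XR <- 6
YC <- 4
X <- cbind(rep  = c(1, 2, 3, 4, 0, 1),
           val  = seq(10, 60, by = 10),
           fill = seq(0.1, 0.6, by = 0.1))

# Same call as above: all vals first, then all fills.
wrong <- rep(X[, c("val", "fill")], times = c(X[, "rep"], YC - X[, "rep"]))

# Reshaping by row just chops the vector into runs of four,
# so values belonging to different input rows land in the same output row.
bad <- matrix(wrong, XR, byrow = TRUE)
bad[1, ]   # 10 20 20 30, whereas the desired first row is 10 0.1 0.1 0.1
```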

    So, we can solve this by first transposing the extract:

    t(X[,c('val','fill')]);
    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## val  10.0 20.0 30.0 40.0 50.0 60.0
    ## fill  0.1  0.2  0.3  0.4  0.5  0.6
    

    Now the val and fill values are interleaved: each column holds one val followed by its corresponding fill. When the matrix is flattened to a vector (which happens whenever it is passed to a function that treats it as a plain vector, as rep() does with its x argument), the val and fill values come out in the proper order for rebuilding a matrix from them. Let me demonstrate this by explicitly flattening the matrix to a vector (as you can see, this "flattening" can be done with a simple c() call):

    c(t(X[,c('val','fill')]));
    ##  [1] 10.0  0.1 20.0  0.2 30.0  0.3 40.0  0.4 50.0  0.5 60.0  0.6
    

    So, we have our x argument. Now we just need to construct the times argument.

    This was actually fairly tricky to figure out. First we can recognize that the repetition counts for the val values are provided directly in the rep column of X, so we have that in X[,'rep']. And the repetition counts for the fill values can be computed from the difference between the number of columns in the output matrix Y, which I've captured in YC, and the aforementioned repetition counts for val, or IOW, YC-X[,'rep']. The problem is, we need to interleave those two vectors to line up with our x argument.

    I am not aware of any "built-in" way to interleave two vectors in R; there doesn't appear to be any function that does it. When working on this problem, I came up with two different possible solutions for this task, one of which appears to be better in terms of both performance and concision. But since I wrote my original solution to use the "worse" one, and only later (while writing this explanation, actually) thought of the second and "better" one, I'll explain both approaches here, starting with the first and worse one.

    Interleaving Solution #1

    Interleaving two vectors can be done by combining the vectors sequentially, and then indexing that combined vector with a carefully crafted index vector which basically jumps back-and-forth from the first half to the second half of the combined vector, sequentially pulling out each element of each half in an alternating fashion.

    To construct this index vector, I begin with a sequence running from 1 to half the length of the combined vector, with each element appearing twice:

    rep(1:nrow(X),each=2);
    ##  [1] 1 1 2 2 3 3 4 4 5 5 6 6
    

    Next, I add to that a two-element vector consisting of 0 and half the length of the combined vector:

    nrow(X)*0:1;
    ## [1] 0 6
    

    The second addend is recycled along the first addend, achieving the interleaving we need:

    rep(1:nrow(X),each=2)+nrow(X)*0:1;
    ##  [1]  1  7  2  8  3  9  4 10  5 11  6 12
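
    The recycling at work here can be seen on a shorter example. This is a sketch with made-up data; idx, offs and combined are illustrative names.

```r
idx  <- rep(1:3, each = 2)   # 1 1 2 2 3 3
offs <- c(0, 3)              # 0 selects the first half, 3 jumps to the second half

# The two-element offs is recycled to 0 3 0 3 0 3 before the addition.
pos <- idx + offs
pos
## [1] 1 4 2 5 3 6

# Indexing a combined vector with pos interleaves its two halves.
combined <- c("a", "b", "c", "x", "y", "z")
combined[pos]
## [1] "a" "x" "b" "y" "c" "z"
```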
    

    And thus we can index the combined repetition vector to get our times argument:

    c(X[,'rep'],YC-X[,'rep'])[rep(1:nrow(X),each=2)+nrow(X)*0:1];
    ##  [1] 1 3 2 2 3 1 4 0 0 4 1 3
    

    Interleaving Solution #2

    Interleaving two vectors can also be accomplished by combining the two vectors into a matrix and then flattening them once again, in such a way that they naturally become interleaved. I believe the easiest way to do this is to rbind() them together and then flatten them immediately with c():

    c(rbind(X[,'rep'],YC-X[,'rep']));
    ##  [1] 1 3 2 2 3 1 4 0 0 4 1 3
    

    Based on some cursory performance testing, it appears solution #2 is more performant, and it is clearly more concise. Also, additional vectors can be tacked on to the rbind() call very easily, whereas extending solution #1 would be slightly more involved (the repeat count and the offsets would both have to be adjusted).
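
    For instance, interleaving three vectors with the rbind() approach might look like this (a, b and d are arbitrary example vectors of equal length):

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)
d <- c(100, 200, 300)

# Each vector becomes one row; flattening then reads the matrix
# column by column, which visits a, b, d in turn.
interleaved <- c(rbind(a, b, d))
interleaved
## [1]   1  10 100   2  20 200   3  30 300
```

    This assumes the vectors have equal lengths; rbind() would recycle a shorter one rather than stop.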

    Performance testing (using the large dataset):

    il1 <- function() c(X[,'rep'],YC-X[,'rep'])[rep(1:nrow(X),each=2)+nrow(X)*0:1];
    il2 <- function() c(rbind(X[,'rep'],YC-X[,'rep']));
    identical(il1(),il2());
    ## [1] TRUE
    system.time({ replicate(30,il1()); });
    ##    user  system elapsed
    ##   3.750   0.000   3.761
    system.time({ replicate(30,il1()); });
    ##    user  system elapsed
    ##   3.810   0.000   3.815
    system.time({ replicate(30,il2()); });
    ##    user  system elapsed
    ##   1.516   0.000   1.512
    system.time({ replicate(30,il2()); });
    ##    user  system elapsed
    ##   1.500   0.000   1.503
    

    And so the full rep() call gives us our data in the proper order:

    rep(t(X[,c('val','fill')]),times=c(rbind(X[,'rep'],YC-X[,'rep'])));
    ##  [1] 10.0  0.1  0.1  0.1 20.0 20.0  0.2  0.2 30.0 30.0 30.0  0.3 40.0 40.0 40.0 40.0  0.5  0.5  0.5  0.5 60.0  0.6  0.6  0.6
    

    The last step is to build a matrix out of it, using byrow=T, because rep() returned the data one output row after another. We must also specify the required number of rows, which is the same as the number of rows of the input matrix, XR (alternatively, we could specify the number of columns, YC, or even both, if we wanted):

    Y <- matrix(rep(t(X[,c('val','fill')]),times=c(rbind(X[,'rep'],YC-X[,'rep']))),XR,byrow=T);
    Y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,] 10.0  0.1  0.1  0.1
    ## [2,] 20.0 20.0  0.2  0.2
    ## [3,] 30.0 30.0 30.0  0.3
    ## [4,] 40.0 40.0 40.0 40.0
    ## [5,]  0.5  0.5  0.5  0.5
    ## [6,] 60.0  0.6  0.6  0.6
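
    The effect of byrow is easiest to see on a tiny vector laid out row after row (v is example data):

```r
v <- c(1, 1, 1, 2, 2, 2)   # two rows' worth of data, one row after the other

by_row <- matrix(v, nrow = 2, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2

by_col <- matrix(v, nrow = 2)   # the default column-wise fill mixes the rows
by_col
##      [,1] [,2] [,3]
## [1,]    1    1    2
## [2,]    1    2    2
```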
    

    And we're done!
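
    If this needs to be reused, the whole thing could be wrapped in a small function. The following is only a sketch: expand_rows() and its argument names are hypothetical, and the up-front checks are an addition of mine, prompted by the fact that rep() rejects negative counts.

```r
# Hypothetical helper: counts, val and fill play the roles of the rep, val and
# fill columns above; ncol_out plays the role of YC.
expand_rows <- function(counts, val, fill, ncol_out) {
  stopifnot(length(counts) == length(val), length(val) == length(fill))
  # Counts outside 0..ncol_out would produce negative times for rep().
  stopifnot(all(counts >= 0), all(counts <= ncol_out))
  matrix(rep(c(rbind(val, fill)),                      # val1 fill1 val2 fill2 ...
             times = c(rbind(counts, ncol_out - counts))),
         nrow = length(counts), byrow = TRUE)
}

Y <- expand_rows(c(1, 2, 3), c(10, 20, 30), c(0.1, 0.2, 0.3), 4)
Y
##      [,1] [,2] [,3] [,4]
## [1,]   10  0.1  0.1  0.1
## [2,]   20 20.0  0.2  0.2
## [3,]   30 30.0 30.0  0.3
```

    Here c(rbind(val, fill)) stands in for the transpose-and-flatten step, since both produce the same interleaved vector.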