Subtracting row 2 from row 1 repeatedly

In R, I want to create a column in my data set where I subtract row 2 from row 1, row 4 from row 3, and so forth. Moreover, I want the result of each subtraction to be repeated on both rows involved (e.g. if the result of row2 - row1 is -0.294803, I want this value to appear in both row 1 and row 2, and likewise for every other pair).

Here is my data set.

I tried the aggregate function but did not succeed. Any hints?

Solution

solution 1

One way to do that is with a simple loop:

(download mydata.csv)

a = read.table("mydata.csv", header=TRUE, sep=";")
a$delta = NA
for(i in seq(1, nrow(a), by=2)){
  a[i,"delta"] = a[i+1,"delta"] = a[i+1,"log"] - a[i,"log"]
}

What is going on here is that the for loop iterates over every odd row number (that's what seq(..., by=2) does). So for the first, third, fifth, etc. row we compute the difference between the following row and that row, and assign it to both rows of the pair. Note that this assumes an even number of rows. The result is:

> a
   su    match  log delta
1   1    match 5.80  0.30
2   1 mismatch 6.10  0.30
3   2    match 6.09 -0.04
4   2 mismatch 6.05 -0.04
5   3    match 6.42 -0.12
6   3 mismatch 6.30 -0.12
7   4    match 6.20 -0.20
8   4 mismatch 6.00 -0.20
9   5    match 5.90  0.19
10  5 mismatch 6.09  0.19
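
For readers without mydata.csv, the loop can be tried on data re-typed from the printed output above. This is only a minimal sketch: the log values are the rounded ones shown in the table, and the even-row check is an added safeguard that is not part of the original answer.

```r
# Data re-typed from the printed output (log rounded to two decimals),
# so the example runs without mydata.csv.
a <- data.frame(
  su    = rep(1:5, each = 2),
  match = rep(c("match", "mismatch"), times = 5),
  log   = c(5.80, 6.10, 6.09, 6.05, 6.42, 6.30, 6.20, 6.00, 5.90, 6.09)
)

# Pairing consecutive rows only makes sense for an even number of rows.
stopifnot(nrow(a) %% 2 == 0)

a$delta <- NA_real_
for (i in seq(1, nrow(a), by = 2)) {
  d <- a[i + 1, "log"] - a[i, "log"]  # second row of the pair minus the first
  a[c(i, i + 1), "delta"] <- d        # store the difference on both rows
}

print(a)
```

Without the check, a data set with an odd number of rows would leave the last row with NA, because a[i + 1, "log"] is NA past the end of the data frame.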

solution 2

If you have a lot of data, this approach can be slow. In general, R code is written more idiomatically with another kind of iterative function, the apply family, than with explicit loops.

The same computation can be written like this:

a$delta = rep(
             sapply(seq(1, nrow(a), by=2 ),
                 function(i){ a[i+1,"log"] - a[i,"log"] }
             ),
          each=2)

This gives exactly the same result as the first solution. It may be faster, although sapply still calls the function once per pair, and it is somewhat less intuitive.
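
If speed matters, the per-pair function calls can be avoided altogether. The following is a sketch under the same assumption as above, namely that rows come in consecutive pairs; the ave() variant additionally assumes that every su value has exactly two rows, in match/mismatch order. The inline data are re-typed from the printed output and rounded to two decimals.

```r
# Data re-typed from the printed output (log rounded to two decimals).
a <- data.frame(
  su    = rep(1:5, each = 2),
  match = rep(c("match", "mismatch"), times = 5),
  log   = c(5.80, 6.10, 6.09, 6.05, 6.42, 6.30, 6.20, 6.00, 5.90, 6.09)
)

# Logical indices are recycled, so these pick the odd and the even rows.
first  <- a$log[c(TRUE, FALSE)]  # rows 1, 3, 5, ...
second <- a$log[c(FALSE, TRUE)]  # rows 2, 4, 6, ...

# One subtraction on whole vectors, then repeat each result for both rows.
a$delta <- rep(second - first, each = 2)

# Alternative that groups by su instead of relying on row position:
# ave() returns a vector as long as its input, recycling the single
# value computed for each group.
a$delta_by_su <- ave(a$log, a$su, FUN = function(x) x[2] - x[1])

print(a)
```

The positional version depends only on row order, while the ave() version depends on the su column identifying each pair; which is safer depends on how the data were assembled.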

solution 3

Finally, it looks to me as though the long data frame format makes this more convoluted than it needs to be for your kind of data. I'd reshape it to wide format and then work with separate columns, which is more natural and avoids duplicating values.

Like this:

a = read.table("mydata.csv", header=TRUE, sep=";")
a = reshape(a, idvar = "su", timevar = "match", direction = "wide")
# creating the column you want is now very simple:
a$delta = a$log.mismatch - a$log.match

Which returns:

> a
   su log.match log.mismatch delta
1  1      5.80         6.10  0.30
3  2      6.09         6.05 -0.04
5  3      6.42         6.30 -0.12
7  4      6.20         6.00 -0.20
9  5      5.90         6.09  0.19

The delta column contains the values you need. If you really need the long format for further analysis you can always go back with:

a = reshape(a, idvar = "su", timevar = "match", direction = "long")
#sort to original order:
a = a[with(a, order(su)), ]