Search code examples
rcomparerowadditionrow-number

compare data.frames with different number for rows


I have two dataframes:

df1<- data.frame(POS=c(1, 2, 3, 4, 5),
             A=c(0.2,0,0,0.8,0),
             G=c(0,0.3,0,0.2,0.5),
             T=c(0.4,0.7,1,0,0.5),
             C=c(0.4,0,0,0,0))

df2<- data.frame(POS=c(1, 3, 4, 5, 6),
             A=c(0.3,0,0.6,0,0),
             G=c(0,0,0.4,0.7,1),
             T=c(0.2,1,0,0.3,0),
             C=c(0.5,0,0,0,0))

I'd like to get the summed squared differences for each POS

The first thing I need are two matrices of the same dimensions. How do I add a row for the missing positions filled with 0's?

Once I have the two matrices, I'll do (df1-df2)^2 followed by a rowsums approach.

To summarize: How do I add a new row filled with zeros for the missing rows in each data.frame?

like this:

POS A   G   T   C
1 0.2 0.0 0.4 0.4
2 0.0 0.3 0.7 0.0
3 0.0 0.0 1.0 0.0
4 0.8 0.2 0.0 0.0
5 0.0 0.5 0.5 0.0
6 0.0 0.0 0.0 0.0

POS A   G   T   C
1 0.3 0.0 0.2 0.5
2 0.0 0.0 0.0 0.0
3 0.0 0.0 1.0 0.0
4 0.6 0.4 0.0 0.0
5 0.0 0.7 0.3 0.0
6 0.0 1.0 0.0 0.0

Solution

  • As said by Zheyuan Li in the comments, you can use merge() of each df with the POS column of the other one df on the first column(POS). This adds missing rows which are filled with NA so you then have to fill them with zeros :

    df1 <- merge(df1, df2$POS, all = TRUE, by=1)
    df1[is.na(df1)] <- 0
    df1
      POS   A   G   T   C
    1   1 0.2 0.0 0.4 0.4
    2   2 0.0 0.3 0.7 0.0
    3   3 0.0 0.0 1.0 0.0
    4   4 0.8 0.2 0.0 0.0
    5   5 0.0 0.5 0.5 0.0
    6   6 0.0 0.0 0.0 0.0
    
    df2 <- merge(df2, df1$POS, all = TRUE, by=1)
    df2[is.na(df2)] <- 0
    df2
      POS   A   G   T   C
    1   1 0.3 0.0 0.2 0.5
    2   2 0.0 0.0 0.0 0.0
    3   3 0.0 0.0 1.0 0.0
    4   4 0.6 0.4 0.0 0.0
    5   5 0.0 0.7 0.3 0.0
    6   6 0.0 1.0 0.0 0.0