Search code examples
rdataframerank

Replace NA values in a data frame with the mean values of a variable depending on the rank of another variable


I have a data frame df_1 to which I added a column 'rank' for each column using the rank function:

df_1 <- data.frame(A = c(5, 6.5, 5 , 7), B = c(0.1, 0.9, 0.7, 0.01), C = c(3.5, 3.5, 1, 2), D = c(0.1, 0.5, 0.7, 0.9))
df_1$A.rank <- rank(df_1$A, ties.method = "average")
df_1$B.rank <- rank(-df_1$B, ties.method = "average")
df_1$C.rank <- rank(df_1$C, ties.method = "average")
df_1$D.rank <- rank(-df_1$D, ties.method = "average")
print(df_1)

    A    B   C   D A.rank B.rank C.rank D.rank
1 5.0 0.10 3.5 0.1    1.5      3    3.5      4
2 6.5 0.90 3.5 0.5    3.0      1    3.5      3
3 5.0 0.70 1.0 0.7    1.5      2    1.0      2
4 7.0 0.01 2.0 0.9    4.0      4    2.0      1

I have another data frame df_2 containing scores associated with each rank:

df_2 <- data.frame(score_rank = c(1, 2, 3, 4), score = c(0.37, 0.25, 0.2, 0.18))
print(df_2)

  score_rank score
1          1  0.37
2          2  0.25
3          3  0.20
4          4  0.18

I replaced the values in the 'rank' columns of df_1 with the values in the 'score' column of df_2 based on the 'score_rank' column of df_2.

new_df_1 <- df_1[, c("A.rank", "B.rank", "C.rank", "D.rank")]
new_df_1[] <- df_2$score[match(unlist(new_df_1), df_2$score_rank)]
print(new_df_1)

  A.rank B.rank C.rank D.rank
1     NA   0.20     NA   0.18
2   0.20   0.37     NA   0.20
3     NA   0.25   0.37   0.25
4   0.18   0.18   0.25   0.37

Question: how to automatically replace the NA values in df_1 to achieve this result:

  A.rank B.rank C.rank D.rank
1   0.31   0.20   0.19   0.18
2   0.20   0.37   0.19   0.20
3   0.31   0.25   0.37   0.25
4   0.18   0.18   0.25   0.37

NA in row 1 of the column 'A.rank': mean of values in the 'score' column in 'df_2' for the ranks 1 and 2 (i.e., mean(c(0.37, 0.25))) because the value 1.5 calculated by the rank function is the average of the ranks 1 and 2.

NA in row 3 of the column 'A.rank': same value as the NA in row 1 of the column 'A.rank'.

NA in row 1 of the column 'C.rank': mean of values in the 'score' column in 'df_2' for the ranks 3 and 4 (i.e., mean(c(0.2, 0.18))) because the value 3.5 calculated by the 'rank' function is the average of the ranks 3 and 4.

NA in row 2 of the column 'C.rank': same value as the NA in row 1 of the column 'C.rank'

Update : Here is another example :

df_1 <- data.frame(A = c(5, 5, 5 , 5), B = c(0.1, 0.9, 0.7, 0.01), C = c(3.5, 3.5, 1, 2), D = c(0.1, 0.5, 0.7, 0.9))

*## Here is df_1*

    A    B   C   D A.rank B.rank C.rank D.rank
1 5 0.10 3.5 0.1    2.5      3    3.5      4
2 5 0.90 3.5 0.5    2.5      1    3.5      3
3 5 0.70 1.0 0.7    2.5      2    1.0      2
4 5 0.01 2.0 0.9    2.5      4    2.0      1


## Here is df_2:

  score_rank score
1          1  0.37
2          2  0.25
3          3  0.20
4          4  0.18


## Here is new_df_1

  A.rank B.rank C.rank D.rank
1     NA   0.20     NA   0.18
2     NA   0.37     NA   0.20
3     NA   0.25   0.37   0.25
4     NA   0.18   0.25   0.37


Results:

A.rank B.rank C.rank D.rank
1     2.5   0.20     0.19   0.18
2     2.5   0.37     0.19   0.20
3     2.5   0.25   0.37   0.25
4     2.5   0.18   0.25   0.37


1. NA in (row = 1, column = 1): mean(c(df_2[1:4, c("score")])) because 2.5 = mean(c(1, 2, 3, 4)) using the rank function
2. NA in (row = 2, column = 1): like 1)
3. NA in (row = 3, column = 1): like 1)
4. NA in (row = 4, column = 1): like 1)
5. NA in (row = 1, column = 3): mean(c(df_2[3:4, c("score")])) because 3.5 = mean(c(3, 4)) using the rank function
6. NA in (row = 2, column = 3): like 5)

Solution

  • An approach using approx

    EDIT: Use method="constant" in your case

    uL <- unique(unlist(df_1[,5:8]))
    appx <- approx(df_2$score, n=length(seq(min(uL), max(uL), .5)), 
      method="constant")
    
    data.frame(sapply(df_1[,5:8], \(x) appx$y[match(x, appx$x)]))
      A.rank B.rank C.rank D.rank
    1   0.31   0.20   0.19   0.18
    2   0.20   0.37   0.19   0.20
    3   0.31   0.25   0.37   0.25
    4   0.18   0.18   0.25   0.37
    

    New data

    df_1 <- structure(list(A = c(5L, 5L, 5L, 5L), B = c(0.1, 0.9, 0.7, 0.01
    ), C = c(3.5, 3.5, 1, 2), D = c(0.1, 0.5, 0.7, 0.9), A.rank = c(2.5, 
    2.5, 2.5, 2.5), B.rank = c(3L, 1L, 2L, 4L), C.rank = c(3.5, 3.5, 
    1, 2), D.rank = 4:1), class = "data.frame", row.names = c("1", 
    "2", "3", "4"))
    
    uL <- unique(unlist(df_1[,5:8]))
    appx <- approx(df_2$score, n=length(seq(min(uL), max(uL), .5)), 
      method="constant")
    
    data.frame(sapply(df_1[,5:8], \(x) appx$y[match(x, appx$x)]))
      A.rank B.rank C.rank D.rank
    1   0.25   0.20   0.20   0.18
    2   0.25   0.37   0.20   0.20
    3   0.25   0.25   0.37   0.25
    4   0.25   0.18   0.25   0.37