Tags: r, rank, imputation

Impute missing values in partial rank data?


I have some rank data with missing values. The highest ranked item was assigned a value of '1'. 'NA' values occur when the item was not ranked.

# sample data
df <- data.frame(Item1 = c(1,2, NA, 2, 3), Item2 = c(3,1,NA, NA, 1), Item3 = c(2,NA, 1, 1, 2))

> df
  Item1 Item2 Item3
1     1     3     2
2     2     1    NA
3    NA    NA     1
4     2    NA     1
5     3     1     2

I would like to randomly impute the 'NA' values in each row with the appropriate unranked values. One solution that would meet my goal would be this:

> solution1
  Item1 Item2 Item3
1     1     3     2
2     2     1     3
3     3     2     1
4     2     3     1
5     3     1     2

This code gives a list of possible replacement values for each row.

# set max possible rank in data
max_val <- 3 

# calculate row max
df$row_max <- apply(df, 1, max, na.rm = TRUE)

# calculate number of missing values in each row (assumes the ranked items use ranks 1..row_max)
df$num_na <- max_val - df$row_max 

# set a sample vector
samp_vec <- 1:max_val

# set an empty list
replacements <- vector(mode = "list", length = nrow(df))
 
# generate a list of replacements for each row
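for(i in seq_len(nrow(df))){
  
  if(df$num_na[i] > 0){
    # unranked values are the ranks above the row max
    cand <- samp_vec[samp_vec > df$row_max[i]]
    # shuffle by index: sample(cand, ...) would draw from 1:cand when cand has length 1
    replacements[[i]] <- cand[sample.int(length(cand))]
  }
  # rows with no NAs keep their initial NULL entry; assigning NULL with [[<-
  # would delete the element and shift the later rows
  
}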

I am now puzzling over how to assign the values in this list to the missing values in each row of my data.frame. (My actual data has thousands of rows.)

Is there a clean way to do this?


Solution

  • A base R option using apply -

    set.seed(123)
    
    df[] <- t(apply(df, 1, function(x) {
      #Get the ranks which are not yet used in the row
      val <- setdiff(seq_along(x), x)
      #If exactly 1 rank is unused assign it directly, since sample(val) would draw from 1:val
      if(length(val) == 1) x[is.na(x)] <- val
      #If more than 1 rank is unused assign them in random order
      else if(length(val) > 1) x[is.na(x)] <- sample(val)
      #If nothing is missing return the row as it is
      x
    }))
    df
    
    #  Item1 Item2 Item3
    #1     1     3     2
    #2     2     1     3
    #3     2     3     1
    #4     2     3     1
    #5     3     1     2
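
As a variation on the answer above, the per-row logic can be pulled into a small helper. This is only a sketch: `fill_ranks` and its `max_val` argument are hypothetical names, not part of the original answer. It assumes each row holds distinct ranks from 1 to `max_val`, with `max_val` defaulting to the number of items. The unused ranks are shuffled by index with `sample.int`, so the one-value case needs no special branch.

```r
# Sample data from the question
df <- data.frame(Item1 = c(1, 2, NA, 2, 3),
                 Item2 = c(3, 1, NA, NA, 1),
                 Item3 = c(2, NA, 1, 1, 2))

# Hypothetical helper: fill the NAs of one rank vector with its unused ranks.
# Assumes the observed ranks are distinct values in 1..max_val.
fill_ranks <- function(x, max_val = length(x)) {
  unused <- setdiff(seq_len(max_val), x)
  # Guard against rows that break the assumption (e.g. duplicated ranks)
  stopifnot(length(unused) == sum(is.na(x)))
  # Shuffle by index so a single unused rank is never treated as 1:n by sample()
  x[is.na(x)] <- unused[sample.int(length(unused))]
  x
}

set.seed(123)
filled <- df
# apply() returns one column per input row, so transpose before assigning back
filled[] <- t(apply(df, 1, fill_ranks))
filled
```

Each row of `filled` should then be a complete permutation of 1 to 3, with the originally observed ranks left unchanged. The `stopifnot` check is a design choice to fail loudly on malformed rows rather than silently recycle values.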