Search code examples
rdataframedplyrtidyversepivot-table

Update records of dataframe where row names equal to column names


Hello I have a large dataframe where I have need to update or fill in the missing values. The condition is that the correct value should be in respect to row name matching to a column name. For example:

df <- data.frame(
  
  ID = c("x", "y", "z"), 
  x = c("1", "0.45", "0.47"),
  y = c("0.45", "1", "0.65"),
  z = c("XXXX", "XXXX", "1")
)

  ID    x    y    z
1  x    1 0.45 XXXX
2  y 0.45    1 XXXX
3  z 0.47 0.65    1

Above, the correct values for xxxx should be 0.47 and 0.65, respectively. Because x col z row has 0.47. Then y col z row is 0.65.The dataframe then looks like this:

  ID    x    y    z
1  x    1 0.45 0.47
2  y 0.45    1 0.65
3  z 0.47 0.65    1

This way all the elements row wise or column wise would correspond to matching rowname and column name.

I have looked into these posts but I couldn't use them to advance my solutions:

  1. Match Column to Column Names, add value to row/column of Matches

  2. data frame set value based on matching specific row name to column name

  3. Change values of matrix where row names equal column names


Solution

  • We could do this in base R

    i1 <- upper.tri(df[-1])
    i2 <- lower.tri(df[-1])
     df[-1][i1] <- df[-1][i2]
    

    -output

    > df
      ID    x    y    z
    1  x    1 0.45 0.47
    2  y 0.45    1 0.65
    3  z 0.47 0.65    1
    

    Or use coalesce

    library(dplyr)
    df[-1][upper.tri(df[-1])] <- coalesce(na_if(df[-1][upper.tri(df[-1])], 
            "XXXX"), df[-1][lower.tri(df[-1])])
    

    Or may convert to 'long' format and then reshape back to wide format after filling

    library(dplyr)
    library(tidyr)
    library(stringr)
    df %>% 
      mutate(across(everything(),  ~  na_if(.x, "XXXX"))) %>% 
      pivot_longer(cols = -ID) %>%
      group_by(ID2 = str_c(pmin(ID, name), pmax(ID, name))) %>% 
      fill(value, .direction = "downup") %>% 
      ungroup %>% 
      select(-ID2) %>% 
      pivot_wider(names_from = name, values_from = value)
    

    -output

    # A tibble: 3 × 4
      ID    x     y     z    
      <chr> <chr> <chr> <chr>
    1 x     1     0.45  0.47 
    2 y     0.45  1     0.65 
    3 z     0.47  0.65  1