Search code examples
rdplyrtidyr

Rearrange and align columns based on matching values in data frame using R


I have a data frame in R, and I want to rearrange the columns based on matching values within each column. However, in many cases, the columns don't match in such plaec in empty row. Here's an example of the initial data frame:

And here's the data frame:

ID Col1 Col2 Col3
A    X    X     
B    Y    Q    Q
B              Y
C    Z         Z
D    P    P    Q
E    R 
F    R    S         

and this is expected output

ID Col1 Col2 Col3
A    X    X     
B    Y         Y
B         Q    Q
C    Z         Z
D    P    P      
D              Q        
E    R 
F    R     
F         S         

I have aceive this using loop, but i want to do same job using either tider, dplyer, tidyverse, mutate, pivot_longer and pivot_wider, I tried but couldnt.

Could someone provide guidance or code examples on how to reorder and align columns based on matching values within rows while leaving empty cells for non-matching columns in R? Your assistance is greatly appreciated!


Solution

  • library(dplyr)
    library(tidyr) # pivot_*
    quux %>%
      pivot_longer(cols = -ID) %>%
      filter(nzchar(value)) %>%
      mutate(val2 = value) %>%
      pivot_wider(id_cols = c(ID, value), names_from = name, values_from = val2, values_fill = "") %>%
      select(-value)
    # # A tibble: 9 × 4
    #   ID    Col1  Col2  Col3 
    #   <chr> <chr> <chr> <chr>
    # 1 A     "X"   "X"   ""   
    # 2 B     "Y"   ""    "Y"  
    # 3 B     ""    "Q"   "Q"  
    # 4 C     "Z"   ""    "Z"  
    # 5 D     "P"   "P"   ""   
    # 6 D     ""    ""    "Q"  
    # 7 E     "R"   ""    ""   
    # 8 F     "R"   ""    ""   
    # 9 F     ""    "S"   ""   
    

    Data

    quux <- structure(list(ID = c("A", "B", "B", "C", "D", "E", "F"), Col1 = c("X", "Y", "", "Z", "P", "R", "R"), Col2 = c("X", "Q", "", "", "P", "", "S"), Col3 = c("", "Q", "Y", "Z", "Q", "", "")), row.names = c(NA, -7L), class = "data.frame")
    

    After a lengthy conversation, the expected behavior changes slightly: namely, when in the presence of a repeat string (within on Gene.ID and one test): repeats should be on repeated rows.

    The basic flow of this and my understanding of a working modification to the above:

    data <- structure(list(Gene.ID = c("g4260.t1", "g429.t1", "g429.t1", "g4300.t1", "g4300.t1", "g4303.t1", "g4303.t1", "g4303.t1"), HMMER = c("GH16", "GH3", "", "GH71", "", "", "GH27", "GH13"), eCAMI = c("", "GH13", "GH14", "CBM24", "CBM24", "GH27", "CBM13", "GH27"), DIAMOND = c("GH16", "", "", "GH71", "", "GH27", "CBM13", "GH13")), class = "data.frame", row.names = c(NA, -8L))
    expected_data <- structure(list(Gene.ID = c("g4260.t1", "g429.t1", "g429.t1", "g429.t1", "g4300.t1", "g4300.t1", "g4300.t1", "g4303.t1", "g4303.t1", "g4303.t1", "g4303.t1"), HMMER = c("GH16", "GH3", "", "", "GH71", "", "", "GH27", "GH13", "", ""), eCAMI = c("", "", "GH13", "GH14", "", "CBM24", "CBM24", "GH27", "", "CBM13", "GH27"), DIAMOND = c("GH16", "", "", "", "GH71", "", "", "GH27", "GH13", "CBM13", "")), class = "data.frame", row.names = c(NA, -11L))
    

    I think the output should be:

     pivot_longer(data, cols = -Gene.ID) %>%
      filter(nzchar(value)) %>%
      mutate(val2 = value, rn = row_number(), .by = c(Gene.ID, name, value)) %>%
      pivot_wider(id_cols = c(Gene.ID, value, rn), names_from = name, values_from = val2, values_fill = "") %>%
      select(-value, -rn)
    # # A tibble: 11 × 4
    #    Gene.ID  HMMER  DIAMOND eCAMI  
    #    <chr>    <chr>  <chr>   <chr>  
    #  1 g4260.t1 "GH16" "GH16"  ""     
    #  2 g429.t1  "GH3"  ""      ""     
    #  3 g429.t1  ""     ""      "GH13" 
    #  4 g429.t1  ""     ""      "GH14" 
    #  5 g4300.t1 "GH71" "GH71"  ""     
    #  6 g4300.t1 ""     ""      "CBM24"
    #  7 g4300.t1 ""     ""      "CBM24"
    #  8 g4303.t1 "GH27" "GH27"  "GH27" 
    #  9 g4303.t1 ""     "CBM13" "CBM13"
    # 10 g4303.t1 "GH13" "GH13"  ""     
    # 11 g4303.t1 ""     ""      "GH27"