I have a data frame in R, and I want to rearrange the columns based on matching values within each column. However, in many cases, the columns don't match in such plaec in empty row. Here's an example of the initial data frame:
And here's the data frame:
ID Col1 Col2 Col3
A X X
B Y Q Q
B Y
C Z Z
D P P Q
E R
F R S
and this is expected output
ID Col1 Col2 Col3
A X X
B Y Y
B Q Q
C Z Z
D P P
D Q
E R
F R
F S
I have aceive this using loop, but i want to do same job using either tider, dplyer, tidyverse, mutate, pivot_longer and pivot_wider, I tried but couldnt.
Could someone provide guidance or code examples on how to reorder and align columns based on matching values within rows while leaving empty cells for non-matching columns in R? Your assistance is greatly appreciated!
library(dplyr)
library(tidyr) # pivot_*
quux %>%
pivot_longer(cols = -ID) %>%
filter(nzchar(value)) %>%
mutate(val2 = value) %>%
pivot_wider(id_cols = c(ID, value), names_from = name, values_from = val2, values_fill = "") %>%
select(-value)
# # A tibble: 9 × 4
# ID Col1 Col2 Col3
# <chr> <chr> <chr> <chr>
# 1 A "X" "X" ""
# 2 B "Y" "" "Y"
# 3 B "" "Q" "Q"
# 4 C "Z" "" "Z"
# 5 D "P" "P" ""
# 6 D "" "" "Q"
# 7 E "R" "" ""
# 8 F "R" "" ""
# 9 F "" "S" ""
Data
quux <- structure(list(ID = c("A", "B", "B", "C", "D", "E", "F"), Col1 = c("X", "Y", "", "Z", "P", "R", "R"), Col2 = c("X", "Q", "", "", "P", "", "S"), Col3 = c("", "Q", "Y", "Z", "Q", "", "")), row.names = c(NA, -7L), class = "data.frame")
After a lengthy conversation, the expected behavior changes slightly: namely, when in the presence of a repeat string (within on Gene.ID
and one test): repeats should be on repeated rows.
The basic flow of this and my understanding of a working modification to the above:
data <- structure(list(Gene.ID = c("g4260.t1", "g429.t1", "g429.t1", "g4300.t1", "g4300.t1", "g4303.t1", "g4303.t1", "g4303.t1"), HMMER = c("GH16", "GH3", "", "GH71", "", "", "GH27", "GH13"), eCAMI = c("", "GH13", "GH14", "CBM24", "CBM24", "GH27", "CBM13", "GH27"), DIAMOND = c("GH16", "", "", "GH71", "", "GH27", "CBM13", "GH13")), class = "data.frame", row.names = c(NA, -8L))
expected_data <- structure(list(Gene.ID = c("g4260.t1", "g429.t1", "g429.t1", "g429.t1", "g4300.t1", "g4300.t1", "g4300.t1", "g4303.t1", "g4303.t1", "g4303.t1", "g4303.t1"), HMMER = c("GH16", "GH3", "", "", "GH71", "", "", "GH27", "GH13", "", ""), eCAMI = c("", "", "GH13", "GH14", "", "CBM24", "CBM24", "GH27", "", "CBM13", "GH27"), DIAMOND = c("GH16", "", "", "", "GH71", "", "", "GH27", "GH13", "CBM13", "")), class = "data.frame", row.names = c(NA, -11L))
I think the output should be:
pivot_longer(data, cols = -Gene.ID) %>%
filter(nzchar(value)) %>%
mutate(val2 = value, rn = row_number(), .by = c(Gene.ID, name, value)) %>%
pivot_wider(id_cols = c(Gene.ID, value, rn), names_from = name, values_from = val2, values_fill = "") %>%
select(-value, -rn)
# # A tibble: 11 × 4
# Gene.ID HMMER DIAMOND eCAMI
# <chr> <chr> <chr> <chr>
# 1 g4260.t1 "GH16" "GH16" ""
# 2 g429.t1 "GH3" "" ""
# 3 g429.t1 "" "" "GH13"
# 4 g429.t1 "" "" "GH14"
# 5 g4300.t1 "GH71" "GH71" ""
# 6 g4300.t1 "" "" "CBM24"
# 7 g4300.t1 "" "" "CBM24"
# 8 g4303.t1 "GH27" "GH27" "GH27"
# 9 g4303.t1 "" "CBM13" "CBM13"
# 10 g4303.t1 "GH13" "GH13" ""
# 11 g4303.t1 "" "" "GH27"