I have a large data frame with repeated variables. This is just a sample of my data to illustrate the question:
df <- data.frame(
  ID = rep(1:4, each = 1),
  CMW = rep(c(10, 20, 30, 30), each = 1),
  D_D = c(rep(100, 3), 200),
  D_D = c(rep(100, 3), 200),
  D_D = c(rep(100, 1), 200),  # length 2, recycled to 100, 200, 100, 200
  Eref = rep(4:4, each = 1),
  Eref = rep(4:4, each = 1),
  Eref = rep(1:4, each = 1),
  Eref = rep(1:4, each = 1)
)
ID CMW D_D D_D.1 D_D.2 Eref Eref.1 Eref.2 Eref.3
 1  10 100   100   100    4      4      1      1
 2  20 100   100   200    4      4      2      2
 3  30 100   100   100    4      4      3      3
 4  30 200   200   200    4      4      4      4
R appends numbers to the variable names to make them unique, but variables that share the same "root name" (the string before the dot) are actually the same variable. For each repeated variable, I want to compare the values in its columns: if the columns are identical, keep only one of them. If the columns fall into two or more distinct sets of identical columns, keep one column from each set. I want to do this for every repeated variable in my data frame. For example, from the sample data frame above (df) I want the following result:
ID CMW D_D D_D.1 Eref Eref.1
 1  10 100   100    4      1
 2  20 100   200    4      2
 3  30 100   100    4      3
 4  30 200   200    4      4
So far I was able to check if there are repeated variables in my data frame with this code:
duplicated_col <- unique(sub("\\.\\d+$", "", names(df))[duplicated(sub("\\.\\d+$", "", names(df)))])
But I am not sure how to compare the repeated variables and drop or keep columns to obtain the result shown above. Any help is very welcome. Thank you!
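# Group columns by root name (the dot is escaped so only a literal ".<digits>" suffix is stripped),
# then keep the unique columns within each group
split.default(df, sub("\\.\\d+$", "", names(df))) |>
  lapply(\(x) unique(as.matrix(unname(x)), MARGIN = 2)) |>
  data.frame()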
  CMW D_D.1 D_D.2 Eref.1 Eref.2 ID
1  10   100   100      4      1  1
2  20   100   200      4      2  2
3  30   100   100      4      3  3
4  30   200   200      4      4  4
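To make it easier to see what each stage of the pipeline above does, here is the same idea broken into named steps. This is only a sketch: the intermediate names (`root`, `groups`, `deduped`, `result`) are mine, and the sample data is written out explicitly so the block runs on its own.

```r
# Sample data from the question; data.frame() makes the repeated names unique
# (D_D, D_D.1, D_D.2, Eref, Eref.1, ...)
df <- data.frame(
  ID = 1:4,
  CMW = c(10, 20, 30, 30),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 200, 100, 200),
  Eref = 4,
  Eref = 4,
  Eref = 1:4,
  Eref = 1:4
)

# Step 1: strip the ".<digits>" suffix to recover each column's root name
root <- sub("\\.\\d+$", "", names(df))

# Step 2: split the columns (not the rows) into one data frame per root name;
# the groups come back sorted alphabetically by root name
groups <- split.default(df, root)

# Step 3: within each group, drop columns whose values repeat an earlier column
deduped <- lapply(groups, function(g) unique(as.matrix(unname(g)), MARGIN = 2))

# Step 4: bind the groups back together; groups with several surviving
# columns get fresh numeric suffixes (.1, .2, ...)
result <- data.frame(deduped)
print(result)
```

Because `split.default()` sorts the groups by name, the columns come out in alphabetical order of their root names rather than in their original order.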
If you want to maintain the order of appearance, add another step to the pipe:
# Reorder the columns of x by where their names first appear in d
fn <- function(x, d) x[order(match(names(x), names(d)))]
split.default(df, sub("\\.\\d+$", "", names(df))) |>
  lapply(\(x) unique(as.matrix(unname(x)), MARGIN = 2)) |>
  data.frame() |>
  fn(df)
  ID CMW D_D.1 D_D.2 Eref.1 Eref.2
1  1  10   100   100      4      1
2  2  20   100   200      4      2
3  3  30   100   100      4      3
4  4  30   200   200      4      4
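If you would rather keep the original column order and the original names without a reordering step, one possible alternative (not the approach above, just a sketch) is to build a text key per column from its root name and its values, then drop columns whose key has already been seen. The names `root`, `key` and `result` are mine. Since the key is built with `paste()`, it assumes values that are equal also print identically, which may not hold for doubles that differ only beyond about 15 significant digits.

```r
# Sample data from the question, written out so the block runs on its own
df <- data.frame(
  ID = 1:4,
  CMW = c(10, 20, 30, 30),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 200, 100, 200),
  Eref = 4,
  Eref = 4,
  Eref = 1:4,
  Eref = 1:4
)

# Root name of each column (the part before a trailing ".<digits>")
root <- sub("\\.\\d+$", "", names(df))

# One key per column: root name plus all values collapsed into a single string,
# so only columns that share a root name AND have the same values can match
key <- paste(root, vapply(df, paste, character(1), collapse = "\r"))

# Keep the first column for each distinct key; order and names are untouched
result <- df[!duplicated(key)]
print(result)
```

With the sample data this keeps ID, CMW, D_D, D_D.2, Eref and Eref.2, so the surviving columns keep the names they had in df instead of being renumbered.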