I would like to replace duplicated values per site to NA and to keep the first repeated value that appears from left to right.
For example, on site "Alz-Ettelbruck" the value "7" repeats in columns 4 and 5.That means the column 5 of only that site should return NA. The number 12 repeats in all columns on site "Our-Gemund/Vianden" so I would like to keep the 12 in column 2 but the rest should be replaced to NA.
For that I have used the duplicated function but it returns "NULL".
To reproduce the issue, I have used the following data frame and I have indicated at the end the desired output.
Any help would be truly appreciated. Thank you in advance.
df <- data.frame(stringsAsFactors = FALSE,
check.names = FALSE,
Site = c("Att-Bissen","Alz-Ettelbruck","Our-Gemund/Vianden",
"Syre Felsmuhle/Mertert","Ernz Blanche-Larochette"),
`2001-12-01 to 2021-12-01` = c(12, 1, 12, 1, 8),
`1991-12-01.to 2021-12-01` = c(5, 4, 12, 6, 14),
`1981-12-01 to 2021-12-01` = c(12, 7, 12, 20, 14),
`1971-12-01 to 2021-12-01` = c(19, 7, 12, 13, 14))
# Replace repeated values with NA per row
data <- for (i in 1:nrow(df)) {
df[i, -1][duplicated(df[i, -1])] <- NA
}
The following is what I would like the script to return:
(Edit: the first version of both base-R and dplyr+tidyr code used duplicated
, which would falsely-remove the 12
in row 1 column 4. It has been edited to fix that to not use duplicated
.)
A reduction comparing column to updated-column.
df[,-1] <- Reduce(
function(prev, this) replace(this, is.na(prev) | this == prev, this[NA][1]),
df[,-1], accumulate = TRUE)
df
# Site 2001-12-01 to 2021-12-01 1991-12-01.to 2021-12-01 1981-12-01 to 2021-12-01 1971-12-01 to 2021-12-01
# 1 Att-Bissen 12 5 12 19
# 2 Alz-Ettelbruck 1 4 7 NA
# 3 Our-Gemund/Vianden 12 NA NA NA
# 4 Syre Felsmuhle/Mertert 1 6 20 13
# 5 Ernz Blanche-Larochette 8 14 NA NA
I hard-coded df[,-1]
in both places, it could easily also be df[,2:5]
, it just needs to be the same in both places (LHS of <-
and within the Reduce
).
This loses some efficiency because it double pivots.
library(dplyr)
library(tidyr) # pivot_*
df %>%
pivot_longer(cols = -Site) %>%
arrange(Site, desc(name)) %>%
mutate(.by = "Site", value = if_else(value == lag(value, default=-1L), value[NA], value)) %>%
pivot_wider(id_cols = Site) %>%
slice(match(Site, df$Site)) %>%
select(match(names(.), names(df)))
# # A tibble: 5 × 5
# Site `2001-12-01 to 2021-12-01` `1991-12-01.to 2021-12-01` `1981-12-01 to 2021-12-01` `1971-12-01 to 2021-12-01`
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Att-Bissen 12 5 12 19
# 2 Alz-Ettelbruck 1 4 7 NA
# 3 Syre Felsmuhle/Mertert 1 6 20 13
# 4 Ernz Blanche-Larochette 8 14 NA NA
# 5 Our-Gemund/Vianden 12 NA NA NA
One side-effect of the pivoting is that the order of rows and columns is not guaranteed to be restored, so I added the mostly-aesthetic slice(.) %>% select(.)
to the end to marry up with your input data. (It is not at all required.)
Data
df <- structure(list(Site = c("Att-Bissen", "Alz-Ettelbruck", "Our-Gemund/Vianden", "Syre Felsmuhle/Mertert", "Ernz Blanche-Larochette"), "2001-12-01 to 2021-12-01" = c(12, 1, 12, 1, 8), "1991-12-01.to 2021-12-01" = c(5, 4, 12, 6, 14), "1981-12-01 to 2021-12-01" = c(12, 7, 12, 20, 14), "1971-12-01 to 2021-12-01" = c(19, 7, 12, 13, 14)), class = "data.frame", row.names = c(NA, -5L))