I have a dataframe in which I want to drop duplicated string entries in one column, unless the entry is short (str_length < 7), in which case all of its duplicates should be kept. I also want to keep all the other columns.
So I have
string | other columns |
---|---|
"abc" | |
"abc" | |
"centauri" | |
"centauri" | |
"armageddon" | |
"armageddon" | |
"spaghetti" | |
Desired output:
string | other columns |
---|---|
"abc" | |
"abc" | |
"centauri" | |
"armageddon" | |
"spaghetti" | |
I have tried a variety of dplyr approaches, but none of them gives the desired output. For example:
df <- df %>%
mutate(len = str_length(string))%>%
group_by(string, len) %>%
filter(len >7) %>%
distinct(.keep_all = TRUE)
In this example, the rows removed by filter are not returned. What I want is to protect the short rows from the distinct function and still have them in the resulting dataframe.
We can use duplicated with nchar:
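# drop a row only if it repeats an earlier string AND that string is not short;
# >= 7 matches the question's rule that "short" means fewer than 7 characters
df1[!(duplicated(df1$string) & nchar(df1$string) >= 7), , drop = FALSE]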
Output:
# string
#1 abc
#2 abc
#3 centauri
#5 armageddon
#7 spaghetti
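To see why the condition keeps the right rows, it may help to look at the two logical vectors separately. The sketch below is only an illustration: the column names `is_dup`, `is_long` and `keep` are made up here, and the cutoff of 7 follows the question's definition of a short string (fewer than 7 characters).

```r
string <- c("abc", "abc", "centauri", "centauri",
            "armageddon", "armageddon", "spaghetti")

# duplicated() is FALSE for the first occurrence of a value and TRUE for
# every later occurrence
is_dup <- duplicated(string)

# strings with 7 or more characters are the ones we want to de-duplicate
is_long <- nchar(string) >= 7

# a row is dropped only when both conditions hold, so repeated short
# strings such as "abc" survive
keep <- !(is_dup & is_long)

data.frame(string, is_dup, is_long, keep)
```

Rows 4 and 6 (the second "centauri" and the second "armageddon") are the only ones where both `is_dup` and `is_long` are TRUE, so they are the only rows removed.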
Or with filter in dplyr:
library(dplyr)
df1 %>%
  # >= 7 so that only strings shorter than 7 characters keep their duplicates
  filter(!(duplicated(string) & nchar(string) >= 7))
Data:
df1 <- structure(list(string = c("abc", "abc", "centauri", "centauri",
"armageddon", "armageddon", "spaghetti")), class = "data.frame",
row.names = c(NA, -7L))
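If the same rule is needed in several places, it could be wrapped in a small helper. This is a minimal base R sketch, not part of any package: the function name `dedupe_long`, its arguments `col` and `min_len`, and the extra `id` column (standing in for the "other columns" in the question) are all assumptions made for illustration.

```r
# Hypothetical helper: remove repeated values in column `col`, but only for
# values with at least `min_len` characters. Shorter values keep all of
# their duplicates, and every other column is carried along unchanged.
dedupe_long <- function(df, col, min_len = 7) {
  values <- as.character(df[[col]])
  # the !is.na() term means rows with a missing value are never dropped
  drop <- duplicated(values) & !is.na(values) & nchar(values) >= min_len
  df[!drop, , drop = FALSE]
}

df1 <- data.frame(
  string = c("abc", "abc", "centauri", "centauri",
             "armageddon", "armageddon", "spaghetti"),
  id = 1:7  # stand-in for the other columns
)

result <- dedupe_long(df1, "string")
result
```

Because the rows are selected by logical indexing rather than by filtering first and calling distinct afterwards, the short rows never leave the dataframe, so there is nothing to bind back in.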