I'm looking to standardise a set of manually inputted strings, so that:
index fruit
1 Apple Pie
2 Apple Pie.
3 Apple. Pie
4 Apple Pie
5 Pear
should look like:
index fruit
1 Apple Pie
2 Apple Pie
3 Apple Pie
4 Apple Pie
5 Pear
For my use case, grouping the strings by phonetic sound is fine, but I'm missing the step that replaces the less common strings in each group with the most common one.
library(tidyverse)
library(stringdist)
index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")
df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)
Sounds like you need to group_by the grouping column, then replace each value with the most frequent item (the mode) in its group:
df %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit))))
# A tibble: 5 x 3
# Groups: grouping [2]
  index fruit     grouping
  <dbl> <fctr>    <chr>
1     1 Apple Pie A141
2     2 Apple Pie A141
3     3 Apple Pie A141
4     4 Apple Pie A141
5     5 Pear      P600
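For completeness, here is a minimal self-contained sketch of the same idea that also drops the helper column at the end, so the result has the shape asked for in the question. The helper name most_common() is my own (hypothetical, not from any package), and I'm assuming the fruit column is stored as character rather than factor, hence stringsAsFactors = FALSE. One thing to be aware of: when two spellings are tied within a group, which.max() returns the first maximum and table() sorts its names, so the spelling that sorts first wins.

```r
library(dplyr)
library(stringdist)

# Hypothetical helper (not part of any package): returns the most frequent
# value in a character vector. On ties, the value that sorts first is
# returned, because table() orders its names and which.max() picks the
# first maximum.
most_common <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Assumption: keep fruit as character, not factor.
df <- data.frame(
  index = 1:5,
  fruit = c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(grouping = phonetic(fruit)) %>%  # Soundex code by default
  group_by(grouping) %>%
  mutate(fruit = most_common(fruit)) %>%  # overwrite with the group's mode
  ungroup() %>%
  select(index, fruit)                    # drop the helper column

print(result)
```

If ties matter for your data, you would probably want to replace most_common() with something that breaks them explicitly, for example by preferring the shortest spelling.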