R: Replacing Strings with their Most Common Variant


I'm looking to standardise a set of manually inputted strings, so that:

index   fruit
1   Apple Pie
2   Apple Pie.
3   Apple. Pie
4   Apple Pie
5   Pear

should look like:

index   fruit
1   Apple Pie
2   Apple Pie
3   Apple Pie
4   Apple Pie
5   Pear

For my use case, grouping the strings by phonetic code is fine, but I'm missing the step that replaces the less common variants in each group with the most common one.

library(tidyverse)  
library(stringdist)

index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)

Solution

  • Sounds like you need to group_by the grouping column, then replace every value with the most frequent item (the mode) of its group:

    df %>%
      mutate(grouping = phonetic(fruit)) %>%
      group_by(grouping) %>%
      # table() counts each variant; which.max() picks the most frequent one
      mutate(fruit = names(which.max(table(fruit))))
    
    # A tibble: 5 x 3
    # Groups:   grouping [2]
      index     fruit grouping
      <dbl>    <fctr>    <chr>
    1     1 Apple Pie     A141
    2     2 Apple Pie     A141
    3     3 Apple Pie     A141
    4     4 Apple Pie     A141
    5     5      Pear     P600
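
One thing to watch with names(which.max(table(fruit))): table() sorts its entries, and which.max() returns the first maximum, so a tie inside a group goes to whichever variant sorts first. If you would rather keep the variant that appears first in the data, a small helper can do that. The following is only a base R sketch: most_common is a made-up name, and the grouping codes are copied by hand from the output above rather than recomputed with phonetic(), so it runs without stringdist.

```r
# Same fruit vector as in the question.
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

# Assumption: codes copied from the output above instead of calling phonetic().
grouping <- c("A141", "A141", "A141", "A141", "P600")

# Hypothetical helper: most frequent value of a vector.
# Ties go to the value that appears first in x.
most_common <- function(x) {
  ux <- unique(x)                         # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))        # how often each distinct value occurs
  ux[which.max(counts)]                   # which.max() returns the first maximum
}

# One mode per group, as a character vector named by group code.
modes <- vapply(split(fruit, grouping), most_common, character(1))

# Look up each row's group to get its standardised value.
standardised <- unname(modes[grouping])
print(standardised)
# [1] "Apple Pie" "Apple Pie" "Apple Pie" "Apple Pie" "Pear"
```

In the dplyr pipeline the same helper should drop in as mutate(fruit = most_common(fruit)) after group_by(grouping), assuming fruit is a character column.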