Tags: r, split, duplicates, data-cleaning, camelcasing

A more elegant way to remove duplicated names (phrases) in the elements of a character string



I have a vector of organization names in a data frame. Some of them are fine, but others have the name repeated within the same element. When the name is repeated, there is no separating space, so the string looks like camelCase.

For example (id column added for general dataframe referencing):
id org
1 Alpha Company
2 Bravo InstituteBravo Institute
3 Charlie Group
4 Delta IncorporatedDelta Incorporated

but it should look like:
id org
1 Alpha Company
2 Bravo Institute
3 Charlie Group
4 Delta Incorporated

I have a solution that gets the result I need (reproducible example code below). However, it seems lengthy and not very elegant.

Does anyone have a better approach for the same results?

Bonus question: If organizations have 'types' included, such as Alpha Company, LLC, then my gsub() line to fix the camelCase does not work as well. Any suggestions on how to adjust the camelCase fix to account for the ", LLC" and still work with the rest of the solution?

Thanks in advance! (Thanks to the OP & those who helped on the previous SO post about splitting camelCase strings in R)

# packages
library(stringr)
# toy data
df <- data.frame(id=1:4, org=c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
# split up & clean camelCase words
df$org_fix <- gsub("([A-Z])", " \\1", df$org)
df$org_fix <- str_squish(df$org_fix) # str_squish also trims leading/trailing whitespace
# temp vector with half the org names
df$org_half <- word(df$org_fix, start = 1, end = lengths(strsplit(df$org_fix, " ")) %/% 2) # stringr::word
# double the temp vector
df$org_dbl <- paste(df$org_half, df$org_half)
# flag TRUE for orgs that contain duplicates in name
df$org_dup <- df$org_fix == df$org_dbl
# correct the org names
df$org_fix <- ifelse(df$org_dup, df$org_half, df$org_fix)
# drop excess columns
df <- df[,c("id", "org_fix")]

# toy data for the bonus question
df2 <- data.frame(id=1:4, org=c("Alpha Company, LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))

Solution

  • Another approach is to compare the first half of each string with the second half and, if the two are equal, keep only the first half. Because it does not rely on capitalization, it also works when the company name contains digits, underscores, or any other characters.

    org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M")
    
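    half   <- nchar(org) %/% 2                     # length of the first half
    first  <- substring(org, 1, half)              # first half of each name
    second <- substring(org, half + 1, nchar(org)) # remainder of each name
    ifelse(first == second, first, org)            # keep one copy if the halves match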
    
    # [1] "Alpha Company" "Bravo Institute" "Charlie Group" "Delta Incorporated" "WD40" "3M"
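
For the bonus question, here is a minimal sketch that applies the same half-comparison to the `df2` toy data from the question. It skips the camelCase split entirely, so a suffix such as ", LLC" should not get in the way. The helper name `dedupe_half` and the columns `org_fix` / `org_fix_re` are made up for illustration. The regex line is an alternative technique (a backreference), not part of the answer above, and both versions assume the duplicate is an exact back-to-back repeat of the whole string.

```r
# Hypothetical helper (the name is an assumption): if a string is an exact
# back-to-back repeat of its first half, return only that half.
dedupe_half <- function(x) {
  half   <- nchar(x) %/% 2
  first  <- substring(x, 1, half)
  second <- substring(x, half + 1, nchar(x))
  ifelse(first == second, first, x)
}

# Toy data for the bonus question, copied from the question
df2 <- data.frame(
  id  = 1:4,
  org = c("Alpha Company, LLC", "Bravo InstituteBravo Institute",
          "Charlie Group", "Delta IncorporatedDelta Incorporated"),
  stringsAsFactors = FALSE
)

df2$org_fix <- dedupe_half(df2$org)

# Alternative technique: a regex backreference. ^(.+)\1$ matches only when
# the whole string is some text immediately followed by the same text.
df2$org_fix_re <- sub("^(.+)\\1$", "\\1", df2$org, perl = TRUE)

df2[, c("id", "org_fix")]
```

One caveat: both versions will also shorten a legitimate name that happens to be an exact repeat of itself, so it may be worth checking the changed rows before overwriting the original column.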