Replace duplicates in matrix


I have the following test code:

####TESTING HERE
test = tibble::tribble(
                          ~Name1,           ~Name2,          ~Name3,
                   "Paul Walker",    "Paule Walkr",   "Heiko Knaup",
                "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
                )

library(stringdist)
output <- list()
for (row in 1:nrow(test)) 
{
  codephon = phonetic(test[row,], method = c("soundex"), useBytes = FALSE)
  output[[row]] <- codephon
}

#building the matrix with soundex input
phoneticmatrix = matrix(output)
soundexspalten=str_split_fixed(phoneticmatrix, ",", 3)
#> Error in str_split_fixed(phoneticmatrix, ",", 3): could not find function "str_split_fixed"
soundexmatrix0 = gsub('[()c"]', '', soundexspalten)
#> Error in gsub("[()c\"]", "", soundexspalten): object 'soundexspalten' not found
soundexmatrix1 = gsub("0000", "", soundexmatrix0)
#> Error in gsub("0000", "", soundexmatrix0): object 'soundexmatrix0' not found

Created on 2021-06-03 by the reprex package (v2.0.0)

Now I want to replace all duplicates in soundexmatrix1 with the string "DUPLICATE", so that the dimensions of the matrix stay the same and all duplicates can be seen straight away.

Any ideas how to do that? Thanks for your help!


Solution

  • To check for duplicates within each row (see Update), the following should achieve what you want, in a cleaner fashion:

    # Feel free to load the packages you're using.
    # library(stringdist)
    # library(tibble)
    
    test <- tibble::tribble(
      ~Name1,           ~Name2,           ~Name3,
      "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
      "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
    )
    
    # Get phonetic codes cleanly.
    result <- as.matrix(apply(X = test, MARGIN = 2,
                              FUN = stringdist::phonetic, method = c("soundex"), useBytes = FALSE))
    
    # Find all blank codes ("0000").
    blanks <- result == "0000"
    
    # # Find all duplicates, as compared across ENTIRE matrix; ignore blank codes.
    # all_duplicates <- !blanks & duplicated(result, MARGIN = 0)
    
    # Find duplicates, as compared within EACH ROW; ignore blank codes.
    row_duplicates <- !blanks & t(apply(X = result, MARGIN = 1, FUN = duplicated))
    
    # Replace blank codes ("0000") with blanks (""); and replace duplicates (found
    # within rows) with "DUPLICATE".
    result[blanks] <- ""
    result[row_duplicates] <- "DUPLICATE"
    
    # View result.
    result
    

    The result should be the following matrix:

         Name1  Name2       Name3 
    [1,] "P442" "DUPLICATE" "H225"
    [2,] "F635" "DUPLICATE" "M246"
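
    The `t()` in the `row_duplicates` line matters because `apply()` with `MARGIN = 1` hands back each row's result as a column. A minimal base-R sketch of the shapes involved, assuming a hard-coded matrix `codes` as a hypothetical stand-in for the Soundex codes of the example above before replacement:

    ```r
    # Hypothetical stand-in for the Soundex code matrix (2 rows, 3 columns).
    codes <- matrix(
      c("P442", "P442", "H225",
        "F635", "F635", "M246"),
      nrow = 2, byrow = TRUE,
      dimnames = list(NULL, c("Name1", "Name2", "Name3"))
    )

    # apply() over rows stores each row's result as a COLUMN: 3 x 2 here.
    raw <- apply(X = codes, MARGIN = 1, FUN = duplicated)
    dim(raw)

    # Transposing restores the 2 x 3 shape, so it lines up with `codes`.
    row_duplicates <- t(raw)
    dim(row_duplicates)

    # Logical indexing then replaces only the flagged cells.
    marked <- codes
    marked[row_duplicates] <- "DUPLICATE"
    marked
    ```

    Without the transpose, the logical mask would have the wrong shape and would flag the wrong cells when used as an index.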
    

    Update

    Per the poster's request, I have altered the code to check for duplicates only within each row, rather than across the entire result matrix. Now, a test dataset like

    test <- tibble::tribble(
        ~Name1,           ~Name2,           ~Name3,
        "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
        "Ferdinand Bass", "Ferdinand Base", "Michael Herre",
        "",               "01234 56789",    "Heiko Knaup"
    # | ^^              | ^^^^^^^^^^^^^   | ^^^^^^^^^^^^^                   |
    # | Coded as "0000" | Coded as "0000" | Duplicate in matrix, NOT in row |
    )
    

    will give a result like

         Name1  Name2       Name3 
    [1,] "P442" "DUPLICATE" "H225"
    [2,] "F635" "DUPLICATE" "M246"
    [3,] ""     ""          "H225"
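
    To make the difference between the two comparison scopes concrete, here is a small base-R sketch. It assumes a hard-coded matrix `codes` as a hypothetical stand-in for the Soundex codes of the three-row dataset, taking the `"0000"` codes from the comments above as given. For the whole-matrix case it runs `duplicated()` on the flattened matrix, which is one way to get an element-wise comparison:

    ```r
    # Hypothetical stand-in for the Soundex codes of the three-row dataset.
    codes <- matrix(
      c("P442", "P442", "H225",
        "F635", "F635", "M246",
        "0000", "0000", "H225"),
      nrow = 3, byrow = TRUE
    )
    blanks <- codes == "0000"

    # Whole-matrix scope: flatten (column-major), flag repeats, restore the shape.
    all_duplicates <- !blanks &
      matrix(duplicated(as.vector(codes)), nrow = nrow(codes))

    # Row scope: flag repeats within each row only.
    row_duplicates <- !blanks &
      t(apply(X = codes, MARGIN = 1, FUN = duplicated))

    # Cells flagged matrix-wide but not within their own row.
    only_global <- which(all_duplicates & !row_duplicates, arr.ind = TRUE)
    only_global
    ```

    In this sketch the only cell that differs between the two scopes is row 3, column 3, the second "H225". Note that the whole-matrix scan runs in column-major order, so which occurrence counts as the first one depends on column position before row position.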