I have the following test code:
####TESTING HERE
test = tibble::tribble(
~Name1, ~Name2, ~Name3,
"Paul Walker", "Paule Walkr", "Heiko Knaup",
"Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)
library(stringdist)
output <- list()
for (row in 1:nrow(test))
{
codephon = phonetic(test[row,], method = c("soundex"), useBytes = FALSE)
output[[row]] <- codephon
}
#building the matrix with soundex input
phoneticmatrix = matrix(output)
soundexspalten=str_split_fixed(phoneticmatrix, ",", 3)
#> Error in str_split_fixed(phoneticmatrix, ",", 3): could not find function "str_split_fixed"
soundexmatrix0 = gsub('[()c"]', '', soundexspalten)
#> Error in gsub("[()c\"]", "", soundexspalten): object 'soundexspalten' not found
soundexmatrix1 = gsub("0000", "", soundexmatrix0)
#> Error in gsub("0000", "", soundexmatrix0): object 'soundexmatrix0' not found
Created on 2021-06-03 by the reprex package (v2.0.0)
Now I want to replace all duplicates in soundexmatrix1 with the string "DUPLICATE", so that the dimensions of the matrix stay the same and all duplicates can be seen straight away.
Any ideas how to do that? Thanks for your help!
To check for duplicates within each row (see Update), this should achieve what you want, and in a cleaner fashion:
# Feel free to load the packages you're using.
# library(stringdist)
# library(tibble)
test <- tibble::tribble(
~Name1, ~Name2, ~Name3,
"Paul Walker", "Paule Walkr", "Heiko Knaup",
"Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)
# Get phonetic codes cleanly.
result <- as.matrix(apply(X = test, MARGIN = 2,
FUN = stringdist::phonetic, method = c("soundex"), useBytes = FALSE))
# Find all blank codes ("0000").
blanks <- result == "0000"
# # Find all duplicates, as compared across ENTIRE matrix; ignore blank codes.
# all_duplicates <- !blanks & duplicated(result, MARGIN = 0)
# Find duplicates, as compared within EACH ROW; ignore blank codes.
row_duplicates <- !blanks & t(apply(X = result, MARGIN = 1, FUN = duplicated))
# Replace blank codes ("0000") with blanks (""); and replace duplicates (found
# within rows) with "DUPLICATE".
result[blanks] <- ""
result[row_duplicates] <- "DUPLICATE"
# View result.
result
The result should be the following matrix:
     Name1  Name2       Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
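A note on the t() in the row-wise step: apply() with MARGIN = 1 returns each row's result as a column, so the logical flags come back transposed and have to be flipped before they line up with the matrix. A small base-R sketch, with the codes hard-coded from the result above so that no packages are needed (the names codes, raw_flags and row_flags are illustrative, not part of the code above):

```r
# Soundex codes copied from the result above (hard-coded; an assumption made
# so the sketch runs without stringdist).
codes <- matrix(
  c("P442", "P442", "H225",
    "F635", "F635", "M246"),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("Name1", "Name2", "Name3"))
)

# apply() over rows stores each row's result as a COLUMN of the output.
raw_flags <- apply(X = codes, MARGIN = 1, FUN = duplicated)
dim(raw_flags)   # 3 2

# Transposing restores the 2 x 3 layout, so the flags line up with `codes`.
row_flags <- t(raw_flags)
dim(row_flags)   # 2 3

# Only the second entry of each row repeats an earlier entry in that row.
row_flags
```

Without the transpose, the flags would have the wrong shape and could not be used directly as a logical index into the matrix.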
Update: Per the poster's request, I have altered the code to compare for duplicates only within each row, rather than across the entire result matrix. Now, a test dataset like
test <- tibble::tribble(
~Name1, ~Name2, ~Name3,
"Paul Walker", "Paule Walkr", "Heiko Knaup",
"Ferdinand Bass", "Ferdinand Base", "Michael Herre",
"", "01234 56789", "Heiko Knaup"
# | ^^ | ^^^^^^^^^^^^^ | ^^^^^^^^^^^^^ |
# | Coded as "0000" | Coded as "0000" | Duplicate in matrix, NOT in row |
)
will give a result like
     Name1  Name2       Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
[3,] ""     ""          "H225"
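To see why the "H225" in the third row is left alone, the two strategies can be compared side by side. This is a minimal sketch in base R, assuming the Soundex codes shown above and hard-coding them so that stringdist is not required; the name codes is illustrative.

```r
# Codes as they would look before replacement (hard-coded assumption,
# taken from the results and comments above).
codes <- matrix(
  c("P442", "P442", "H225",
    "F635", "F635", "M246",
    "0000", "0000", "H225"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Name1", "Name2", "Name3"))
)

# Blank codes should never be flagged as duplicates.
blanks <- codes == "0000"

# Element-wise comparison across the ENTIRE matrix.
all_duplicates <- !blanks & duplicated(codes, MARGIN = 0)

# Comparison within EACH ROW only.
row_duplicates <- !blanks & t(apply(X = codes, MARGIN = 1, FUN = duplicated))

# "H225" in row 3 repeats row 1, so only the matrix-wide check flags it.
all_duplicates[3, "Name3"]   # TRUE
row_duplicates[3, "Name3"]   # FALSE
```

Swapping row_duplicates for all_duplicates in the replacement step would therefore also mark the third row's "H225" as "DUPLICATE".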