I am trying to perform approximate string matching on a data.table of author names against a dictionary of first names. I have set a high similarity threshold (above 0.9) to improve the quality of the matches.
However, I get the warning message below:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
The warning occurs even if I round the similarity score to 4 significant digits using signif(similarity_score, 4).
Some more information about the input data and approach:
for (ijk in 1:nrow(author_corrected_df)){
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1,4) >= 0.9720){
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
  } else {
    next
  }
}
I would appreciate help understanding where the problem lies, and whether there is a faster way to perform this kind of matching (the latter is a lower priority).
Thanks in advance.
Following the previous comments, here I select the gender that occurs most often among the matched entries:
for (ijk in 1:nrow(author_corrected_df)){
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1,4) >= 0.9720){
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    # Convert the (possibly factor) gender to character
    gender <- as.character(first_names_dict$gender[row_idx1])
    # Count each gender; table() yields the columns `gender` and `Freq`
    df_count <- as.data.frame(table(gender))
    # Take the (first) most frequent gender in the selection
    ref <- as.character(df_count$gender[which.max(df_count$Freq)])
    value <- unique(gender[which(gender == ref)])
    # Assign a single character value to the data frame
    author_corrected_df$Gender_Dict[ijk] <- value
  }
}
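As for the warning itself, one possible explanation, assuming the Gender_Dict column did not exist before the loop started (the question does not say), is that `x$col[i] <- value` first builds a vector of length `i` from an empty column and then assigns that short vector to the whole column. The sketch below uses base R only, and `authors` is a hypothetical stand-in for author_corrected_df:

```r
# Indexing into something that does not exist yet starts from NULL,
# so the result is only as long as the index that was used.
tmp <- NULL
tmp[6] <- "F"
length(tmp)   # 6

# 17789 rows is not a multiple of 6, which matches the reported remainder.
17789 %% 6    # 5

# Hypothetical stand-in for author_corrected_df: create the column first,
# so that the indexed assignment only touches one row.
authors <- data.frame(Author_Corrected = c("ann", "bob", "cy"),
                      stringsAsFactors = FALSE)
authors$Gender_Dict <- NA_character_
authors$Gender_Dict[2] <- "M"
```

If that assumption holds, adding `author_corrected_df$Gender_Dict <- NA_character_` before the loop should make the warning go away.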
Hope this helps :)
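On the secondary question of speed, a rough sketch of a loop-free variant using `amatch()` from the stringdist package, which returns the index of the closest dictionary entry within `maxDist`, or NA otherwise. The two small tables are hypothetical stand-ins for the real data. For Jaro-Winkler the similarity is 1 minus the distance, so `maxDist = 1 - 0.972` only approximates the `signif(..., 4) >= 0.9720` test in the loop, and borderline cases may differ:

```r
library(stringdist)

# Hypothetical stand-ins for the tables in the question.
author_corrected_df <- data.frame(
  Author_Corrected = c("maria", "jonh", "zzzz"),
  stringsAsFactors = FALSE
)
first_names_dict <- data.frame(
  name   = c("maria", "john", "anna"),
  gender = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

# One vectorised call: closest dictionary name per author, NA if none is
# within the allowed Jaro-Winkler distance.
idx <- amatch(author_corrected_df$Author_Corrected, first_names_dict$name,
              method = "jw", p = 0.1, maxDist = 1 - 0.972)

# Indexing with NA yields NA, so unmatched authors stay NA.
author_corrected_df$Gender_Dict <- as.character(first_names_dict$gender)[idx]
```

When several names tie for the smallest distance, `amatch()` returns the first one, which is the same tie-breaking as `which.max()` in the loop.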