
Recycling error while using stringdist and data.table in R


I am trying to perform approximate string matching on a data.table of author names against a dictionary of first names. I have also set a high similarity threshold (above 0.9) to improve the quality of the matches.

However, I get the warning message below:

Warning message:
In `[<-.data.table`(x, j = name, value = value) :
  Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).

This warning occurs even if I round the similarity score to 4 significant digits using signif(similarity_score, 4).

Some more information about the input data and approach:

  1. The author_corrected_df is a data.table containing the columns "Author" and "Author_Corrected". Author_Corrected is a letters-only version of the corresponding Author (e.g. if Author = Jack123, then Author_Corrected = Jack).
  2. The Author_Corrected column can contain variations of a proper first name, e.g. Jackk instead of Jack, and I would like to populate the corresponding gender in a column of author_corrected_df called Gender_Dict.
  3. Another data.table called first_names_dict contains the 'name' (i.e. first name) and gender (0 for female, 1 for male, 2 for ties).
  4. For each row, I would like to find the closest match to "Author_Corrected" among the 'name' values in first_names_dict and populate the corresponding gender (one of 0, 1, 2).
  5. To make the string matching more stringent, I use a threshold of 0.9720; values that do not meet it are set to NA later in the code (not shown below).
  6. The first_names_dict and the author_corrected_df can be accessed from the link below: https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7

    for (ijk in 1:nrow(author_corrected_df)){
      max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
      if (signif(max_sim1,4) >= 0.9720){
        row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
        author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
      } else {
        next
      }
    }

I would appreciate help understanding where the problem lies, and whether there is a faster way to perform this kind of matching (though the latter is a lower priority).

Thanks in advance.


Solution

  • Following the earlier comments, here I select the most frequent gender among the matched entries:

    for (ijk in 1:nrow(author_corrected_df)){
            max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
            if (signif(max_sim1,4) >= 0.9720){
                    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    
                    # Gender of the selected dictionary entry, converted from factor to character
                    gender <- as.character( first_names_dict$gender[row_idx1] )
    
                    # Take the (first) most frequent gender in the selection
                    df_count <- as.data.frame( table(gender) )
                    ref <- as.character( df_count$gender[which.max(df_count$Freq)] )
                    value <- unique( gender[which(gender == ref)] )
    
                    # Assign the single character value to the data.table
                    author_corrected_df$Gender_Dict[ijk] <- value
            }
    }
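
    A second possible cause is worth ruling out, assuming the Gender_Dict column does not exist in author_corrected_df before the loop starts (the question does not say either way). In that case author_corrected_df$Gender_Dict is NULL, and assigning to element ijk of NULL produces a vector of length ijk, which data.table then tries to fit into the 17789-row column. A first match at row 6 would give exactly "Supplied 6 items". A minimal base-R sketch of that padding behaviour (gender_dict is a hypothetical stand-in for the missing column):

```r
# Stand-in for `author_corrected_df$Gender_Dict` when the column is missing:
# `$` on a column that does not exist returns NULL.
gender_dict <- NULL

# Indexed assignment into NULL pads every earlier position with NA,
# so the result has length 6 rather than length 1.
gender_dict[6] <- 1L

length(gender_dict)        # 6
sum(is.na(gender_dict))    # 5

# 17789 is not a multiple of 6; the remainder matches the warning text.
17789 %% 6                 # 5
```

    If this is what happens, creating the column before the loop, for example author_corrected_df[, Gender_Dict := NA_integer_] (with an NA of the same type as gender), gives every assignment a full-length column to write into.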
    

    Hope this helps :)
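
    On the secondary question of speed, the loop calls stringsim twice per row with identical arguments. The following is only a sketch under assumptions: the two small tables are made-up stand-ins for the real data, gender is assumed to be stored as integer, and the threshold variable is introduced here for readability. It computes the similarity vector once per row, pre-creates Gender_Dict, and writes each cell with data.table's set().

```r
library(data.table)
library(stringdist)

# Toy stand-ins for the real tables (hypothetical data, not from the question).
author_corrected_df <- data.table(
  Author           = c("Jack123", "Jackk7", "Mary_9", "Zzq42"),
  Author_Corrected = c("Jack", "Jackk", "Mary", "Zzq")
)
first_names_dict <- data.table(
  name   = c("Jack", "Mary", "Alex"),
  gender = c(1L, 0L, 2L)
)

threshold <- 0.9720

# Create the target column up front so every later write hits one existing cell.
author_corrected_df[, Gender_Dict := NA_integer_]

for (i in seq_len(nrow(author_corrected_df))) {
  # Compute the similarity vector once and reuse it for the max and its index.
  sims <- stringsim(author_corrected_df$Author_Corrected[i],
                    first_names_dict$name, method = "jw", p = 0.1)
  best <- which.max(sims)
  if (length(best) == 1L && signif(sims[best], 4) >= threshold) {
    # set() updates a single cell by reference, without copying the table.
    set(author_corrected_df, i = i, j = "Gender_Dict",
        value = first_names_dict$gender[best])
  }
}

print(author_corrected_df)
```

    Rows whose best similarity stays below the threshold simply keep the NA they were initialised with, so no separate else branch is needed.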