Search code examples
rjoininner-joinfuzzyjoin

Join two large datasets in R using both exact and fuzzy matching


I'm trying to inner join two datasets: df1 of 50,000 obs looks something like this:

  Name              | Line.1           | Line.2     | Town       | County       | Postcode 
 -------------------|------------------|------------|------------|--------------|---------- 
  ACME Inc          | 63 Long Street   |            | Fakeington | Lincolnshire | PA4 8QU  
  BETA LTD          | 91a              | Main Drove | Cloud City | Something    | BN1 6LD  
  The Giga          | 344 Lorem Street |            | Ipsom      | Dolor        | G2 8LY   

df2 of 500,000 obs looks like this:

  Name              | AddressLine1   | AddressLine2     | AddressLine3 | AddressLine4 | Postcode | RatingValue 
 -------------------|----------------|------------------|--------------|--------------|----------|------------- 
  ACME              |                | 63 Long Street   | Fakeington   | Lincolnshire | PA4 8QU  | 1           
  Random Company    |                | Rose Ave         | Fakeington   |              | AB2 51GL | 5           
  BETA Limited      | Business House | 91a Main Drove   | Something    |              | BN1 6LD  | 3           
  Giga Incorporated |                | 344 Lorem Street | Ipsum        | Dolor        | G2 8LY   | 5           

And I want to get to something like df_final.

  Name              | Postcode | RatingValue 
 -------------------|----------|------------- 
  ACME Inc          | PA4 8QU  | 1           
  BETA LTD          | BN1 6LD  | 3           
  Giga Incorporated | G2 8LY   | 5           

These are one-to-one matches and all the values in df1 should exist in df2. Postcode is an exact match while the address is split into multiple lines with no regular pattern, so I think my best bet is to match by Name.

I tried the fuzzyjoin package but I'm getting an Error: cannot allocate vector of size 120.6 Gb so I guess I have to use another method that works with bigger datasets.

Any ideas about what the best way to approach this is?

df1 <- data.frame(
  stringsAsFactors = FALSE,
              Name = c("ACME Inc", "BETA LTD", "Giga Incorporated"),
            Line.1 = c("63 Long Street", "91a", "344 Lorem Street"),
            Line.2 = c(NA, "Main Drove", NA),
              Town = c("Fakeington", "Cloud City", "Ipsom"),
            County = c("Lincolnshire", "Something", "Dolor"),
          Postcode = c("PA4 8QU", "BN1 6LD", "G2 8LY")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
              Name = c("ACME", "Random Company","BETA Limited","Giga Incorporated"),
      AddressLine1 = c(NA, NA, "Business House", NA),
      AddressLine2 = c("63 Long Street", "Rose Ave","91a Main Drove","344 Lorem Street"),
      AddressLine3 = c("Fakeington", "Fakeington", "Something", "Ipsum"),
      AddressLine4 = c("Lincolnshire", NA, NA, "Dolor"),
          Postcode = c("PA4 8QU", "AB2 51GL", "BN1 6LD", "G2 8LY"),
       RatingValue = c(1L, 5L, 3L, 5L)
)

Solution

  • Maybe something like the following will do what the question asks for. It uses package stringdist, not fuzzyjoin.

    First, merge by Postcode only, since the matches are exact. Then get similarities between Name's. If they are above a predetermined threshold, keep those rows.

    thresh <- 0.75
    
    df_final <- merge(df2[c(1, 6:7)], df1[c(1, 6)], by = "Postcode", suffixes = c("",".y"))
    i <- apply(df_final[c(2, 4)], 1, function(x) {stringdist::stringsim(x[1], x[2], method = 'jw')}) >= thresh
    
    df_final <- df_final[i, c(2, 1, 3)]
    
    df_final 
    #               Name Postcode RatingValue
    #1      BETA Limited  BN1 6LD           3
    #2 Giga Incorporated   G2 8LY           5
    #3              ACME  PA4 8QU           1