Search code examples
rdataframejoindplyrdata-cleaning

How to merge two datasets in R by partially matching text in columns?


I apologize if this kind of question was already asked and solved. However, I tried to get to the solution without success.

I have two datasets in R, DF1 and DF2; For example:

DF1 <- data.frame(point_code = c("358/5/UW_001", "123/4/UW_002", "089/6/UW_003"),
                  value = c(10, 20, 30))

DF2 <- data.frame(Sample.Number = c("AT/ACE/1/1/358/5/DA/3", "AT/ACE/2/2/123/4/DA/4", "AT/ACE/3/3/89/6/DA/5"),
                     score = c(100, 200, 300))

(N.B. the two datasets contain several more columns to merge)

I want to merge the two datasets by matching the part of the code in point_code that is contained in Sample.Number, such as 358/5. The desired output is something like this:

    point_code value Sample.Number score
1 358/5/UW_001    10 AT/ACE/1/1/358/5/DA/3   100
2 123/4/UW_002    20 AT/ACE/2/2/123/4/DA/4   200
3 089/6/UW_003    30 AT/ACE/3/3/89/6/DA/5   300

I tried several ways to merge them.

First, by extracting the part of the ID code that I want to match

DF1$match_code <- sapply(strsplit(DF1$point_code, "/"), function(x) paste(x[1:2], collapse = "/"))
DF2$match_code <- sapply(strsplit(DF2$Sample.Number, "/"), function(x) paste(x[5:6], collapse = "/"))

merged_data <- merge(DF2, DF1[, c("Lat", "Lon", "var1_P", "var2_P", "var3_P")], by = "match_code")

however, the codes may be have different amount of numbers: e.g., "089/6" and "89/6" that does not allow to match them.

Second, I tried using the fuzzyjoin package and the regex_left_join function, but I got an error:

merged <- regex_left_join(DF1, DF2, 
                          by = c(point_code = "Sample.Number"),
                          pattern = "(\\d+/\\d+)/")

Error in regex_left_join(phy, rawphyA, by = c(point_code = "ACE.Sample.Number"),  : 
  unused argument (pattern = "(\\d+/\\d+)/")

Is there a more efficient (and working) way to merge the dfs by matching a partial string to a longer one? thank you

------- AFTER FIRST COMMENT------

The suggestion of @AllanCameron and @Mark are good. However, I do not get the correct solution still:

 Lat.x    Lon.x    point_code   Lat.y     Lon.y      Sample.Number
1   40.03039 28.87434  358/5/UW_001      NA        NA                   <NA>
2   41.25814 29.80981  358/7/UW_003 41.2581  29.80981  AT/ACE/1/1/358/7/DA/3
3   42.31510 31.03547  358/9/UW_005 42.3151  31.03547  AT/ACE/1/1/358/9/DA/3
4   43.23883 32.42131 359/11/UW_007 44.5039  34.72199 AT/ACE/1/1/359/19/DA/3
5   43.23883 32.42131 359/11/UW_007 42.8013  31.73753 AT/ACE/1/1/359/10/DA/3
6   43.23883 32.42131 359/11/UW_007 43.2388  32.42131 AT/ACE/1/1/359/11/DA/3
7   43.23883 32.42131 359/11/UW_007 43.7092  33.15066 AT/ACE/1/1/359/12/DA/3
8   43.23883 32.42131 359/11/UW_007 44.1538  33.90936 AT/ACE/1/1/359/13/DA/3
9   44.15381 33.90936 359/13/UW_009 44.5039  34.72199 AT/ACE/1/1/359/19/DA/3

It duplicates the observations of DF1 (e.g., 359/11) and it does not pair the rows well.

I share the full two datasets (https://filetransfer.io/data-package/hgOG7cG7#link).


Solution

  • To illustrate Allan Cameron's suggestion/comment:

    library(tidyverse)
    
    left_join(mutate(DF1, join = str_extract(point_code, "[1-9]{2,3}/[0-9]{1}")), 
              mutate(DF2, join = str_extract(Sample.Number, "[1-9]{2,3}/[0-9]{1}")), by = "join")|>
      select(-join)
    

    Output:

        point_code value         Sample.Number score
    1 358/5/UW_001    10 AT/ACE/1/1/358/5/DA/3   100
    2 123/4/UW_002    20 AT/ACE/2/2/123/4/DA/4   200
    3 089/6/UW_003    30  AT/ACE/3/3/89/6/DA/5   300
    

    Of course, this is only three rows, so I might be making generalisations that don't work with your original dataset. If so, please provide more examples! :-)

    Update: Here's a way with your actual dataset:

    full_join(
    mutate(df2, join = str_remove_all(Sample.Number, "/DA/\\d+|AT/ACE/\\d+/\\d+/")),
    mutate(df1, join = str_remove(point_code, "/[A-Z].*") |> str_remove("^0+")), by = "join")