I apologize if this kind of question was already asked and solved. However, I tried to get to the solution without success.
I have two datasets in R, DF1
and DF2
; For example:
DF1 <- data.frame(point_code = c("358/5/UW_001", "123/4/UW_002", "089/6/UW_003"),
value = c(10, 20, 30))
DF2 <- data.frame(Sample.Number = c("AT/ACE/1/1/358/5/DA/3", "AT/ACE/2/2/123/4/DA/4", "AT/ACE/3/3/89/6/DA/5"),
score = c(100, 200, 300))
(N.B. the two datasets contain several more columns to merge)
I want to merge the two datasets by matching the part of the code in point_code that is contained in Sample.Number, such as 358/5. The desired output is something like this:
point_code value Sample.Number score
1 358/5/UW_001 10 AT/ACE/1/1/358/5/DA/3 100
2 123/4/UW_002 20 AT/ACE/2/2/123/4/DA/4 200
3 089/6/UW_003 30 AT/ACE/3/3/89/6/DA/5 300
I tried several ways to merge them.
First, by extracting the part of the ID code that I want to match
DF1$match_code <- sapply(strsplit(DF1$point_code, "/"), function(x) paste(x[1:2], collapse = "/"))
DF2$match_code <- sapply(strsplit(DF2$Sample.Number, "/"), function(x) paste(x[5:6], collapse = "/"))
merged_data <- merge(DF2, DF1[, c("Lat", "Lon", "var1_P", "var2_P", "var3_P")], by = "match_code")
however, the codes may be have different amount of numbers: e.g., "089/6" and "89/6" that does not allow to match them.
Second, I tried using the fuzzyjoin
package and the regex_left_join
function, but I got an error:
merged <- regex_left_join(DF1, DF2,
by = c(point_code = "Sample.Number"),
pattern = "(\\d+/\\d+)/")
Error in regex_left_join(phy, rawphyA, by = c(point_code = "ACE.Sample.Number"), :
unused argument (pattern = "(\\d+/\\d+)/")
Is there a more efficient (and working) way to merge the dfs by matching a partial string to a longer one? thank you
------- AFTER FIRST COMMENT------
The suggestion of @AllanCameron and @Mark are good. However, I do not get the correct solution still:
Lat.x Lon.x point_code Lat.y Lon.y Sample.Number
1 40.03039 28.87434 358/5/UW_001 NA NA <NA>
2 41.25814 29.80981 358/7/UW_003 41.2581 29.80981 AT/ACE/1/1/358/7/DA/3
3 42.31510 31.03547 358/9/UW_005 42.3151 31.03547 AT/ACE/1/1/358/9/DA/3
4 43.23883 32.42131 359/11/UW_007 44.5039 34.72199 AT/ACE/1/1/359/19/DA/3
5 43.23883 32.42131 359/11/UW_007 42.8013 31.73753 AT/ACE/1/1/359/10/DA/3
6 43.23883 32.42131 359/11/UW_007 43.2388 32.42131 AT/ACE/1/1/359/11/DA/3
7 43.23883 32.42131 359/11/UW_007 43.7092 33.15066 AT/ACE/1/1/359/12/DA/3
8 43.23883 32.42131 359/11/UW_007 44.1538 33.90936 AT/ACE/1/1/359/13/DA/3
9 44.15381 33.90936 359/13/UW_009 44.5039 34.72199 AT/ACE/1/1/359/19/DA/3
It duplicates the observations of DF1 (e.g., 359/11) and it does not pair the rows well.
I share the full two datasets (https://filetransfer.io/data-package/hgOG7cG7#link).
To illustrate Allan Cameron's suggestion/comment:
library(tidyverse)
left_join(mutate(DF1, join = str_extract(point_code, "[1-9]{2,3}/[0-9]{1}")),
mutate(DF2, join = str_extract(Sample.Number, "[1-9]{2,3}/[0-9]{1}")), by = "join")|>
select(-join)
Output:
point_code value Sample.Number score
1 358/5/UW_001 10 AT/ACE/1/1/358/5/DA/3 100
2 123/4/UW_002 20 AT/ACE/2/2/123/4/DA/4 200
3 089/6/UW_003 30 AT/ACE/3/3/89/6/DA/5 300
Of course, this is only three rows, so I might be making generalisations that don't work with your original dataset. If so, please provide more examples! :-)
Update: Here's a way with your actual dataset:
full_join(
mutate(df2, join = str_remove_all(Sample.Number, "/DA/\\d+|AT/ACE/\\d+/\\d+/")),
mutate(df1, join = str_remove(point_code, "/[A-Z].*") |> str_remove("^0+")), by = "join")