I have two files, and I would like to merge them based on the first two columns, if the lines match, and add the third column of file2 to file1.
File1
CHR BP SNP ZSCORE
1 2534 rs123 0.5
1 2860 rs225 0.1
1 10000 rs356 0.8
File2
CHR BP AF
1 2534 0.02
1 6538 0.1
1 12345 0.1
What I want is:
File3
CHR BP SNP ZSCORE AF
1 2534 rs123 0.5 0.02
I tried R's merge()
, but it led to just adding the columns together in the new dataset:
merged<-merge(File1, File2, by.x = "BP", by.y = "CHR", all = TRUE)
merged<-merge(File1, File2, by.x = "CHR", by.y = "BP", all = TRUE)
The next merge()
I tried:
merged <- merge(File1[c('CHR','BP')],File2)
produced additional rows, which I don't think it's correct. How can matching a file of 17 million rows against the file of 17,200,000 rows produce 18,000,000 rows?!
Tried dplyr left join:
merged<-dplyr::left_join(File1, File2, by = c("CHR", "BP"))
Warning message:
In dplyr::left_join(File1, File2, by = c("CHR", "BP")) :
Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 3570136 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
Reversing the files led to the same error.
I also tried various versions of awk and join, but didn't produce what I wanted.
If anyone knows the solution, please let me know, doesn't matter whether it's an R solution, or awk/join-all.
Using the input in the Note at the end
1) merge This uses base R merge
merge(File1, File2, by = 1:2)
## CHR BP SNP ZSCORE AF
## 1 1 2534 rs123 0.5 0.02
2) inner_join This uses dplyr inner_join
:
library(dplyr)
inner_join(File1, File2, join_by(CHR, BP))
## CHR BP SNP ZSCORE AF
## 1 1 2534 rs123 0.5 0.02
3 sqldf If the files are too large it can be done outside of R provided the result is still small enough for R using sqldf:
# create test files
write.csv(File1, "File1.csv", quote = FALSE, row.names = FALSE)
write.csv(File2, "File2.csv", quote = FALSE, row.names = FALSE)
rm(File1, File2)
sqldf("select * from File1 inner join File2 using (CHR, BP)",
dbname = tempfile())
## CHR BP SNP ZSCORE AF
## 1 1 2534 rs123 0.5 0.02
4) xsv Another way to do it outuside of R is to use the xsv command line utility. Assuming the same setup as for sqldf
library(data.table)
fread("xsv join CHR,BP File1.csv CHR,BP File2.csv")
## CHR BP SNP ZSCORE CHR BP AF
## 1: 1 2534 rs123 0.5 1 2534 0.02
Lines1 <- "
CHR BP SNP ZSCORE
1 2534 rs123 0.5
1 2860 rs225 0.1
1 10000 rs356 0.8e"
Lines2 <- "CHR BP AF
1 2534 0.02
1 6538 0.1
1 12345 0.1"
File1 <- read.table(text = Lines1, header = TRUE)
File2 <- read.table(text = Lines2, header = TRUE)