Search code examples
rlinuxmergematchmultiple-columns

Matching the same lines from 2 different files and 2 columns


I have two files, and I would like to merge them based on the first two columns, if the lines match, and add the third column of file2 to file1.

File1

CHR BP SNP ZSCORE
1   2534 rs123 0.5
1   2860 rs225 0.1
1   10000 rs356 0.8

File2

CHR BP AF
1   2534 0.02
1   6538 0.1
1   12345 0.1

What I want is:

File3

CHR BP SNP ZSCORE AF
1   2534 rs123 0.5 0.02

I tried R's merge(), but it led to just adding the columns together in the new dataset:

merged<-merge(File1, File2, by.x = "BP", by.y = "CHR", all = TRUE)
merged<-merge(File1, File2, by.x = "CHR", by.y = "BP", all = TRUE)

The next merge() I tried:

merged <- merge(File1[c('CHR','BP')],File2)

produced additional rows, which I don't think it's correct. How can matching a file of 17 million rows against the file of 17,200,000 rows produce 18,000,000 rows?!

Tried dplyr left join:

merged<-dplyr::left_join(File1, File2, by = c("CHR", "BP"))

Warning message:
In dplyr::left_join(File1, File2, by = c("CHR", "BP")) :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 3570136 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.

Reversing the files led to the same error.

I also tried various versions of awk and join, but didn't produce what I wanted.

If anyone knows the solution, please let me know, doesn't matter whether it's an R solution, or awk/join-all.


Solution

  • Using the input in the Note at the end

    1) merge This uses base R merge

    merge(File1, File2, by = 1:2)
    ##   CHR   BP   SNP ZSCORE   AF
    ## 1   1 2534 rs123    0.5 0.02
    

    2) inner_join This uses dplyr inner_join:

    library(dplyr)
    inner_join(File1, File2, join_by(CHR, BP))
    ##   CHR   BP   SNP ZSCORE   AF
    ## 1   1 2534 rs123    0.5 0.02
    

    3 sqldf If the files are too large it can be done outside of R provided the result is still small enough for R using sqldf:

    # create test files
    write.csv(File1, "File1.csv", quote = FALSE, row.names = FALSE)
    write.csv(File2, "File2.csv", quote = FALSE, row.names = FALSE)
    rm(File1, File2)
    
    sqldf("select * from File1 inner join File2 using (CHR, BP)", 
      dbname = tempfile())
    ##   CHR   BP   SNP ZSCORE   AF
    ## 1   1 2534 rs123    0.5 0.02
    

    4) xsv Another way to do it outuside of R is to use the xsv command line utility. Assuming the same setup as for sqldf

    library(data.table)
    fread("xsv join CHR,BP File1.csv CHR,BP File2.csv")
    ##    CHR   BP   SNP ZSCORE CHR   BP   AF
    ## 1:   1 2534 rs123    0.5   1 2534 0.02
    

    Note

    Lines1 <- "
    CHR BP SNP ZSCORE
    1   2534 rs123 0.5
    1   2860 rs225 0.1
    1   10000 rs356 0.8e"
    
    Lines2 <- "CHR BP AF
    1   2534 0.02
    1   6538 0.1
    1   12345 0.1"
    
    File1 <- read.table(text = Lines1, header = TRUE)
    File2 <- read.table(text = Lines2, header = TRUE)