Tags: r, dataframe, match, mismatch

Select missing rows in different dataframes


I have two data frames, list1 and list2:

> head(list1)
       RS_ID CHROM       POS REF_ALLELE ALT_ALLELE AF_REF_allsamples
1 rs77599058     1 195680131          C          T            0.9996
2 rs73056353     1 195680971          A          G            0.9999
3 rs12130880     1 195681419          A          T            0.5475
4 rs76457267     1 195681460          A          C            0.9993
5 rs10921893     1 195681616          T          C            0.5060
6 rs75239769     1 195682022          G          A            0.9999
  AF_ALT_allsamples AF_REF_onlycontrol AF_ALT_onlycontrol pvalues
1            0.0004             0.9996             0.0004  0.7830
2            0.0001             0.9998             0.0002  0.3740
3            0.4525             0.5442             0.4558  0.0597
4            0.0007             0.9992             0.0008  0.3590
5            0.4940             0.5099             0.4901  0.0302
6            0.0001             1.0000             0.0000  0.5500
> head(list2)
       RS_ID CHROM       POS REF_ALLELE ALT_ALLELE AF_REF_allsamples
1 rs77599058     1 195680131          C          T            0.9996
2 rs73056353     1 195680971          A          G            0.9999
3 rs12130880     1 195681419          A          T            0.5475
4 rs76457267     1 195681460          A          C            0.9993
5 rs10921893     1 195681616          T          C            0.5060
6 rs75239769     1 195682022          G          A            0.9999
  AF_ALT_allsamples AF_REF_onlycontrol AF_ALT_onlycontrol pvalues
1            0.0004             0.9996             0.0004  0.7830
2            0.0001             0.9998             0.0002  0.3740
3            0.4525             0.5442             0.4558  0.0597
4            0.0007             0.9992             0.0008  0.3590
5            0.4940             0.5099             0.4901  0.0302
6            0.0001             1.0000             0.0000  0.5500
> dim(list1)
[1] 235111     10
> dim(list2)
[1] 234520     10

As dim() shows, list1 has 591 more rows than list2. I now want a new data frame containing all rows of list1 that are not in list2 (those 591).

I tried

> match_diff=list1[!(list1 %in% list2)]
> dim(match_diff)
[1] 235111     10

but, as you can see, the result has as many rows as list1, as if every row of list1 differed from list2.

I checked with str() for an underlying cause, but both data frames have the same structure (they originate from the same raw data).

I can't match on a single column; I must compare each row as a whole.


Solution

  • This is a database join operation; searching for "joins" will turn up more information on the different kinds. As @starja said, you want anti_join from dplyr:

    If you don't already have dplyr, install it with install.packages('dplyr').

    R> list1 <- data.frame(a=0:5, b=10:15)
    R> list2 <- data.frame(a=(0:5)+3, b=(10:15)+3)
    R> list1
      a  b
    1 0 10
    2 1 11
    3 2 12
    4 3 13
    5 4 14
    6 5 15
    R> list2
      a  b
    1 3 13
    2 4 14
    3 5 15
    4 6 16
    5 7 17
    6 8 18
    R> list3 <- dplyr::anti_join(list1, list2)
    Joining, by = c("a", "b")
    R> list3
      a  b
    1 0 10
    2 1 11
    3 2 12
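As for why the original attempt kept everything: `%in%` treats a data frame as a list of columns, so `list1 %in% list2` gives one logical value per column rather than per row, and `list1[...]` with a single index then selects columns. The sketch below shows this on the same toy data as above, together with a base R alternative for when dplyr is not available. The helper name `row_key` and the `"\r"` separator are assumptions made for illustration; the key approach compares the character representation of each row, so it may treat numeric values that differ only in the last few digits as equal.

```r
# Toy data mirroring the example above (not the asker's real data).
list1 <- data.frame(a = 0:5, b = 10:15)
list2 <- data.frame(a = (0:5) + 3, b = (10:15) + 3)

# %in% sees a data frame as a list of columns, so this yields one
# logical value per column, not one per row.
col_flags <- list1 %in% list2
length(col_flags) == ncol(list1)

# Hypothetical helper: collapse each row into a single key string.
# Assumption: the "\r" separator never occurs in the data itself.
row_key <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\r"))
}

# Keep the rows of list1 whose key does not appear among list2's keys.
only_in_list1 <- list1[!(row_key(list1) %in% row_key(list2)), , drop = FALSE]
only_in_list1

# Cross-check against dplyr, naming the join columns explicitly,
# but only if dplyr is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  via_dplyr <- dplyr::anti_join(list1, list2, by = names(list1))
  stopifnot(nrow(via_dplyr) == nrow(only_in_list1))
}
```

On the real data, it is worth checking `nrow()` of the result rather than assuming it is 591: the difference in `dim()` only equals the number of unmatched rows if every row of list2 also occurs in list1 and neither data frame contains duplicate rows.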