Search code examples
rrsqlite

R: How to remove values from a table that appear in another table?


I have data that looks like this:

> head(dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes"'))
      gene_id sample_1 sample_2 status value_1 value_2 log2_fold_change test_stat p_value   q_value significant
1 XLOC_000219       M4       M3     OK 3.85465 0.00000             -Inf        NA   5e-05 0.0075951         yes
2 XLOC_004272       M4       M3     OK 2.06687 0.00000             -Inf        NA   5e-05 0.0075951         yes
3 XLOC_004991       M4       M3     OK 3.29904 0.00000             -Inf        NA   5e-05 0.0075951         yes
4 XLOC_007234       M4       M3     OK 1.28027 0.00000             -Inf        NA   5e-05 0.0075951         yes
5 XLOC_000664       M4       F4     OK 1.46853 0.00000             -Inf        NA   5e-05 0.0075951         yes
6 XLOC_001809       M4       F4     OK 0.00000 1.91743              Inf        NA   5e-05 0.0075951         yes

I've produced two subsets with RSQLite:

M4M3 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "M3"')

M4F4 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "F4"')

I'd like to remove all values from M4M3 that have a matching gene_id in M4F4. It's not important that I use RSQLite to filter the data set, it could be a pure R solution, but I'm not sure how to compare tables and remove rows from one based on another.

Thanks for any advice!


Solution

  • There's a bunch of ways to do this.

    Base R subset solution (as noted by Balter above):

    M4M3.new <- M4M3[!(M4M3$gene_id %in% M4F4$gene_id),]
    

    Base R set union solution:

    M4M3.new <- setdiff(M4M3, M4F4)
    

    Dplyr solution

    M4M3.new <- dplyr::anti_join(M4M3, 
                                 M4F4, 
                                 by = c("gene_id" = "gene_id"))
    

    Edit: All appeared to work tested on the following dataset:

    tst1 <- data.frame(gene_id = seq(1:10), 
                       sample_1 = rep("M4", 10), 
                       sample_2 = c(rep("M3", 6), rep("F4", 4)), 
                       other_values = sample(1:10, 10, replace = T),
                       other_values2 = rep("OK", 10))
    
    M4M3 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2  == "M3",]
    M4F4 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2  == "F4",]