
How to subset data using multiple characters in a column


This is a very simple question.

I have a lengthy dataset and want to create a subset based on certain entries in a particular column. In this case, I am setting it up like this:

Example data:

> NL
SNP    alleles
rs1234 A_T
rs1235 A_G
rs2343 A_T
rs2342 G_C
rs1134 C_G
rs1675 T_A
rs8543 A_T
rs2842 G_A

P <- subset(NL, alleles = "A_T", alleles = "T_A", alleles = "G_C", alleles = "C_G")

This runs without error, but the resulting P is not subset in any way (tail of P still shows same number of entries as original NL).

What am I doing wrong?


Solution

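  • The most obvious error is using "=" when you mean "==". With "=", each alleles = "..." is passed to subset() as a named argument that it ignores, so no row condition is ever supplied and every row is kept. But I'm guessing from context that you really want to "split" this data: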

    split(NL, NL$alleles)
    

    This creates a list of data frames, one for each distinct value of alleles.

    But perhaps you do want to use pattern matching:

    NL[ grepl("C_G|G_C|A_T|T_A", NL$alleles), ]
         SNP alleles
    1 rs1234     A_T
    3 rs2343     A_T
    4 rs2342     G_C
    5 rs1134     C_G
    6 rs1675     T_A
    7 rs8543     A_T
    

    And illustrating with what I think was your comment-example:

    P <- read.table(text="V1 V2 V3 V4 V5 V6 alleles
     15116 25 rsX 0 123412 G A G_A 
    15117 25 rsX1 0 23432 A C A_C 
    15118 25 rsX2 0 234324 A G A_G 
    15119 25 rsX3 0 3423 A G A_G 
    15120 25 rsX4 0 2343223 C A C_A 
    15121 25 rsX5 0 23523423 A G A_G", header=TRUE)
    
    P[ grepl("G_A", P$alleles), ]
    
    #       V1  V2 V3     V4 V5 V6 alleles
    # 15116 25 rsX  0 123412  G  A     G_A
    

    The subset version:

     subset(P, alleles %in% c("G_A", "A_G") )
    
          V1   V2 V3       V4 V5 V6 alleles
    15116 25  rsX  0   123412  G  A     G_A
    15118 25 rsX2  0   234324  A  G     A_G
    15119 25 rsX3  0     3423  A  G     A_G
    15121 25 rsX5  0 23523423  A  G     A_G
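
To tie the points above back to the original call, here is a minimal sketch that rebuilds NL from the data in the question and compares the "=" call with two corrected versions. The names unfiltered, P1, P2 and by_allele are only illustrative, and stringsAsFactors = FALSE is an assumption about how the column was read in.

```r
# Rebuild the example data from the question.
NL <- data.frame(
  SNP     = c("rs1234", "rs1235", "rs2343", "rs2342",
              "rs1134", "rs1675", "rs8543", "rs2842"),
  alleles = c("A_T", "A_G", "A_T", "G_C", "C_G", "T_A", "A_T", "G_A"),
  stringsAsFactors = FALSE  # assumption: alleles is a character column
)

# Original style of call: `alleles = ...` is treated as a named argument,
# not a comparison, so no row condition is given and all rows come back.
unfiltered <- subset(NL, alleles = "A_T", alleles = "T_A")
nrow(unfiltered) == nrow(NL)

# Fix 1: compare with `==` and combine the conditions with `|` (OR).
P1 <- subset(NL, alleles == "A_T" | alleles == "T_A" |
                 alleles == "G_C" | alleles == "C_G")

# Fix 2: the same filter written more compactly with %in%.
P2 <- subset(NL, alleles %in% c("A_T", "T_A", "G_C", "C_G"))

identical(P1, P2)
nrow(P1)

# The split() route: one data frame per allele value, picked out by name.
by_allele <- split(NL, NL$alleles)
by_allele[["A_T"]]
```

Unlike the grepl() pattern, both "==" and %in% match whole values only, which may matter if the column ever contains longer strings that merely include one of these codes.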