Search code examples
rsplitextractmixed

Extracting numeric values from mixed variable column in R


Very new to R and coding in general, so having some serious issues! I have a dataframe containing 1,887,181 entries from a GWAS. I am trying to use the qqman package in order to create a manhattan plot. The issue I am having is that the data I need to create a BP column is given in a mixed string format with the chromosome number in an ID column. For the top associated SNPs (of which there are only 45 entries) I've been using as.numeric_version() in order to extract the numeric part from the ID column which works but is definitely an incorrect way to go about it, as it gives the following error, but still outputs the correct values:

BP<-as.numeric_version(PDLS_top_mp$CHR)

Error: invalid version specification ‘2R_19509740_SNP’,
‘2R_19509741_SNP’, ‘2R_15116838_SNP’, ‘3R_10159602_SNP’, 
‘2L_2855787_INS’, ‘X_11169099_SNP’, ‘2L_2855805_SNP’, 
‘3R_19567990_SNP’, ‘2R_19751690_SNP’, ‘2L_17155446_INS’,
‘3R_4610617_SNP’, ‘3L_1114964_SNP’, ‘2L_17155370_SNP’, 
‘3L_3548725_SNP’, ‘2R_16873693_SNP’, ‘2R_15527873_SNP’, 
‘3L_3548727_SNP’, ‘2L_2855816_SNP’, ‘3R_19576606_SNP’, 
‘3L_11508729_DEL’, ‘2L_17155422_SNP’, ‘3R_22503130_INS’, 
‘3R_4610630_SNP’, ‘2R_12331096_SNP’, ‘3R_8777795_SNP’, 
‘3L_10615500_SNP’, ‘X_17577589_SNP’, ‘X_17577622_SNP’, 
‘2L_7047925_SNP’, ‘2L_7047917_SNP’, ‘3R_9564478_SNP’, ‘3R_9590636_SNP’, 
‘3R_3614137_SNP’, ‘2R_4682340_INS’, ‘3L_6056407_SNP’, 
‘3L_11503648_SNP’, ‘3R_9580772_SNP’, ‘2L_18252469_SNP’, 
‘2L_7047939_SNP’, ‘3R_25530801_DEL’, ‘3L_1111454_SNP’, 
‘2L_17324557_SNP’, ‘3L_9799858_SNP�

I attempted to set a regexp and replace the values by using the following which also didn't work:

regexp<-"_[[:digit:]]_+"
s<-str_extract(PDLS_top_mp$ID,regexp)
View(s)
Error in View : 'names' attribute [2] must be the same length as the vector [1]

The as.numeric_version() doesn't work for the very large data frame containing all the associated SNPs it just returns NA for every entry. Here's what the all associated SNP data frame looks like:

 print(head(PDLS_ALL_dtf))
 # A tibble: 6 × 3

      ID      SinglePval SingleMixedPval
    <chr>      <dbl>           <dbl>
  2L_5372_SNP     0.6712        7.328082
  2L_5390_SNP     0.8176        8.459388
  2L_5403_SNP     0.9967        8.304397
  2L_5465_SNP     0.7897        6.187142
  2L_5516_SNP     0.4967        5.223919
  2L_5598_SNP     0.7099        5.169412

I would like it to end up like this: (where there are 5 different chromosome IDs, X, 2L, 2R, 3L, 3R and 4 all in the above format)

    ID       BP    SinglePval SingleMixedPval
   <chr>   <num>    <dbl>           <dbl>
 1       2L       5372   0.6712        7.328082
 2       2L       5390   0.8176        8.459388
 3       2L       5403   0.9967        8.304397
 4       2L       5465   0.7897        6.187142
 5       2L       5516   0.4967        5.223919
 6       2L       5598   0.7099        5.169412

Any help would be GREATLY appreciated!


Solution

  • Could use dplyr/tidyr:

    df_new<-df%>%separate(ID,into=c("ID","BP"),sep="_")
    

    Note : There will be a warning because of the second underscore (_SNP). But I am guessing it doesn't matter in this case.

    Warning message:
    Too many values at 7 locations: 1, 2, 3, 4, 5, 6, 7 
    

    In case you want to keep it:

    df_new<-df%>%separate(ID,into=c("ID","BP"),extra="merge")
    

    or drop it:

    df_new<-df%>%separate(ID,into=c("ID","BP"),extra="drop")
    
    df_new
      ID   BP SinglePval SingleMixedPval
    1 2L 5372     0.6712        7.328082
    2 2L 5390     0.8176        8.459388
    3 2L 5403     0.9967        8.304397
    4 2L 5465     0.7897        6.187142
    5 2L 5516     0.4967        5.223919
    6 2L 5598     0.7099        5.169412
    7 2L 5598     0.7099        5.169412