r, string, dplyr, split, gsub

How to keep everything before "." in R using sub


I have a dataframe in R:

structure(list(chr = c(1, 1, 1, 1, 1), gene_id = c("ENSG00000223972.5", 
"ENSG00000227232.5", "ENSG00000278267.1", "ENSG00000243485.5", 
"ENSG00000237613.2"), gene_name = c("DDX11L1", "WASH7P", "MIR6859-1", 
"MIR1302-2HG", "FAM138A"), start = c(11869, 14410, 17369, 29571, 
34554), end = c(14403, 29553, 17436, 31109, 36081), gene_type = c("transcribed_unprocessed_pseudogene", 
"unprocessed_pseudogene", "miRNA", "lincRNA", "lincRNA")), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

I want to edit the gene_id column to keep only the part before the ".", for example:

ENSG00000223972.5 to ENSG00000223972

I did this:

gene_annot_parsed1 <- sub(".*^.","",gene_annot_parsed$gene_id)

But it gives this output:

dput(gene_annot_parsed1[1:2])
c("NSG00000223972.5", "NSG00000227232.5")

I just want to modify the gene_id column so that it keeps only what comes before the "." and leave the rest of the columns unchanged. In my case it removes the leading "E", and the result no longer contains the other columns. Does anyone know how to solve this? Thank you.


Solution

  • gene_annot_parsed$gene_id <- stringr::str_replace(gene_annot_parsed$gene_id, '\\..*$', '')
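
For comparison, here is a minimal base R sketch that needs no extra package. The attempt in the question, ".*^.", fails because "^" can only match at the start of the string, so ".*" matches nothing and the unescaped "." then consumes the first character, the "E". Escaping the dot as "\\." makes it literal, and ".*$" covers everything after it. The small data frame below is a cut-down copy of the question's data, used only for illustration. Assigning the result back to the column, rather than to a new object, is what keeps the other columns.

```r
# Example IDs copied from the question's data frame
gene_id <- c("ENSG00000223972.5", "ENSG00000227232.5", "ENSG00000278267.1")

# "\\." matches a literal dot; ".*$" matches everything after it to the end
stripped <- sub("\\..*$", "", gene_id)
print(stripped)

# Cut-down stand-in for gene_annot_parsed (illustration only)
gene_annot_parsed <- data.frame(
  chr = c(1, 1, 1),
  gene_id = gene_id,
  gene_name = c("DDX11L1", "WASH7P", "MIR6859-1"),
  stringsAsFactors = FALSE
)

# Overwrite only the gene_id column; the other columns are left untouched
gene_annot_parsed$gene_id <- sub("\\..*$", "", gene_annot_parsed$gene_id)
print(gene_annot_parsed)
```

Because sub() replaces only the first match and that match starts at the first dot, an ID that happened to contain more than one dot would be cut at the first one.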