Search code examples
rregexgenetics

Seek table elements with a specific letter combination and remove the last few characters


I have a table of genes, some of which are duplicated. I'm not concerned about the duplication. The duplication comes from a transcript from Ensemble gene database. When this is reported, it includes the gene XXXX, then adds _ENST00000WWWWWW, this added portion is always 16 characters (where XXXXX is the gene name and WWWWWW are integers). This shows up as XXXX_ENST00000WWWWWW. There are simply too many instances to remove these manually. How can I loop through this, recognize this pattern, and remove those elements? Example below.

       Name              Gene.Name CDS.Mutation AA.Mutation
1       1A                 ASXL3     c.350G>T     p.S117I
2       1A ASXL3_ENST00000269197    c.1229G>T     p.S410I
3       1A              C9orf174      c.95G>A      p.R32Q
4       1A  CLTC_ENST00000269122    c.2128G>C     p.G710R
5       1A                COL4A4     c.274G>A      p.G92R
6       1A                CREBBP    c.4445A>G    p.Y1482C



       Name             Gene.Name CDS.Mutation AA.Mutation
1       1A                 ASXL3     c.350G>T     p.S117I
2       1A                 ASXL3    c.1229G>T     p.S410I
3       1A              C9orf174      c.95G>A      p.R32Q
4       1A                  CLTC    c.2128G>C     p.G710R
5       1A                COL4A4     c.274G>A      p.G92R
6       1A                CREBBP    c.4445A>G    p.Y1482C

This is kind of what I pulled together, but I may not know enough to understand this fully. I get a ton of errors on this.

raw <- read.xlsx("Mutations.xlsx")
for (i in 1:nrow(raw)){
  if (length(grep("ENST", raw[i,2])) == 1){
    raw[i,2] <- gsub('.{16}$','', raw[i,2])
  }
}

Solution

  • you can substitute:

     transform(df,Gene.Name=sub("_.*","",Gene.Name))
          Name Gene.Name CDS.Mutation AA.Mutation
        1   1A     ASXL3     c.350G>T     p.S117I
        2   1A     ASXL3    c.1229G>T     p.S410I
        3   1A  C9orf174      c.95G>A      p.R32Q
        4   1A      CLTC    c.2128G>C     p.G710R
        5   1A    COL4A4     c.274G>A      p.G92R
        6   1A    CREBBP    c.4445A>G    p.Y1482C