I have a table of genes, some of which are duplicated. I'm not concerned about the duplication. The duplication comes from a transcript from Ensemble gene database. When this is reported, it includes the gene XXXX, then adds _ENST00000WWWWWW, this added portion is always 16 characters (where XXXXX is the gene name and WWWWWW are integers). This shows up as XXXX_ENST00000WWWWWW. There are simply too many instances to remove these manually. How can I loop through this, recognize this pattern, and remove those elements? Example below.
Name Gene.Name CDS.Mutation AA.Mutation
1 1A ASXL3 c.350G>T p.S117I
2 1A ASXL3_ENST00000269197 c.1229G>T p.S410I
3 1A C9orf174 c.95G>A p.R32Q
4 1A CLTC_ENST00000269122 c.2128G>C p.G710R
5 1A COL4A4 c.274G>A p.G92R
6 1A CREBBP c.4445A>G p.Y1482C
Name Gene.Name CDS.Mutation AA.Mutation
1 1A ASXL3 c.350G>T p.S117I
2 1A ASXL3 c.1229G>T p.S410I
3 1A C9orf174 c.95G>A p.R32Q
4 1A CLTC c.2128G>C p.G710R
5 1A COL4A4 c.274G>A p.G92R
6 1A CREBBP c.4445A>G p.Y1482C
This is kind of what I pulled together, but I may not know enough to understand this fully. I get a ton of errors on this.
raw <- read.xlsx("Mutations.xlsx")
for (i in 1:nrow(raw)){
if (length(grep("ENST", raw[i,2])) == 1){
raw[i,2] <- gsub('.{16}$','', raw[i,2])
}
}
you can substitute:
transform(df,Gene.Name=sub("_.*","",Gene.Name))
Name Gene.Name CDS.Mutation AA.Mutation
1 1A ASXL3 c.350G>T p.S117I
2 1A ASXL3 c.1229G>T p.S410I
3 1A C9orf174 c.95G>A p.R32Q
4 1A CLTC c.2128G>C p.G710R
5 1A COL4A4 c.274G>A p.G92R
6 1A CREBBP c.4445A>G p.Y1482C