I can't get a regular expression task working; it would be great if someone could help.
I need to separate gene names from the descriptions that are attached to them. In 99% of cases the description starts with "GeneCards Summary", so splitting on that term works, using the tidyverse via gene <- str_split(DF$Gene, "GeneCards Summary", simplify = TRUE)
But now there are some that do not follow this pattern, exemplified here:
example <- c("STAT1Predisposition to Mucocutaneous Candidiasis",
"PMS2DNA Repair DefectsPMS2 Deficiency",
"FANCACombined ImmunodeficiencyFANCA",
"HAX1 This gene", "ELANE ELANE is a gene",
"IL1RNNon-Inflammasome Related", "PRKDCT-B- SCIDDNA PKcs",
"MSH6Severe Reduction", "AP3B1FHL Syndromes")
I was able to make out the following patterns. Hopefully this covers all of them (unlikely, but with your solution I should be able to handle the rest if they pop up):
1) Genename followed by a word that starts with an UPPERCASE letter and continues in lowercase (so separate this part from the part before).
2) GenenameDNA (separate "DNA" from the part before).
3) Genename followed by a space.
4) genenameT-B-.
5) genenameFHL.
Actually the trickiest is the UPPERCASE/lowercase part; the others I will try to solve and post here.
Thanks a lot for your help!
Sebastian
Here is part of my solution without the upper/lower one:
clean_1 <- str_split(example, "DNA", simplify = T)
clean_2 <- str_split(clean_1, "[[:blank:]]", simplify = T)
clean_3 <- str_split(clean_2, "T-B", simplify = T)
clean_4 <- str_split(clean_3, "FHL", simplify = T)
I would do this each round to get the data cleaned up but there is probably a better way to do this.
Assuming that your example is representative of all possibilities, one solution is: extract the first word in each string, then identify the cases where another word is attached to it (one upper-case letter followed by lower-case letters) and delete that attached part. To keep using package stringr:
library(stringr)
# Extract any characters before the first space:
fWord <- str_extract(example, '([^[:blank:]]+)')
# Find the index of strings that contain lower-case letters:
ind <- grep('[[:lower:]]', fWord)
# Select everything up to the first lower-case letter and remove the last character
# (the upper-case letter that starts the attached word):
fWord[ind] <- str_sub(str_extract(fWord[ind], '([^[:lower:]]+)' ), end = -2)
> fWord
[1] "STAT1" "PMS2DNA" "FANCA" "HAX1" "ELANE" "IL1RN"
[7] "PRKDCT-B-" "MSH6" "AP3B1FHL"
I'm pretty sure that this can be done in one line. Try to make your question clearer and someone will probably present some fancy regular expression that gets the job done.
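For what it's worth, here is a rough sketch of how the same steps might be collapsed into a single pattern. It assumes that gene names consist only of upper-case letters, digits and hyphens, and that "DNA", "T-B-" and "FHL" are the only glued-on tokens that need stripping; both assumptions come from the example alone. The names first_word and genes are just illustrative.

```r
library(stringr)

example <- c("STAT1Predisposition to Mucocutaneous Candidiasis",
             "PMS2DNA Repair DefectsPMS2 Deficiency",
             "FANCACombined ImmunodeficiencyFANCA",
             "HAX1 This gene", "ELANE ELANE is a gene",
             "IL1RNNon-Inflammasome Related", "PRKDCT-B- SCIDDNA PKcs",
             "MSH6Severe Reduction", "AP3B1FHL Syndromes")

# Lazily take upper-case letters, digits and hyphens from the start, stopping
# right before an attached Capitalised word, a whitespace character, or the
# end of the string (assumption: gene names use only these characters).
first_word <- str_extract(example, "^[A-Z0-9\\-]+?(?=[A-Z][a-z]|\\s|$)")

# Strip the fixed tokens seen so far when they are glued to the end
# (assumption: no real gene name ends in one of these tokens).
genes <- str_remove(first_word, "(DNA|T-B-|FHL)$")
```

On the example, first_word holds the same values as fWord above, and genes additionally has the "DNA", "T-B-" and "FHL" suffixes removed. Any new glued-on token that pops up would have to be added to the alternation in the second pattern.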