I have some texts that I need to split into tokens on whitespace. I also need to remove all punctuation, as well as everything inside double square brackets [[...]] (including the brackets themselves).
Each token will become a key in a dictionary, and each key will map to a list of values.
I have tried regexes and if-else chains to remove these double-bracket patterns, but I can't find a solution that works. For the moment I have:
tokenDic = dict()
splittedWords = re.findall(r'\[\[\s*([^][]*?)]]', docs[doc], re.IGNORECASE)
tokenStr = splittedWords.split()
for token in tokenStr:
    tokenDic[token].append(value)
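To remove everything inside [[...]], including the brackets, use re.sub. Your regex is already correct; the problem is that re.findall returns the matched contents as a list instead of removing them. Replace the matches with an empty string, then strip the remaining punctuation with a second re.sub: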
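import re

x = "[[hello]]w&o%r*ld^$"
# Remove every [[...]] span, including the brackets.
y = re.sub(r"\[\[\s*([^][]*?)]]", "", x)
# Remove everything that is not an ASCII letter or whitespace.
z = re.sub(r"[^a-zA-Z\s]", "", y)
print(z)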
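This prints "world". Note that the second pattern keeps only ASCII letters and whitespace, so digits are removed as well.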
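For the dictionary part of the question, here is a minimal sketch that combines both substitutions with whitespace splitting. It assumes docs is a dict mapping a document key to its text and that the value you want to store per token is that document key; neither is shown in the question, so the sample data and the helper names tokenize and build_token_dict are hypothetical. A collections.defaultdict(list) avoids the KeyError that a plain dict raises when you append to a key that does not exist yet.

```python
import re
from collections import defaultdict

# Matches [[...]] spans, including the brackets (pattern taken from the question).
BRACKETED = re.compile(r"\[\[\s*([^][]*?)]]")
# Matches anything that is not an ASCII letter or whitespace (digits are removed too).
NON_LETTER = re.compile(r"[^a-zA-Z\s]")


def tokenize(text):
    """Drop [[...]] spans and punctuation, then split on whitespace."""
    without_brackets = BRACKETED.sub("", text)
    letters_only = NON_LETTER.sub("", without_brackets)
    return letters_only.split()


def build_token_dict(docs):
    """Map each token to a list of document keys (an assumed choice of value)."""
    token_dic = defaultdict(list)  # missing keys start out as an empty list
    for doc_id, text in docs.items():
        for token in tokenize(text):
            token_dic[token].append(doc_id)
    return token_dic


# Hypothetical sample data; the real docs mapping is not shown in the question.
docs = {"a": "[[hello]]w&o%r*ld^$ foo, bar!", "b": "foo [[skip me]] baz."}
token_dic = build_token_dict(docs)
print(dict(token_dic))
```

If you need to keep digits or accented letters as tokens, you would have to widen the second pattern accordingly, for example by adding 0-9 to the character class.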