
Python 2: tokenizing text and adding the tokens to a dictionary


I have some texts that I need to split into tokens on whitespace. I also need to remove all punctuation, and to remove everything inside double square brackets [[...]] (including the brackets themselves).

Each token will become a key in a dictionary, mapped to a list of values.

I have tried regular expressions and if/else chains to remove these double-bracket patterns, but I can't find a solution that works. This is what I have so far:

tokenDic = dict()
splittedWords =  re.findall(r'\[\[\s*([^][]*?)]]',  docs[doc], re.IGNORECASE) 
tokenStr = splittedWords.split()

for token in tokenStr:
    tokenDic[token].append(value);

Solution

  • To remove everything inside [[...]] (brackets included), use re.sub instead of re.findall; the regex you already have is correct. A second re.sub then strips the remaining punctuation:

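     import re

     x = "[[hello]]w&o%r*ld^$"
     # Drop every [[...]] span, brackets included
     y = re.sub(r"\[\[\s*([^][]*?)]]", "", x)
     # Drop everything that is not a letter or whitespace
     z = re.sub(r"[^a-zA-Z\s]", "", y)
     print(z)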
    

    This prints "world"
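
    The question also asks how to store each token as a dictionary key with a list of values. Calling tokenDic[token].append(value) on a plain dict raises KeyError the first time a token is seen, so the following is a minimal sketch of one way around that using collections.defaultdict. The helper names tokenize and build_index, the sample docs, and the choice of the document id as the stored value are all assumptions made for illustration; the original post does not say what value should hold.

```python
import re
from collections import defaultdict

# Same idea as the two re.sub calls above, compiled once for reuse.
BRACKETED = re.compile(r"\[\[[^][]*]]")    # a [[...]] span, brackets included
NON_LETTERS = re.compile(r"[^a-zA-Z\s]")   # anything but letters and whitespace


def tokenize(text):
    """Strip [[...]] spans and punctuation, then split on whitespace."""
    without_brackets = BRACKETED.sub("", text)
    letters_only = NON_LETTERS.sub("", without_brackets)
    return letters_only.split()


def build_index(docs):
    """Map each token to the list of document ids it occurs in (assumed value)."""
    token_dic = defaultdict(list)  # missing keys start out as an empty list
    for doc_id, text in docs.items():
        for token in tokenize(text):
            token_dic[token].append(doc_id)
    return token_dic


# Hypothetical sample data, not taken from the original post.
docs = {
    "doc1": "[[hello]] w&o%r*ld^$ again",
    "doc2": "world, [[skip me]] hello!",
}
index = build_index(docs)
print(sorted(index.keys()))
```

    With a plain dict, tokenDic.setdefault(token, []).append(value) has the same effect. Note that replacing a [[...]] span with the empty string joins the text on either side of it; substituting a space instead would keep neighbouring words apart.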