python · tokenize · spacy

Tokenize a String without spaces using a custom tokenizer in Spacy


I would like to tokenize a chemical expression called SMILES, for example, [c]1ccc(C(=O)Nc2ccc(Br)cc2)cc1[N+](=O)[O-].C[NH]. There are no spaces in the string, and after tokenization we should get [c], 1, c, c, c, (, C, (, =, O, ), N, c, 2, c, c, c, (, Br, ), c, c, 2, ), c, c, 1, [N+], (, =, O, ), [O-], ., C, [NH]. Some special tokens have more than one character, such as [c], Br and [N+], and they should not be split; all other tokens are single characters, such as c, ( and N. How can I achieve this with a tokenizer from spaCy? If spaCy is not needed here and a plain Python snippet can do it, that would also be acceptable. Any help would be highly appreciated!


Solution

  • I think the regex for this is quite simple:

    import re

    s = "[c]1ccc(C(=O)Nc2ccc(Br)cc2)cc1[N+](=O)[O-].C[NH]"
    tokens = re.findall(r"\[.+?\]|Br|.", s)  # bracketed tokens, then Br, then single chars
    

    I think this does what you want: the alternatives are tried left to right, so [c], [N+], [O-] and [NH] stay intact, Br is kept as a single token, and everything else falls back to one character per token.
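
  • If you do want this inside a spaCy pipeline, you can plug the same regex into a custom tokenizer that builds the Doc directly instead of splitting on whitespace. A minimal sketch, assuming spaCy v3; the SmilesTokenizer name and the SMILES_TOKEN pattern are just illustrative:

    import re
    import spacy
    from spacy.tokens import Doc

    SMILES_TOKEN = re.compile(r"\[.+?\]|Br|.")

    class SmilesTokenizer:
        """Splits a SMILES string with the regex and returns a spaCy Doc."""
        def __init__(self, vocab):
            self.vocab = vocab

        def __call__(self, text):
            words = SMILES_TOKEN.findall(text)
            # SMILES has no whitespace, so every token gets spaces=False
            return Doc(self.vocab, words=words, spaces=[False] * len(words))

    nlp = spacy.blank("en")
    nlp.tokenizer = SmilesTokenizer(nlp.vocab)

    doc = nlp("[c]1ccc(C(=O)Nc2ccc(Br)cc2)cc1[N+](=O)[O-].C[NH]")
    print([t.text for t in doc])

    This should print the token list from your question, and because all spaces flags are False, doc.text reconstructs the original SMILES string.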