I would like to tokenize some chemical expressions called SMILES, for example [c]1ccc(C(=O)Nc2ccc(Br)cc2)cc1[N+](=O)[O-].C[NH]. There are no spaces in the string, and after tokenization we should get [c], 1, c, c, c, (, C, (, =, O, ), N, c, 2, c, c, c, (, Br, ), c, c, 2, ), c, c, 1, [N+], (, =, O, ), [O-], ., C, [NH], which means some special tokens have more than one character, such as [c], Br, and [N+], and they should not be split. Apart from these, the other tokens have only one character, such as c, (, and N.
How can I achieve this with a tokenizer from spaCy? If spaCy is not needed here and a snippet of plain Python can do this, that would also be acceptable. Any help would be highly appreciated!
I think the regex for this is quite easy:
import re
s = "[c]1ccc(C(=O)Nc2ccc(Br)cc2)cc1[N+](=O)[O-].C[NH]"
# Match bracketed atoms first, then the two-letter symbol Br, then any single character.
tokens = re.findall(r"\[.+?\]|Br|.", s)
I believe that does what you want. If your data contains other two-letter element symbols (e.g. Cl), add them to the alternation the same way Br is handled.
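If you specifically want the result as a spaCy Doc, you can wrap the same regex in a custom tokenizer and assign it to the pipeline. This is only a minimal sketch, assuming spaCy v3; the SMILESTokenizer class name is my own, and the regex alone is enough if a plain list of strings is all you need.
import re
import spacy
from spacy.tokens import Doc

SMILES_PATTERN = re.compile(r"\[.+?\]|Br|.")

class SMILESTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Every regex match becomes one spaCy token; there is no whitespace between tokens.
        words = SMILES_PATTERN.findall(text)
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

nlp = spacy.blank("en")
nlp.tokenizer = SMILESTokenizer(nlp.vocab)

doc = nlp("[c]1ccc(C(=O)Nc2ccc(Br)cc2)cc1[N+](=O)[O-].C[NH]")
print([t.text for t in doc])
Replacing nlp.tokenizer with any callable that returns a Doc is the documented way to bypass spaCy's default rule-based tokenizer, so the rest of a pipeline (taggers, custom components, etc.) can still run on top of these SMILES tokens.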