
Does Stanford CoreNLP support Russian sentence and word tokenization?


I could not find a pre-trained Russian tokenizer in Stanford NLP or Stanford CoreNLP. Are there any models for Russian yet?


Solution

  • Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.

    You can use Stanza (https://stanfordnlp.github.io/stanza/), our Python package, to get Russian tokenization and sentence splitting.

    In theory, you could tokenize and sentence split with Stanza, and then pass the result to the Stanford CoreNLP Server (which you can also use via Stanza) if there are any CoreNLP-specific components you want to work with.

    A group a while back submitted some models for Russian, but I don't see anything for tokenization.

    The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
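
To make the Stanza suggestion concrete, here is a minimal sketch of Russian tokenization and sentence splitting. It assumes Stanza is installed (`pip install stanza`) and that the default Russian models can be downloaded on first use. The helper names `build_russian_tokenizer`, `sentences_as_tokens`, and `main` are made up for this example and are not part of Stanza.

```python
def build_russian_tokenizer():
    """Download (if needed) and load a tokenize-only Stanza pipeline for Russian."""
    import stanza  # third-party package, assumed installed: pip install stanza

    # Fetch only the tokenizer model for Russian ("ru").
    stanza.download("ru", processors="tokenize")
    # The "tokenize" processor performs both tokenization and sentence splitting.
    return stanza.Pipeline(lang="ru", processors="tokenize")


def sentences_as_tokens(doc):
    """Return one list of token strings per sentence of a Stanza Document."""
    return [[token.text for token in sentence.tokens] for sentence in doc.sentences]


def main():
    # Illustrative input text; any Russian string works here.
    nlp = build_russian_tokenizer()
    doc = nlp("Привет, мир! Как дела?")
    for index, tokens in enumerate(sentences_as_tokens(doc), start=1):
        print(index, tokens)
```

Calling `main()` should print one numbered token list per detected sentence. The nested lists returned by `sentences_as_tokens` are a plain-Python form that is easy to hand to other tools, including CoreNLP components, if you go the route described above.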