stanford-nlp

Next release of Stanza


I'm interested in the Stanza constituency parser for Italian. The documentation at https://stanfordnlp.github.io/stanza/constituency.html says that a new release with updated models (including an Italian model trained on the Turin treebank) should have been available in mid-November. Any idea when the next release of Stanza will appear? Thanks, Alberto


Solution

  • Technically you can already get it! If you install the dev branch of stanza, you should be able to download an IT parser.

    pip install git+https://github.com/stanfordnlp/stanza.git@704d90df2418ee199d83c92c16de180aacccf5c0

    import stanza
    stanza.download("it")
    

    It's trained on the Turin treebank, which has about 4000 trees. If you download the BERT version of the model, it gets over 91 F1 on the EVALITA test set (but has a length limit of about 200 words per sentence).

    We might splurge on getting the VIT treebank or something. I've been agitating that we use that budget on Danish or PT or some other language where we have very few users, but it's a hard sell...

    Edit: there are also some scripts included for converting the publicly available Turin trees into brackets. Their MWT annotation style was to repeat the MWT twice in a row, which doesn't work too well for a task like parsing raw text.
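
Once the Italian models are downloaded, running the parser could look roughly like the sketch below. It assumes the standard `stanza.Pipeline` interface with the `constituency` processor, which needs `tokenize`, `mwt` and `pos` for Italian. The helper `within_length_limit` is a hypothetical convenience, not part of Stanza: it approximates the roughly 200-word limit mentioned above by counting whitespace-separated words, which will not match the model's own tokenization exactly.

```python
def within_length_limit(sentence: str, max_words: int = 200) -> bool:
    """Rough pre-check against the ~200-word limit mentioned in the answer.

    Hypothetical helper: counts whitespace-separated words, which only
    approximates what the parser actually sees after tokenization.
    """
    return len(sentence.split()) <= max_words


def parse_italian(text: str):
    """Return one constituency tree per sentence in `text`.

    Requires stanza to be installed and the Italian models downloaded.
    """
    # Imported lazily so the length helper above works without stanza.
    import stanza

    # The constituency processor depends on tokenize, mwt and pos for Italian.
    nlp = stanza.Pipeline(lang="it", processors="tokenize,mwt,pos,constituency")
    doc = nlp(text)
    return [sentence.constituency for sentence in doc.sentences]
```

Calling `parse_italian("Il gatto dorme sul divano.")` should then give a list with one bracketed tree, assuming the download above succeeded. For repeated calls it would be cheaper to build the pipeline once and reuse it.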