machine-learning · training-data · pos-tagger · syntaxnet · dependency-parsing

How much data is required to train SyntaxNet?


I know that the more data, the better, but what would be a reasonable minimum amount of data to train SyntaxNet?


Solution

Based on some trial and error, I have arrived at the following minimums:

  • Train corpus - 18,000 tokens (anything less than that and step 2, Preprocessing with the Tagger, fails)
  • Test corpus - 2,000 tokens (anything less than that and step 2, Preprocessing with the Tagger, fails)
  • Dev corpus - 2,000 tokens

Please note that with this I've only managed to get the steps in the NLP pipeline to run; I haven't actually managed to get anything usable out of it. A quick way to check whether your corpora meet these token counts is sketched below.
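As a sanity check before training, you can count the tokens in each corpus file. This is a minimal sketch, assuming your corpora are in CoNLL-style format (one token per non-blank, non-comment line, as SyntaxNet's training data typically is); the file names are hypothetical placeholders for your own paths.

```python
# Minimal sketch: verify that CoNLL-style corpus files meet the rough
# token minimums above. Assumes one token per non-blank, non-comment line;
# CoNLL-U multiword-token range lines (IDs like "1-2") are not filtered out.
def count_tokens(path):
    """Count token lines in a CoNLL-style file (blank lines separate sentences)."""
    tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                tokens += 1
    return tokens

if __name__ == "__main__":
    # Hypothetical file names; substitute your own corpus paths.
    minimums = {"train.conll": 18000, "test.conll": 2000, "dev.conll": 2000}
    for path, minimum in minimums.items():
        n = count_tokens(path)
        status = "OK" if n >= minimum else "below minimum"
        print(f"{path}: {n} tokens ({status}, need >= {minimum})")
```

If a file comes in under its minimum, the tagger preprocessing step is where I saw the pipeline fail, so it's worth checking before kicking off a run.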