I'm trying to train the Stanford Neural Network Dependency Parser to check phrase similarity.
This is the command I ran:
java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile trainPath -devFile devPath -embedFile wordEmbeddingFile -embeddingSize wordEmbeddingDimensionality -model modelOutputFile.txt.gz
It produced the following output and error:
Train File: C:\Users\rohit\Downloads\CoreNLP-master\CoreNLP-master\data\edu\stanford\nlp\parser\trees\en-onetree.txt
Dev File: null
Model File: modelOutputFile.txt.gz
Embedding File: null
Pre-trained Model File: null
################### Train
#Trees: 1
0 tree(s) are illegal (0.00%).
1 tree(s) are legal but have multiple roots (100.00%).
0 tree(s) are legal but not projective (0.00%).
###################
#Word: 3
#POS:3
#Label: 2
###################
#Transitions: 3
#Labels: 1
ROOTLABEL: null
Random generator initialized with seed 1459831358061
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.parser.nndep.Util.scaling(Util.java:49)
at edu.stanford.nlp.parser.nndep.DependencyParser.readEmbedFile(DependencyParser.java:636)
at edu.stanford.nlp.parser.nndep.DependencyParser.setupClassifierForTraining(DependencyParser.java:787)
at edu.stanford.nlp.parser.nndep.DependencyParser.train(DependencyParser.java:676)
at edu.stanford.nlp.parser.nndep.DependencyParser.main(DependencyParser.java:1247)
The help text embedded in the code says that the training file should be a "Path to a training treebank in CoNLL-X format".
Does anyone know where I can find CoNLL-X training data? I supplied a training file but no embedding file and got this error. My guess is that it might work if I also supply an embedding file.
Please shed some light on which training file and embedding file I should use, and where I can find them.
CoNLL-X treebanks
Training data for Danish, Dutch, Portuguese, and Swedish is available for free here. For other languages, you'll probably need to license a treebank from LDC, unfortunately (details for many languages are on that page).
Universal Dependencies treebanks are distributed in CoNLL-U format, which can usually be converted to CoNLL-X format with some work.
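As a rough illustration of what that conversion involves, here is a minimal Python sketch. It assumes the standard 10-column CoNLL-U layout and does three things: drops comment lines, drops multiword-token ranges (IDs like 1-2) and empty nodes (IDs like 1.1), and blanks the last two columns, which mean something different in CoNLL-X. The function name conllu_to_conllx is made up for this example, and copying UPOS into the POSTAG column when XPOS is empty is my own assumption, not something either format requires.

```python
def conllu_to_conllx(lines):
    """Convert an iterable of CoNLL-U lines into a list of CoNLL-X lines.

    Sketch only: assumes well-formed, tab-separated, 10-column CoNLL-U input.
    """
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # CoNLL-U sentence-level comments have no CoNLL-X equivalent.
            continue
        if not line.strip():
            # Keep exactly one blank line between sentences.
            if out and out[-1] != "":
                out.append("")
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Multiword-token ranges and empty nodes are not plain tokens.
            continue
        if cols[4] == "_":
            # Assumption: fall back to UPOS when no language-specific tag exists.
            cols[4] = cols[3]
        # DEPS and MISC become PHEAD and PDEPREL, left unspecified here.
        cols[8] = "_"
        cols[9] = "_"
        out.append("\t".join(cols))
    if out and out[-1] != "":
        # Terminate the final sentence with a blank line.
        out.append("")
    return out
```

To use it on a file, read the CoNLL-U file line by line, pass the lines to the function, and write the result joined with newlines. Treebanks with many multiword tokens or non-integer heads may need more care than this sketch provides.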
Lastly, there's a large list of treebanks and their availability on this page. You should be able to convert many of the dependency treebanks in this list into CoNLL-X format if they're not already in that format.
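Before training on a file you've converted, it can be worth checking that it really looks like CoNLL-X. The sketch below is a hypothetical helper (summarize_conllx is not part of any Stanford tool) that counts sentences, counts sentences with more than one token whose HEAD is 0 (that is, multiple roots), and records the line numbers that do not have 10 tab-separated columns with an integer HEAD. It assumes HEAD is the seventh column, as in the CoNLL-X layout.

```python
def summarize_conllx(lines):
    """Return basic sanity statistics for CoNLL-X formatted lines."""
    sentences = 0
    multi_root = 0
    bad_lines = []
    tokens = 0  # well-formed tokens seen in the current sentence
    roots = 0   # tokens in the current sentence whose HEAD is 0

    def close_sentence():
        nonlocal sentences, multi_root, tokens, roots
        if tokens:
            sentences += 1
            if roots > 1:
                multi_root += 1
        tokens = 0
        roots = 0

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            close_sentence()
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[6].isdigit():
            # Not a CoNLL-X token line; remember where it was.
            bad_lines.append(number)
            continue
        tokens += 1
        if int(cols[6]) == 0:
            roots += 1
    # Handle a file that does not end with a blank line.
    close_sentence()
    return {"sentences": sentences, "multi_root": multi_root, "bad_lines": bad_lines}
```

A file in some other format, such as a bracketed constituency tree, would show up here as a list of malformed lines rather than as sentences.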
Training the Stanford Neural Net Dependency parser
From this page: The embedding file is optional, but the treebank is not. The best treebank and embedding files to use depend on which language and type of text you'd like to parse. Ideally, you would train on as much data as possible in the domain/genre that you're trying to parse.
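To make the "optional versus required" distinction concrete, here is a small Python sketch that assembles the training command from the question, adding the embedding-related flags only when they are supplied. The function name build_train_command is hypothetical. The flags are the ones used in the question's command, and I'm assuming the CoreNLP classes are already on the Java classpath.

```python
def build_train_command(train_file, model_file, dev_file=None,
                        embed_file=None, embedding_size=None):
    """Build the argument list for training the dependency parser.

    Sketch only: uses the flags shown in the question and assumes the
    CoreNLP jars are already on the classpath.
    """
    if not train_file:
        # The treebank is the one input that cannot be omitted.
        raise ValueError("a CoNLL-X training treebank is required")
    cmd = [
        "java", "edu.stanford.nlp.parser.nndep.DependencyParser",
        "-trainFile", train_file,
        "-model", model_file,
    ]
    if dev_file:
        cmd += ["-devFile", dev_file]
    if embed_file:
        # Pre-trained word embeddings are optional.
        cmd += ["-embedFile", embed_file]
    if embedding_size is not None:
        cmd += ["-embeddingSize", str(embedding_size)]
    return cmd
```

The returned list can be passed to subprocess.run. The file names you pass in are placeholders for your own treebank and embedding files.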