Trouble using the Stanford Arabic Segmenter

I'm having trouble running the Stanford Arabic segmenter on Windows 10. Whenever I run the command given in the readme file, it fails to load the segmenter model data/arabic-segmenter-atb+bn+arztrain.ser.gz.

I'm not very familiar with Java, so I don't know whether I understood the classpath setup correctly; I suspect I didn't. I also find the readme instructions slightly confusing. This is the output I get:

Loaded ArabicTokenizer with options: null
loadClassifier=data/arabic-segmenter-atb+bn+arztrain.ser.gz
textFile=C:\Users\vmumm\OneDrive\Ulmo\Nizar\OLD\complete_NQ_new_April2019.txt
featureFactory=edu.stanford.nlp.international.arabic.process.StartAndEndArabicSegmenterFeatureFactory
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Failed to load segmenter data/arabic-segmenter-atb+bn+arztrain.ser.gz
        at edu.stanford.nlp.international.arabic.process.ArabicSegmenter.loadSegmenter(ArabicSegmenter.java:466)
        at edu.stanford.nlp.international.arabic.process.ArabicSegmenter.getSegmenter(ArabicSegmenter.java:629)
        at edu.stanford.nlp.international.arabic.process.ArabicSegmenter.main(ArabicSegmenter.java:532)
Caused by: java.io.IOException: Unable to open "data/arabic-segmenter-atb+bn+arztrain.ser.gz" as class path, filename or URL
        at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:480)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1503)
        at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2939)
        at edu.stanford.nlp.international.arabic.process.ArabicSegmenter.loadSegmenter(ArabicSegmenter.java:464)

I think I just need a simple guide to running the segmenter, written for someone who doesn't usually work with Java.

Solution

I'd recommend downloading the full Stanford CoreNLP package.

Download Stanford CoreNLP from here: https://stanfordnlp.github.io/CoreNLP/download.html

After unpacking, you should end up with a directory like:

C:\Users\myusername\stanford-corenlp-full-2018-10-05

Download the Arabic models jar from the same page and move it into the Stanford CoreNLP directory C:\Users\myusername\stanford-corenlp-full-2018-10-05.

Set CLASSPATH so that it includes every *.jar file in that directory:
```
set CLASSPATH=C:\Users\myusername\stanford-corenlp-full-2018-10-05\*;
```
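
If you want to confirm that the CLASSPATH setting took effect, a small check like the following may help. It is only a sketch using the standard Java library, not part of CoreNLP: the class name ClasspathCheck is made up for illustration, and it assumes the Arabic models jar has "arabic" somewhere in its file name.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ClasspathCheck {

    // Returns the class path entries whose file name contains `fragment`
    // (case-insensitive). `classPath` is a separator-delimited list of entries.
    public static List<String> entriesContaining(String classPath, String fragment) {
        List<String> hits = new ArrayList<>();
        String needle = fragment.toLowerCase();
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue; // e.g. produced by a doubled ';'
            }
            String fileName = new File(entry).getName().toLowerCase();
            if (fileName.contains(needle)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The java launcher expands a trailing \* into the individual jar
        // files, so this property lists the jars Java will actually search.
        String classPath = System.getProperty("java.class.path");
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            System.out.println(entry);
        }
        // Assumption: the Arabic models jar has "arabic" in its file name.
        List<String> arabic = entriesContaining(classPath, "arabic");
        System.out.println(arabic.isEmpty()
                ? "No jar with \"arabic\" in its name found on the class path."
                : "Arabic models jar candidate(s): " + arabic);
    }
}
```

Compile it with javac ClasspathCheck.java and run it with java ClasspathCheck from the same command prompt in which you set CLASSPATH. The current directory also has to be on the class path for java to find the compiled class. If no Arabic jar shows up in the list, the models jar is probably not in the CoreNLP directory.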

Run a pipeline on some example text (run this command from the directory that contains the example file):

java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-arabic.properties -file example.txt -outputFormat text

You should get segmented output in example.txt.out when this command finishes.
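
If the command instead complains that it cannot find the properties file or a model, one way to narrow it down is to ask Java whether the file is visible on the class path at all. The sketch below uses only the standard library; the class name ResourceCheck is hypothetical, and it assumes that StanfordCoreNLP-arabic.properties is meant to be picked up from the class path (I would expect it to be packaged in the Arabic models jar, but treat that as an assumption).

```java
import java.net.URL;

public class ResourceCheck {

    // Returns the location of `name` on the class path, or null if no
    // class path entry (directory or jar) contains it.
    public static URL locate(String name) {
        return ClassLoader.getSystemResource(name);
    }

    public static void main(String[] args) {
        // Default name taken from the -props argument of the command above;
        // pass a different name as the first argument to look that up instead.
        String name = args.length > 0 ? args[0] : "StanfordCoreNLP-arabic.properties";
        URL location = locate(name);
        System.out.println(location == null
                ? name + " was NOT found on the class path"
                : name + " found at " + location);
    }
}
```

Run it from the same command prompt as the pipeline command. A "NOT found" result suggests the CLASSPATH setting or the location of the models jar is the problem rather than the pipeline command itself.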

I don't have access to a Windows machine, so if this answer doesn't work, please let me know and I'll fix it. I'll try to put up some documentation on our site about working with Windows.