
How should we use the setDictionary for the lemmatization annotator in Spark-NLP?


I have a requirement where I have to add a dictionary in the lemmatization step. While trying to use it in a pipeline and calling pipeline.fit() I get an ArrayIndexOutOfBoundsException. What is the correct way to implement this? Are there any examples?

I am passing token as the input column for the lemmatizer and lemma as the output column. The following is my code:

    // DocumentAssembler annotator
    val document = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    // SentenceDetector annotator
    val sentenceDetector = new SentenceDetector()
        .setInputCols("document")
        .setOutputCol("sentence")
    // tokenizer annotator
    val token = new Tokenizer()
        .setInputCols("sentence")
        .setOutputCol("token")
    import com.johnsnowlabs.nlp.util.io.ExternalResource
    // lemmatizer annotator
    val lemmatizer = new Lemmatizer()
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
        .setDictionary(ExternalResource("C:/data/notebook/lemmas001.txt","LINE_BY_LINE",Map("keyDelimiter"->",","valueDelimiter"->"|")))
    val pipeline = new Pipeline().setStages(Array(document,sentenceDetector,token,lemmatizer))
    val result= pipeline.fit(df).transform(df)

The error message is:

    Name: java.lang.ArrayIndexOutOfBoundsException
    Message: 1
    StackTrace:   at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:315)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:312)
      at scala.collection.Iterator$class.foreach(Iterator.scala:891)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$.flattenRevertValuesAsKeys(ResourceHelper.scala:312)
      at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:52)
      at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:19)
      at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:45)
      at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
      at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
      at scala.collection.Iterator$class.foreach(Iterator.scala:891)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
      at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
      at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
      at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)

Solution

  • Your pipeline looks good to me, so everything depends on what is inside lemmas001.txt and whether you are able to access it on Windows.

    NOTE: I have seen users on Windows use paths like this inside Apache Spark:

    "C:\\Users\\something\\Desktop\\someDirectory\\somefile.txt"
    

    Training a Lemmatizer in Spark NLP is simple:

    val lemmatizer = new Lemmatizer()
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
        .setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")
    

    The file must have the following format, where the keyDelimiter in this case is -> and the valueDelimiter is \t:

    abnormal    ->  abnormal    abnormals
    abode   ->  abode   abodes
    abolish ->  abolishing  abolished   abolish abolishes
    abolitionist    ->  abolitionist    abolitionists
    abominate   ->  abominate   abominated  abominates
    abomination ->  abomination abominations
    aboriginal  ->  aboriginal  aboriginals
    aborigine   ->  aborigines  aborigine
    abort   ->  aborted abort   aborts  aborting
    abortifacient   ->  abortifacients  abortifacient
    abortionist ->  abortionist abortionists
    abortion    ->  abortion    abortions
    abo ->  abo abos
    abotrite    ->  abotrites   abotrite
    abound  ->  abound  abounds abounding   abounded
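
    If you keep the delimiters from your ExternalResource call, the same training can be written with that simpler setDictionary overload. This is only a sketch, under the assumption that lemmas001.txt really uses "," as the key delimiter and "|" as the value delimiter:

    // Sketch: dictionary training with the delimiters from the question.
    // Each line of lemmas001.txt would then have to look like, for example:
    //   abolish,abolishing|abolished|abolish|abolishes
    // A line that does not contain the key delimiter at all is one common way
    // to end up with the ArrayIndexOutOfBoundsException: 1 from the stack trace.
    val lemmatizerCustom = new Lemmatizer()
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
        .setDictionary("C:/data/notebook/lemmas001.txt", ",", "|")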
    

    Also, if you don't want to train your own Lemmatizer, you can use the pre-trained models as follows:

    English

    val lemmatizer = LemmatizerModel.pretrained(name="lemma_antbnc", lang="en")
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
    

    French

    val lemmatizer = LemmatizerModel.pretrained(name="lemma", lang="fr")
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
    

    Italian

    val lemmatizer = LemmatizerModel.pretrained(name="lemma", lang="it")
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
    

    German

    val lemmatizer = LemmatizerModel.pretrained(name="lemma", lang="de")
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
    

    List of all pre-trained models is here: https://nlp.johnsnowlabs.com/docs/en/models

    List of all pre-trained pipelines is here: https://nlp.johnsnowlabs.com/docs/en/pipelines
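
    As a rough sketch of the pre-trained pipeline route (the explain_document_ml name below is just one English pipeline from that list; it includes a lemma stage), assuming your DataFrame has a text column:

    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

    // Sketch: download and apply a ready-made pipeline instead of building stages.
    val explainPipeline = PretrainedPipeline("explain_document_ml", lang = "en")
    val annotated = explainPipeline.transform(df)   // expects a column named "text"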

    Please let me know in the comment if you have more questions.

    Full disclosure: I am one of the contributors of Spark NLP library.

    Update: I found this example for you in Scala on Databricks in case you are interested (this is actually how the Italian Lemmatizer model was trained).