As I am a beginner with Spark NLP, I started doing some hands-on exercises using the functions presented by johnsnowlabs.
I am using Scala on Databricks, working with a large text file.
First I import the necessary libraries and load the data as follows:
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
// Keep the file as an RDD (no collect()) so it can be filtered and split in parallel
val bookRDD = sc.textFile("/FileStore/tables/84_0-5b1ef.txt")
val words = bookRDD.filter(x => x.length > 0).flatMap(line => line.split("""\W+"""))
val rddD = words.toDF("text")
How do I use the different annotators available in johnsnowlabs for my purpose?
For example, if I want to find stop words, I can use
val stopWordsCleaner = new StopWordsCleaner()
.setStopWords(Array("this", "is", "and"))
But I have no idea how to apply this to find the stop words in my text file. Do I need to use a pre-trained model with the annotator?
I have found it very difficult to find a good tutorial on this, so I would be grateful if someone could provide some useful hints.
StopWordsCleaner is the annotator to use to remove stop words.
Refer: Annotators
Stop words may differ for your text depending on your context, but generally every NLP engine has a set of stop words that it matches and removes.
In JSL spark-nlp, you can also set your own stop words with setStopWords when using StopWordsCleaner.
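As a rough sketch of how this could fit together: StopWordsCleaner works on tokens, so it is usually placed in a pipeline after a DocumentAssembler and a Tokenizer. The example below assumes a Databricks notebook where `spark` is already defined, and it uses a small hypothetical `sampleDF` with a `text` column instead of your file. The column names `document`, `token` and `cleanTokens` are just illustrative choices.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{StopWordsCleaner, Tokenizer}
import org.apache.spark.ml.Pipeline
import spark.implicits._ // assumes the notebook-provided SparkSession named `spark`

// Hypothetical sample data; replace with a DataFrame that has one line of your book per row
val sampleDF = Seq("This is a sample sentence and nothing more").toDF("text")

// Turn the raw text column into Spark NLP "document" annotations
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Split each document into tokens
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Drop every token that matches one of the configured stop words
val stopWordsCleaner = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setStopWords(Array("this", "is", "and"))
  .setCaseSensitive(false)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, tokenizer, stopWordsCleaner))

// None of these stages is trained on the data, but fit() is still needed to build the PipelineModel
val result = pipeline.fit(sampleDF).transform(sampleDF)

// Show only the surviving token strings
result.selectExpr("cleanTokens.result").show(false)
```

Note that this annotator removes stop words rather than listing them. If you want to see which words were dropped, one option is to compare the `token` and `cleanTokens` columns of the result.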