I am trying to select only specific words from tokenized output in Apache Spark. Basically, what I want to achieve is the opposite of the StopWordsRemover feature in Spark MLlib. For example:
// Removes the listed words from the tokenized "BrokenDown" column
StopWordsRemover stopWords = new StopWordsRemover();
stopWords.setInputCol("BrokenDown");
stopWords.setOutputCol("Filtered");
stopWords.setStopWords(new String[]{"cashback","rs","minimum"});
// Splits DealDescription on whitespace into the "BrokenDown" column
Tokenizer tokenizer = new Tokenizer().setInputCol("DealDescription").setOutputCol("BrokenDown");
DataFrame dfTemp2 = tokenizer.transform(dfTemp1.select("Deals.MerchantName","Deals.DealDescription")
    .filter(lower(col("DealDescription")).contains("cashback")));
The above code removes the words 'cashback', 'rs' and 'minimum'; however, what I want is to keep only these words and remove everything else that does not match.
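To make the expected behaviour concrete, here is a minimal plain-Java sketch of what I mean (the token values and variable names are just for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

List<String> tokens = Arrays.asList("get", "cashback", "on", "a", "minimum", "purchase", "of", "rs", "500");
Set<String> keepWords = new HashSet<>(Arrays.asList("cashback", "rs", "minimum"));

// Keep only the whitelisted words, drop everything else
List<String> kept = tokens.stream()
        .filter(keepWords::contains)
        .collect(Collectors.toList());
// kept -> [cashback, minimum, rs]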
Spark version: 1.6.0
Kindly suggest.
I was able to find a way to do this:
I used RegexTokenizer with its setPattern() method.
// With gaps=false the pattern matches the tokens themselves rather than the delimiters between them
RegexTokenizer tokenizer = new RegexTokenizer().setInputCol("DealDescription").setOutputCol("BrokenDown")
    .setGaps(false).setPattern("cashback|rs|minimum");
With this I was able to select only the words 'cashback', 'rs' and 'minimum'.
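For completeness, a sketch of how this plugs into the earlier code (dfTemp1 being the same DataFrame as in the question). Wrapping the pattern in word boundaries is an optional tweak, not part of the original answer, so that 'rs' is not also matched inside longer words such as 'offers':

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.DataFrame;

// \b anchors the alternatives so only whole words are extracted
RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("DealDescription")
        .setOutputCol("BrokenDown")
        .setGaps(false)
        .setPattern("\\b(cashback|rs|minimum)\\b");

DataFrame dfTemp2 = tokenizer.transform(dfTemp1.select("Deals.MerchantName", "Deals.DealDescription")
        .filter(lower(col("DealDescription")).contains("cashback")));
// "BrokenDown" now holds only the whitelisted words for each deal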