
Selecting Only Specific Words from tokenized output [Spark]


I am trying to select only specific words from tokenized output in Apache Spark. Basically, what I want is the opposite of the StopWordsRemover feature in Spark MLlib. For example:

        // Split DealDescription into tokens.
        Tokenizer tokenizer = new Tokenizer().setInputCol("DealDescription").setOutputCol("BrokenDown");

        DataFrame dfTemp2 = tokenizer.transform(dfTemp1.select("Deals.MerchantName","Deals.DealDescription")
                                        .filter(lower(col("DealDescription")).contains("cashback")));

        // Drop the listed words from the tokens.
        StopWordsRemover stopWords = new StopWordsRemover();

        stopWords.setInputCol("BrokenDown");
        stopWords.setOutputCol("Filtered");
        stopWords.setStopWords(new String[]{"cashback","rs","minimum"});

        // dfFiltered is an illustrative name for the result of applying the remover.
        DataFrame dfFiltered = stopWords.transform(dfTemp2);

The above code filters out the words 'cashback', 'rs', and 'minimum'. However, what I want is to select only these words and remove everything else that does not match.

Spark Version : 1.6.0

Kindly suggest.


Solution

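  • I was able to find a way to do this: use RegexTokenizer with its setPattern() method and setGaps(false), so that the pattern describes the tokens to keep rather than the separators between them.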

    RegexTokenizer tokenizer = new RegexTokenizer().setInputCol("DealDescription").setOutputCol("BrokenDown")
                                            .setGaps(false).setPattern("cashback|rs|minimum");
    

    With this, I was able to select only the words 'cashback', 'rs', and 'minimum'.
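
One caveat with the pattern above: with setGaps(false) the tokenizer collects every regex match, and an unanchored alternative such as rs can also match inside a longer word. The following is a minimal plain-Java sketch of that matching behaviour. It does not use Spark; the class name KeepOnlyTokens, the helper matchTokens, and the sample sentence are hypothetical, and lowercasing the input first is an assumption meant to approximate the tokenizer's default behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeepOnlyTokens {

    // Collect every match of `regex` in the lowercased text, in order.
    // This approximates a tokenizer whose pattern describes tokens, not gaps.
    static List<String> matchTokens(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text.toLowerCase());
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical deal description, for illustration only.
        String deal = "Offers 10% Cashback, minimum order Rs 500";

        // Unanchored pattern: "rs" also matches inside "offers".
        System.out.println(matchTokens(deal, "cashback|rs|minimum"));
        // prints [rs, cashback, minimum, rs]

        // Word boundaries restrict matches to whole words.
        System.out.println(matchTokens(deal, "\\b(cashback|rs|minimum)\\b"));
        // prints [cashback, minimum, rs]
    }
}
```

If partial-word matches are unwanted, the same word-boundary pattern could be passed to setPattern() in place of "cashback|rs|minimum"; that substitution is a suggestion and was not tested against Spark 1.6.0 here.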