Tags: apache-spark, nlp, apache-spark-mllib, apache-spark-ml, n-gram

NGram on dataset with one word


I'm dabbling with Spark ML, trying to build a fuzzy match using Spark's out-of-the-box capabilities. Along the way I'm building NGrams with n=2, but some lines in my dataset contain only a single word, and the Spark pipeline fails on them. Regardless of Spark, I'm wondering what the general approach to this problem would be, i.e. what if tokens


Solution

  • Scala approach. Normally this should work with one word as well and not fail or crash. Using plain sliding (not MLlib) with sentence parsing, you get a "bigram" of a single word, which is debatable of course, like this:

    val rdd = sc.parallelize(Array("Hello my Friend. How are",
                                   "you today? bye my friend.",
                                   "singleword"))
    rdd.map{ 
        // Split each line into substrings by periods
        _.split('.').map{ substrings =>
            // Trim substrings and then tokenize on spaces
            substrings.trim.split(' ').map{_.replaceAll("""\W""", "").toLowerCase()}.
            // Find bigrams, etc.
            sliding(2)
         }.
        // Flatten, and map the ngrams to concatenated strings
        flatMap{identity}.map{_.mkString(" ")}.
        // Group the bigrams and count their frequency
        groupBy{identity}.mapValues{_.size}
    }.
    // Reduce to get a global count, then collect.  
    flatMap{identity}.reduceByKey(_+_).collect.
    // Print
    foreach{x=> println(x._1 + ", " + x._2)}
    

    This does not fail on "singleword" but yields it as a single word:

    you today, 1
    hello my, 1
    singleword, 1
    my friend, 2
    how are, 1
    bye my, 1
    today bye, 1
    

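    Why the single word survives: Scala's `sliding(k)` emits one final, partially filled group when the whole collection is smaller than `k`, rather than an empty iterator. A quick plain-Scala check, no Spark needed:

    ```scala
    // sliding(2) over two or more tokens yields overlapping pairs
    Array("my", "friend", "bye").sliding(2).map(_.mkString(" ")).toList
    // List(my friend, friend bye)

    // with a single token, sliding(2) yields one undersized group,
    // which is why "singleword" appears in the counts above
    Array("singleword").sliding(2).map(_.mkString(" ")).toList
    // List(singleword)
    ```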
    Using MLlib's sliding, which runs over line boundaries, with this input:

    the quick brown fox.
    singleword.
    two words.
    

    using:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val wordsRdd = sc.textFile("/FileStore/tables/sliding.txt", 1)
    val wordsRDDTextSplit = wordsRdd
      .map(line => line.trim.split(" "))
      .flatMap(x => x)
      .map(_.toLowerCase)
      // Normalize punctuation: drop commas, turn ! ? and runs of . into a single .
      .map(_.replaceAll(",{1,}", ""))
      .map(_.replaceAll("!{1,}", "."))
      .map(_.replaceAll("\\?{1,}", "."))
      .map(_.replaceAll("\\.{1,}", "."))
      .map(_.replaceAll("\\W+", "."))
      .filter(_ != ".")
      .filter(_ != "")
      .map(_.replace(".", ""))
      // MLlib's sliding windows across the whole RDD, crossing line boundaries
      .sliding(2)
      .collect
    

    you get:

     wordsRDDTextSplit: Array[Array[String]] = Array(Array(the, quick), Array(quick, brown), Array(brown, fox), Array(fox, singleword), Array(singleword, two), Array(two, words))
    

    Note that I parse the lines differently here.

    When running the above on an input of just one line containing a single word, I get an empty result:

    wordsRDDTextSplit: Array[Array[String]] = Array()
    

    So you can choose whether to build the n-grams per line or across the whole corpus, etc.
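    As for the general approach the question asks about, one common option is to guard the n-gram step: when a line has fewer tokens than n, fall back to emitting the tokens themselves instead of nothing. A minimal sketch in plain Scala (the `ngrams` helper and its fallback policy are my own, not part of any Spark API):

    ```scala
    object SafeNGrams {
      // Emit sliding n-grams; for lines shorter than n, fall back to the
      // whole (short) token sequence so single-word lines are not dropped.
      def ngrams(tokens: List[String], n: Int): List[String] =
        if (tokens.length >= n) tokens.sliding(n).map(_.mkString(" ")).toList
        else List(tokens.mkString(" "))

      def main(args: Array[String]): Unit = {
        println(ngrams(List("two", "words"), 2))   // List(two words)
        println(ngrams(List("singleword"), 2))     // List(singleword)
      }
    }
    ```

    The same guard can be applied inside an RDD `map` over tokenized lines, so a one-word line contributes a 1-gram instead of crashing or vanishing.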