Tags: apache-spark, nlp, apache-spark-mllib, apache-spark-ml, n-gram

NGram on dataset with one word


I'm dabbling with Spark ML, trying to build a fuzzy match using Spark's out-of-the-box capabilities. Along the way I'm building NGrams with n=2, but some lines in my dataset contain only a single word, and the Spark pipeline fails on them. Regardless of Spark, I'm wondering what the general approach to this problem would be, i.e. what if tokens


Solution

  • Scala approach. Normally this should work with one word as well and not fail or crash. Using plain sliding (not MLlib) with sentence parsing, you get a "bigram" of a single word, which is debatable of course, like this:

    val rdd = sc.parallelize(Array("Hello my Friend. How are",
                                   "you today? bye my friend.",
                                   "singleword"))
    rdd.map{ 
        // Split each line into substrings by periods
        _.split('.').map{ substrings =>
            // Trim substrings and then tokenize on spaces
            substrings.trim.split(' ').map{_.replaceAll("""\W""", "").toLowerCase()}.
            // Find bigrams, etc.
            sliding(2)
         }.
        // Flatten, and map the ngrams to concatenated strings
        flatMap{identity}.map{_.mkString(" ")}.
        // Group the bigrams and count their frequency
        groupBy{identity}.mapValues{_.size}
    }.
    // Reduce to get a global count, then collect.  
    flatMap{identity}.reduceByKey(_+_).collect.
    // Print
    foreach{x=> println(x._1 + ", " + x._2)}
    

    This does not fail on "singleword" but yields it as a single word:

    you today, 1
    hello my, 1
    singleword, 1
    my friend, 2
    how are, 1
    bye my, 1
    today bye, 1
    

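    Why the single word survives: Scala's `sliding(k)` emits one final, partially filled group when the whole collection is smaller than `k`, rather than an empty iterator. A quick plain-Scala check, no Spark needed:

    ```scala
    // sliding(2) over two or more tokens yields overlapping pairs
    Array("my", "friend", "bye").sliding(2).map(_.mkString(" ")).toList
    // List(my friend, friend bye)

    // with a single token, sliding(2) yields one undersized group,
    // which is why "singleword" appears in the counts above
    Array("singleword").sliding(2).map(_.mkString(" ")).toList
    // List(singleword)
    ```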
    Using MLlib's sliding, which runs over line boundaries, with this input:

    the quick brown fox.
    singleword.
    two words.
    

    using:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val wordsRdd = sc.textFile("/FileStore/tables/sliding.txt", 1)
    val wordsRDDTextSplit = wordsRdd
      .map(line => line.trim.split(" "))
      .flatMap(x => x)
      .map(_.toLowerCase)
      // Normalize punctuation: drop commas, turn ! ? and runs of . into a single .
      .map(_.replaceAll(",{1,}", ""))
      .map(_.replaceAll("!{1,}", "."))
      .map(_.replaceAll("\\?{1,}", "."))
      .map(_.replaceAll("\\.{1,}", "."))
      .map(_.replaceAll("\\W+", "."))
      .filter(_ != ".")
      .filter(_ != "")
      .map(_.replace(".", ""))
      // MLlib's sliding windows across the whole RDD, crossing line boundaries
      .sliding(2)
      .collect
    

    you get:

     wordsRDDTextSplit: Array[Array[String]] = Array(Array(the, quick), Array(quick, brown), Array(brown, fox), Array(fox, singleword), Array(singleword, two), Array(two, words))
    

    Note that I parse the lines differently here.

    When running the above on an input of just one line containing a single word, I get an empty result:

    wordsRDDTextSplit: Array[Array[String]] = Array()
    

    So you can choose whether to build the n-grams per line or across the whole corpus, etc.
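    As for the general approach the question asks about, one common option is to guard the n-gram step: when a line has fewer tokens than n, fall back to emitting the tokens themselves instead of nothing. A minimal sketch in plain Scala (the `ngrams` helper and its fallback policy are my own, not part of any Spark API):

    ```scala
    object SafeNGrams {
      // Emit sliding n-grams; for lines shorter than n, fall back to the
      // whole (short) token sequence so single-word lines are not dropped.
      def ngrams(tokens: List[String], n: Int): List[String] =
        if (tokens.length >= n) tokens.sliding(n).map(_.mkString(" ")).toList
        else List(tokens.mkString(" "))

      def main(args: Array[String]): Unit = {
        println(ngrams(List("two", "words"), 2))   // List(two words)
        println(ngrams(List("singleword"), 2))     // List(singleword)
      }
    }
    ```

    The same guard can be applied inside an RDD `map` over tokenized lines, so a one-word line contributes a 1-gram instead of crashing or vanishing.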