
Tokenize a sentence where each word contains only letters using RegexTokenizer (Scala)


I am using Spark with Scala and trying to tokenize a sentence so that each word contains only letters. Here is my code:

import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.{DataFrame, SparkSession}

def tokenization(extractedText: String): DataFrame = {

    val existingSparkSession = SparkSession.builder().getOrCreate()
    val textDataFrame = existingSparkSession.createDataFrame(Seq(
      (0, extractedText))).toDF("id", "sentence")
    // Tokenize by splitting the sentence on non-word characters.
    val regexTokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\W")
    val regexTokenized = regexTokenizer.transform(textDataFrame)
    regexTokenized.select("sentence", "words").show(false)
    regexTokenized
  }

If I provide the sentence "I am going to school5", after tokenization it should contain only [i, am, going, to] and should drop school5. But with my current pattern it doesn't ignore digits within words. How am I supposed to drop words that contain digits?


Solution

  • You can use the settings below to get the tokenization you want. Essentially, you extract only those words that consist entirely of letters, using an appropriate regex pattern.

    val regexTokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setGaps(false)
      .setPattern("\\b[a-zA-Z]+\\b")
    
    val regexTokenized = regexTokenizer.transform(textDataFrame)
    
    regexTokenized.show(false)
    +---+---------------------+------------------+
    |id |sentence             |words             |
    +---+---------------------+------------------+
    |0  |I am going to school5|[i, am, going, to]|
    +---+---------------------+------------------+
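
    Note the \b word boundaries in the pattern: they are what make school5 disappear entirely. Without them, matching mode would still extract the letter run "school" from inside "school5". A minimal sketch of that difference, reusing textDataFrame from the question:

    // Hypothetical variant without \b anchors: "school5" now leaks the
    // partial token "school" into the output.
    val noBoundaries = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setGaps(false)
      .setPattern("[a-zA-Z]+")

    noBoundaries.transform(textDataFrame).select("words").show(false)
    // +--------------------------+
    // |words                     |
    // +--------------------------+
    // |[i, am, going, to, school]|
    // +--------------------------+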
    

    For why I set gaps to false, see the docs:

    A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

    You want to repeatedly match the regex, rather than splitting the text by a given regex.
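
    To make the two modes concrete, here is a minimal sketch, assuming the same textDataFrame as in the question: with gaps = true the pattern describes the separators between tokens, while with gaps = false it describes the tokens themselves.

    // gaps = true (the default): "\\W" describes the separators, so the
    // sentence is split on non-word characters and "school5" survives
    // as a single token -> [i, am, going, to, school5]
    val splitOnGaps = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setGaps(true)
      .setPattern("\\W")

    // gaps = false: the pattern describes the tokens themselves, so only
    // letter-only words delimited by \b are extracted -> [i, am, going, to]
    val matchTokens = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setGaps(false)
      .setPattern("\\b[a-zA-Z]+\\b")

    Seq(splitOnGaps, matchTokens).foreach(_.transform(textDataFrame).show(false))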