pyspark tokenize johnsnowlabs-spark-nlp

How to set Tokenizer() function of Spark NLP to split tokens by comma?


I'm building a pipeline in Spark NLP (version 3.2.1) to create tokens from a string column that contains search terms separated by commas.

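from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer

# Wrap the raw string column in a Spark NLP document annotation
documentAssemblerteste = DocumentAssembler() \
    .setInputCol("searched_terms") \
    .setOutputCol("searched_terms_doc") \
    .setCleanupMode("shrink_full")

# Tokenize the document; the input column must match the assembler's output column
regexTokenizerteste = Tokenizer() \
    .setInputCols(["searched_terms_doc"]) \
    .setSplitChars([","]) \
    .setOutputCol("token")

# Convert the token annotations into a plain array of strings
finisherteste = Finisher() \
    .setInputCols("token") \
    .setOutputCols("token_final")

pipeline = Pipeline().setStages([
    documentAssemblerteste,
    regexTokenizerteste,
    finisherteste
])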

But it does not give me the output I expect. For example, a row that contains these search terms

"pizza, hot dog, supermarket"

returns:

['pizza', 'hot', 'dog', 'supermarket']

But I want it to ignore the spaces inside each term and give me the following output:

['pizza', 'hot dog', 'supermarket']

How can I achieve this result?


Solution

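  • Spark ML provides a RegexTokenizer class (pyspark.ml.feature.RegexTokenizer) that splits a string column on a regular expression. To set the regex pattern, use the setPattern() method or pass pattern as a keyword argument to the constructor.

    Just be aware that the pattern is interpreted in the Java dialect of regex, as the PySpark documentation for RegexTokenizer notes.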
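
As a rough sketch of how that could look for the column in the question: the pattern `\s*,\s*` (a comma plus any whitespace around it) is my own choice, and the helper names `preview_split` and `tokenize_column` are made up for illustration. The column names `searched_terms` and `token_final` are taken from the question. `preview_split` only previews the pattern with Python's `re` module, which treats this particular pattern the same way Java does. The Spark part assumes `df` is an existing DataFrame with a string column.

```python
import re

# Assumed pattern: a comma together with any whitespace around it.
# This simple pattern means the same thing in Python and Java regex.
PATTERN = r"\s*,\s*"


def preview_split(text):
    """Hypothetical helper: preview the split in plain Python, without Spark."""
    return [t for t in re.split(PATTERN, text.strip()) if t]


def tokenize_column(df, input_col="searched_terms", output_col="token_final"):
    """Hypothetical helper: apply Spark ML's RegexTokenizer to a string column."""
    # Imported here so the preview above can run without a Spark installation.
    from pyspark.ml.feature import RegexTokenizer

    tokenizer = RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern=PATTERN,
        gaps=True,          # the pattern matches the separators, not the tokens
        toLowercase=False,  # the default is True, which would lowercase every token
    )
    return tokenizer.transform(df)


if __name__ == "__main__":
    print(preview_split("pizza, hot dog, supermarket"))
    # ['pizza', 'hot dog', 'supermarket']
```

Since RegexTokenizer reads a plain string column and writes an array of strings, this route should not need the DocumentAssembler or Finisher stages. Whitespace at the very start or end of the string is not matched by the pattern, so trim the column first if that can occur.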