I'm building a pipeline in Spark NLP (version 3.2.1) to create tokens from a string column that contains search terms separated by commas.
documentAssemblerteste = DocumentAssembler() \
.setInputCol("searched_terms") \
.setOutputCol("searched_terms_doc")\
.setCleanupMode("shrink_full")
regexTokenizerteste = Tokenizer() \
.setInputCols(["str_search_term_doc"]) \
.setSplitChars([","])\
.setOutputCol("token")
finisherteste = Finisher()\
.setInputCols("token")\
.setOutputCols("token_final")
pipeline = Pipeline().setStages([
documentAssemblerteste,
regexTokenizerteste,
finisherteste
])
But it doesn't give me the output I expect. For example, a row that contains these search terms
"pizza, hot dog, supermarket"
returns:
['pizza', 'hot', 'dog', 'supermarket']
But I want it to split only on the commas (keeping the spaces inside multi-word terms) and give me the following output:
['pizza', 'hot dog', 'supermarket']
How can I achieve this result?
Spark ML provides a RegexTokenizer class. To set the regex pattern, use the setPattern()
method or pass the pattern as a keyword argument to the constructor.
Just be aware that it uses the Java dialect of regex, as mentioned here.
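For instance, here is a minimal sketch reusing the column names from your question. The pattern \s*,\s* matches a comma plus any surrounding whitespace, so multi-word terms stay intact; toLowercase=False preserves the original casing, since RegexTokenizer lowercases by default:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("pizza, hot dog, supermarket",)], ["searched_terms"])

# gaps=True (the default) means the pattern matches the delimiters, not the tokens
regex_tokenizer = RegexTokenizer(
    inputCol="searched_terms",
    outputCol="token_final",
    pattern=r"\s*,\s*",
    toLowercase=False)

regex_tokenizer.transform(df).select("token_final").show(truncate=False)
# +-----------------------------+
# |token_final                  |
# +-----------------------------+
# |[pizza, hot dog, supermarket]|
# +-----------------------------+
Note that this is a plain Spark ML transformer, so it works directly on the string column without a DocumentAssembler or Finisher stage.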