I'm building a pipeline in Spark NLP (version 3.2.1) to create tokens from a string column that contains search terms separated by commas.
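from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer
documentAssemblerteste = DocumentAssembler() \
    .setInputCol("searched_terms") \
    .setOutputCol("str_search_term_doc")
regexTokenizerteste = Tokenizer() \
    .setInputCols(["str_search_term_doc"]) \
    .setOutputCol("token")  # output column name is an assumption
finisherteste = Finisher() \
    .setInputCols(["token"])
pipeline = Pipeline().setStages([documentAssemblerteste, regexTokenizerteste, finisherteste])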
But the output is not what I expect. For example, a row that contains these search terms
"pizza, hot dog, supermarket"
is tokenized as
['pizza', 'hot', 'dog', 'supermarket']
But I want it to ignore spaces and give me the following output:
['pizza', 'hot dog', 'supermarket']
How can I achieve this result?
Spark ML provides a RegexTokenizer class. To set the regex pattern, use the setPattern()
method or pass the pattern as a keyword argument to the constructor.
Just be aware that it uses the Java dialect of regex, as mentioned here.
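As a rough sketch of what that could look like for the question's data: the pattern below treats a comma followed by optional whitespace as the delimiter. The names PATTERN, build_tokenizer, split_locally and the output column "tokens" are made up for illustration, and the input column is taken from the question. The pattern is simple enough that Python's re module and Java regex should agree on it, so the small local helper is only there to sanity-check the pattern without starting Spark.

```python
import re

# Delimiter: a comma followed by any amount of whitespace.
PATTERN = r",\s*"


def build_tokenizer(input_col="searched_terms", output_col="tokens"):
    """Return a Spark ML RegexTokenizer that splits on PATTERN.

    The column names are assumptions; adjust them to your DataFrame.
    """
    # Imported here so the local check below works without pyspark installed.
    from pyspark.ml.feature import RegexTokenizer

    return RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern=PATTERN,
        gaps=True,          # the pattern matches the gaps between tokens
        toLowercase=False,  # the default is True, which lowercases tokens
    )


def split_locally(text):
    """Apply the same pattern with Python's re module, dropping empty tokens."""
    return [token for token in re.split(PATTERN, text) if token]


print(split_locally("pizza, hot dog, supermarket"))
# ['pizza', 'hot dog', 'supermarket']
```

With a DataFrame df that has the searched_terms column, something like build_tokenizer().transform(df) should add the array column of tokens. Since this is the Spark ML class rather than a Spark NLP annotator, it works on the string column directly and does not need the DocumentAssembler or Finisher stages.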