I'm using the Spark NLP pipeline to preprocess my data. Instead of only removing punctuation, the normalizer also removes umlauts.
My code:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

documentAssembler = DocumentAssembler() \
    .setInputCol("column") \
    .setOutputCol("column_document") \
    .setCleanupMode("shrink_full")

tokenizer = Tokenizer() \
    .setInputCols(["column_document"]) \
    .setOutputCol("column_token") \
    .setMinLength(2) \
    .setMaxLength(30)

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)
Example:
Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!
Output:
Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut
Expected Output:
Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut
The \w pattern is not Unicode-aware by default. Spark NLP compiles the cleanup patterns on the JVM with Java's regex engine, where \w matches only [a-zA-Z0-9_], so characters like Ä and ü fall into the [^\w -] class and get stripped. You need to make the pattern Unicode-aware with a regex option; in this case, the easiest way is the embedded flag expression (?U):

"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"
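Applied to the Normalizer stage from the question, a minimal sketch (same column names as above; only the pattern changes):

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)

Note that setLowercase(True) will still lowercase the tokens; drop that call if the output should keep "Ich" and "Äpfel" capitalized, as in your expected output.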
More details from the java.util.regex.Pattern documentation:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties. The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U). The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
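For completeness, a minimal end-to-end sketch that wires the stages into a Spark ML Pipeline and prints the normalized tokens. This assumes Spark NLP is installed, and that documentAssembler, tokenizer, and normalizer are the stages from the question, with the (?U)-prefixed cleanup pattern in the Normalizer:

from pyspark.ml import Pipeline
import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# One-row test frame with the example sentence from the question
df = spark.createDataFrame(
    [("Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!",)],
    ["column"],
)

# Stages as defined above, with the Unicode-aware cleanup pattern
pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])
result = pipeline.fit(df).transform(df)

# Annotation columns are arrays of structs; .result extracts the token strings
result.selectExpr("column_normalized.result").show(truncate=False)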