I'm using the Spark NLP pipeline to preprocess my data. Instead of only removing punctuation, the normalizer also removes umlauts.
My code:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

documentAssembler = DocumentAssembler() \
    .setInputCol("column") \
    .setOutputCol("column_document") \
    .setCleanupMode("shrink_full")

tokenizer = Tokenizer() \
    .setInputCols(["column_document"]) \
    .setOutputCol("column_token") \
    .setMinLength(2) \
    .setMaxLength(30)

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)
Example:
Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!
Output:
Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut
Expected Output:
Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut
The \w pattern is not Unicode-aware by default. Spark NLP compiles the cleanup patterns on the JVM with Java's regex engine, where \w matches only [a-zA-Z0-9_], so characters like Ä and ü fall into the [^\w -] class and get stripped. You need to make the pattern Unicode-aware with a regex option; in this case, the easiest way is the embedded flag expression (?U):

"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"
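Applied to the Normalizer stage from the question, a minimal sketch (same column names as above; only the pattern changes):

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)

Note that setLowercase(True) will still lowercase the tokens; drop that call if the output should keep "Ich" and "Äpfel" capitalized, as in your expected output.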
More details from the java.util.regex.Pattern documentation:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties. The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U). The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
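For completeness, a minimal end-to-end sketch that wires the stages into a Spark ML Pipeline and prints the normalized tokens. This assumes Spark NLP is installed, and that documentAssembler, tokenizer, and normalizer are the stages from the question, with the (?U)-prefixed cleanup pattern in the Normalizer:

from pyspark.ml import Pipeline
import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# One-row test frame with the example sentence from the question
df = spark.createDataFrame(
    [("Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!",)],
    ["column"],
)

# Stages as defined above, with the Unicode-aware cleanup pattern
pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])
result = pipeline.fit(df).transform(df)

# Annotation columns are arrays of structs; .result extracts the token strings
result.selectExpr("column_normalized.result").show(truncate=False)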