Using regex functions in Polars

I'm cleaning a column of Spanish text with the following function, which uses re and unicodedata:

import re
import unicodedata

def CleanText(texto: str) -> str:
    texto = texto.lower()
    texto = ''.join((c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn'))
    texto = re.sub(r'[^a-z0-9 \n\.,]', '', texto)
    texto = re.sub(r'([.,])(?![\s])', r'\1 ', texto)
    texto = re.sub(r'\s+', ' ', texto).strip()
    texto = texto.replace('.', '')
    texto = texto.replace(',', '')
    return texto

I then apply it to my DataFrame using:

(
    df
    .with_columns(
        pl.col("Comment").map_elements(CleanText, return_dtype=pl.String).alias("CleanedText")
    )
)

However, since Polars uses the Rust regex crate, I think I could do the cleaning in Polars itself without writing an auxiliary function.

How can I do the same with just a Polars expression?

Solution

Two things:

The regex crate currently doesn't support lookaheads, so I've adjusted the function a bit.
I don't think there's a native Polars function for normalizing Unicode data yet, so you can either keep running that step in Python or check this (undocumented?) pyarrow function: pyarrow.compute.utf8_normalize(). Check this question as well.
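
The first point can be checked in plain Python before moving to Polars. The lookahead only inserts a space after punctuation that is removed at the end anyway, so turning every '.' and ',' into a space before collapsing whitespace should have the same effect on typical input. A minimal sketch, with hypothetical helper names (with_lookahead, without_lookahead) and the accent stripping left out:

```python
import re


def with_lookahead(text: str) -> str:
    # Steps of the original function, minus the accent stripping.
    text = text.lower()
    text = re.sub(r'[^a-z0-9 \n\.,]', '', text)
    text = re.sub(r'([.,])(?![\s])', r'\1 ', text)  # needs a lookahead
    text = re.sub(r'\s+', ' ', text).strip()
    return text.replace('.', '').replace(',', '')


def without_lookahead(text: str) -> str:
    # Same steps, but punctuation becomes a space up front, so no lookahead.
    text = text.lower()
    text = re.sub(r'[^a-z0-9 \n\.,]', '', text)
    text = re.sub(r'[.,]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


for sample in ["Hola,mundo.", "Precio: 3.50, aprox.", "uno , dos"]:
    print(repr(with_lookahead(sample)), repr(without_lookahead(sample)))
```

The two versions agree on the first two samples. They can differ on less tidy input, for example when a punctuation mark follows whitespace (the last sample): the original removes punctuation after collapsing whitespace, so it leaves a stray space there, while the lookahead-free version does not.
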

import polars as pl

def CleanExpr(texto: pl.Expr) -> pl.Expr:
    texto = texto.str.to_lowercase()
    # Not translated: accent stripping, which needs Unicode normalization.
    #texto = ''.join((c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn'))
    texto = texto.str.replace_all(r'[^a-z0-9 \n\.,]', '')
    # Instead of the lookahead: turn punctuation into spaces, then collapse whitespace.
    # No '.' or ',' is left afterwards, so the final removal step is not needed.
    texto = texto.str.replace_all('.', ' ', literal=True)
    texto = texto.str.replace_all(',', ' ', literal=True)
    texto = texto.str.replace_all(r'\s+', ' ').str.strip_chars()
    return texto

(
    df
    .with_columns(
        CleanExpr(pl.col("Comment")).alias("CleanedText")
    )
)