
Use regex function on Polars


I'm cleaning a column of Spanish text using the following function, which uses re and unicodedata:

import re
import unicodedata

def CleanText(texto: str) -> str:
    texto = texto.lower()
    texto = ''.join((c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn'))
    texto = re.sub(r'[^a-z0-9 \n\.,]', '', texto)
    texto = re.sub(r'([.,])(?![\s])', r'\1 ', texto)
    texto = re.sub(r'\s+', ' ', texto).strip()
    texto = texto.replace('.', '')
    texto = texto.replace(',', '')
    return texto
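
As a worked trace of what each step contributes, the same operations can be applied one at a time to a short input. The sample sentence and the variable names are made up for this illustration; the statements mirror the function above.

```python
import re
import unicodedata

# Hypothetical sample input, chosen to exercise every step.
sample = "¡Hola, Mundo! Canción.Niño"

# 1. Lowercase.
lowered = sample.lower()
# 2. Decompose (NFD) and drop combining marks, which strips accents.
unaccented = "".join(
    c for c in unicodedata.normalize("NFD", lowered)
    if unicodedata.category(c) != "Mn"
)
# 3. Keep only a-z, digits, space, newline, '.' and ','.
filtered = re.sub(r"[^a-z0-9 \n\.,]", "", unaccented)
# 4. Add a space after '.' or ',' when no whitespace follows (the lookahead).
spaced = re.sub(r"([.,])(?![\s])", r"\1 ", filtered)
# 5. Collapse whitespace runs and trim the ends.
collapsed = re.sub(r"\s+", " ", spaced).strip()
# 6. Remove the punctuation itself.
cleaned = collapsed.replace(".", "").replace(",", "")

print(unaccented)  # ¡hola, mundo! cancion.nino
print(filtered)    # hola, mundo cancion.nino
print(spaced)      # hola, mundo cancion. nino
print(cleaned)     # hola mundo cancion nino
```

Step 4 is the only one that relies on a lookahead; its job is to keep words apart once the punctuation is removed in step 6.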

I then apply it to my DataFrame using:

(
    df
    .with_columns(
        pl.col("Comment").map_elements(CleanText, return_dtype=pl.String).alias("CleanedText")
    )
)

However, since Polars uses the Rust regex crate, I think I could do the cleaning with Polars alone, without needing to create auxiliary functions.

How can I do the same with just a Polars expression?


Solution

  • Two things:

    • The regex crate currently doesn't support lookaheads, so I've adjusted the function a bit.
    • I don't think there's a native Polars function for normalizing Unicode data yet, so you can either keep running that step in Python or check this (undocumented?) pyarrow function: pyarrow.compute.utf8_normalize(). Check this question as well.
    import polars as pl

    def CleanExpr(texto: pl.Expr) -> pl.Expr:
        texto = texto.str.to_lowercase()
        # Accent stripping is not ported; the original Python step was:
        #texto = ''.join((c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn'))
        texto = texto.str.replace_all(r'[^a-z0-9 \n\.,]', '')
        # Instead of the lookahead, turn every '.' and ',' into a space;
        # the whitespace collapse below removes any extras.
        texto = texto.str.replace_all('.', ' ', literal=True)
        texto = texto.str.replace_all(',', ' ', literal=True)
        texto = texto.str.replace_all(r'\s+', ' ').str.strip_chars()
        # No '.' or ',' remain at this point, so the final removals from the
        # original function are not needed.
        return texto
    
    (
        df
        .with_columns(
            CleanExpr(pl.col("Comment")).alias("CleanedText")
        )
    )