I'm cleaning a column of Spanish text using the following function, which uses re and unicodedata:
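import re
import unicodedata

def CleanText(texto: str) -> str:
    texto = texto.lower()
    # Decompose accented characters (NFD) and drop the combining marks ('Mn')
    texto = ''.join(c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn')
    # Keep only lowercase ASCII letters, digits, spaces, newlines, '.' and ','
    texto = re.sub(r'[^a-z0-9 \n\.,]', '', texto)
    # Add a space after any '.' or ',' that is not already followed by whitespace
    texto = re.sub(r'([.,])(?![\s])', r'\1 ', texto)
    # Collapse runs of whitespace and trim both ends
    texto = re.sub(r'\s+', ' ', texto).strip()
    # Finally drop the punctuation itself
    texto = texto.replace('.', '')
    texto = texto.replace(',', '')
    return texto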
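To make the accent-stripping line easier to follow, here is that step in isolation. This is only an illustrative sketch using the standard library; strip_accents is a hypothetical helper name and is not part of the function above.

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", s)
    # Category 'Mn' (nonspacing mark) covers the combining accents, tilde and diaeresis
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("canción, año, pingüino"))  # cancion, ano, pinguino
```

Because this runs before the `[^a-z0-9 \n\.,]` filter, letters such as "ó" and "ñ" survive as "o" and "n" instead of being deleted.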
I then apply it to my DataFrame using:
(
    df
    .with_columns(
        pl.col("Comment").map_elements(CleanText, return_dtype=pl.String).alias("CleanedText")
    )
)
However, since Polars uses the Rust regex crate for its string expressions, I think I could do the cleaning with Polars alone, without needing an auxiliary function.
How can I do the same thing with just a Polars expression?
Two things:
1. The regex crate doesn't support lookaheads, so I've adjusted the function a bit.
2. For the Unicode normalization step (left commented out below), see pyarrow.compute.utf8_normalize(). Check this question as well.

def CleanExpr(texto: pl.Expr) -> pl.Expr:
    texto = texto.str.to_lowercase()
    # Not ported: accent stripping via unicodedata (see point 2)
    #texto = ''.join((c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn'))
    texto = texto.str.replace_all(r'[^a-z0-9 \n\.,]', '')
    # Instead of the lookahead, turn every '.' and ',' into a space
    texto = texto.str.replace_all('.', ' ', literal=True)
    texto = texto.str.replace_all(',', ' ', literal=True)
    texto = texto.str.replace_all(r'\s+', ' ').str.strip_chars()
    # No '.' or ',' remain at this point, so these two calls are no-ops
    texto = texto.str.replace_all('.', '', literal=True)
    texto = texto.str.replace_all(',', '', literal=True)
    return texto
(
    df
    .with_columns(
        CleanExpr(pl.col("Comment")).alias("CleanedText")
    )
)