I want to remove character accents from a text column, e.g. convert Piña to Pina.
This is how I would do it in pandas:
    (names
     .str.normalize('NFKD')
     .str.encode('ascii', errors='ignore')
     .str.decode('utf-8'))
Polars has str.decode and str.encode, but they don't seem to be what I'm looking for.
Thanks!
To expand on @jqurious's comment, you can do one of two things:
Use map_elements, like this:
    import polars as pl
    from unicodedata import normalize

    df.with_columns(
        a=pl.col('a').map_elements(
            # decompose accented characters, then drop the non-ASCII combining marks
            lambda x: normalize('NFKD', x)
            .encode('ascii', errors='ignore')
            .decode('utf-8')
        )
    )
Or use map_batches, like this:
    from unicodedata import normalize

    def custnorm(In_series):
        # replace only the values that normalization actually changes
        for i, x in enumerate(In_series):
            newvalue = normalize('NFKD', x).encode('ascii', errors='ignore').decode('utf-8')
            if newvalue != x:
                In_series[i] = newvalue
        return In_series
Then apply it to the column like this:
df.with_columns(a=pl.col('a').map_batches(custnorm))
The difference between map_elements and map_batches is that map_elements tells polars to do the looping one row at a time, whereas map_batches tells polars to feed the whole column as a Series to the function, which must then return a Series of the same size.
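Both approaches rely on the same per-string transformation, so it can help to pull it out into a small helper and check it on its own first. This is only a sketch using the standard library: the name strip_accents is made up for illustration (it is not a polars function), and the None check is an assumption that your column may contain nulls.

```python
from typing import Optional
from unicodedata import normalize


def strip_accents(text: Optional[str]) -> Optional[str]:
    """Hypothetical helper: drop accents by decomposing and discarding non-ASCII."""
    if text is None:
        # Assumption: pass nulls through instead of letting normalize() raise.
        return None
    # NFKD splits "ñ" into "n" plus a combining tilde; the ASCII encode with
    # errors="ignore" then discards the combining mark.
    return normalize("NFKD", text).encode("ascii", errors="ignore").decode("ascii")


print(strip_accents("Piña"))   # Pina
print(strip_accents("Søren"))  # Sren
```

The helper can then be passed in place of the lambda, e.g. pl.col('a').map_elements(strip_accents). Note that characters with no Unicode decomposition, such as "ø" or "ß", are dropped entirely rather than replaced with a base letter, so check whether that is acceptable for your data.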