
In polars, is there a way to remove character accents from string columns?


I want to remove character accents from a text column, e.g. convert Piña to Pina.
This is how I would do it in pandas:

(names
 .str.normalize('NFKD')
 .str.encode('ascii', errors='ignore')
 .str.decode('utf-8'))

Polars has str.decode and str.encode, but they don't seem to be what I'm looking for.
Thanks!


Solution

  • To expand on @jqurious's comment, you can do one of two things:

    1. map_elements/lambda

    like this:

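    import polars as pl
    from unicodedata import normalize

    # Normalize each value, then drop the non-ASCII combining marks
    df.with_columns(
        a=pl.col('a')
            .map_elements(
                lambda x: normalize('NFKD', x)
                            .encode('ascii', errors='ignore')
                            .decode('utf-8'),
                return_dtype=pl.String))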
    
    2. define function/map_batches

    like this:

    import polars as pl
    from unicodedata import normalize

    def custnorm(in_series):
        # Build a new Series instead of mutating the input; nulls pass through
        return pl.Series(
            in_series.name,
            [None if x is None
             else normalize('NFKD', x).encode('ascii', errors='ignore').decode('utf-8')
             for x in in_series],
            dtype=pl.String)
    

    then inside the df you can do

    df.with_columns(a=pl.col('a').map_batches(custnorm))
    

    The difference between map_elements and map_batches is that map_elements has polars call the function once per value, whereas map_batches passes the whole column to the function as a Series, and the function must then return a Series of the same length.
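
The normalization step that both options share can be pulled out into a plain helper and checked without polars. This is only a minimal sketch: the name strip_accents is hypothetical and not part of polars or the answer above. It also shows one caveat of the approach: characters that have no Unicode decomposition are dropped entirely rather than replaced by a base letter.

```python
from unicodedata import normalize


def strip_accents(text: str) -> str:
    """Remove accents by decomposing characters and discarding non-ASCII ones."""
    # NFKD splits 'ñ' into a plain 'n' followed by a combining tilde (U+0303).
    decomposed = normalize("NFKD", text)
    # errors="ignore" silently drops every non-ASCII character,
    # which includes the combining marks produced above.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(strip_accents("Piña"))   # Pina
print(strip_accents("Søren"))  # Sren: 'ø' has no decomposition, so it is dropped
```

A helper like this could presumably be passed to map_elements in place of the lambda, since map_elements calls the function once per value.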