Here's what I'm trying to do: read a CSV, normalize a phone number column to E.164, add a list_id column, drop rows that fail to parse, and write the result to a new CSV.
Here is my code:
import sys
import json
import polars as pl
import phonenumbers
# parse the JSON-encoded argument passed on the command line
args = json.loads(sys.argv[1])
#format phone number as E164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(phone_number, "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None
# scan the CSV, transform and filter the data, then write the output to a new CSV file
pl.scan_csv(args['path'], separator=args['delimiter']).select(
    [args['column']]
).with_columns(
    # cast the integer phone number to string and apply parse_phone_number
    pl.col(args['column'])
        .cast(pl.String)
        .map_elements(parse_phone_number, return_dtype=pl.String)
        .alias(args['column']),
    # add a list_id column with a constant value
    pl.lit(args['list_id']).alias("list_id")
).filter(
    # drop rows where the phone number failed to parse
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], separator=",")
I tested it on a file with 800k rows and 23 columns (about 150 MB); it takes around 20 seconds and more than 500 MB of RAM to complete.
Is this normal? Can I optimize the performance (at least the memory usage)?
I'm really new to Polars, I normally work with PHP, and I'm pretty new to Python too, so sorry if my code looks a bit dumb, haha.
You are using map_elements, which means you are effectively writing a Python for loop. That is often 10-100x slower than using native expressions. Try to avoid map_elements, and if you do use it, don't expect it to be fast.
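
For illustration, here is a rough expression-only sketch of the E.164 normalization. The column name "phone" is a placeholder for your args['column'], it assumes every input is a US number (10 digits, or 11 with a leading 1), and it skips the real validation that phonenumbers performs, so treat it as a starting point rather than a drop-in replacement:

import polars as pl

# keep only the digits, then build the E.164 string with native expressions
digits = pl.col("phone").cast(pl.String).str.replace_all(r"\D", "")

e164 = (
    pl.when(digits.str.len_chars() == 10)  # bare 10-digit US number
    .then(pl.lit("+1") + digits)
    .when((digits.str.len_chars() == 11) & digits.str.starts_with("1"))
    .then(pl.lit("+") + digits)  # 11 digits with a leading 1
    .otherwise(None)  # anything else is treated as unparseable
    .alias("phone")
)

Everything above stays inside the native engine, so it runs in parallel instead of calling back into Python once per row; you can pass e164 straight into with_columns in place of the map_elements expression. (On older Polars versions str.len_chars was named str.n_chars.)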
P.S. You can reduce memory usage by not casting the whole column to String up front, but instead casting inside your map_elements function. Though I don't think using 500 MB is that high; ideally Polars uses as much RAM as is available without going OOM. Unused RAM might be wasted potential.
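
A minimal sketch of that idea, reusing the question's parse_phone_number (again, "phone" stands in for args['column']):

import phonenumbers
import polars as pl

def parse_phone_number(phone_number):
    try:
        # cast one value at a time here, instead of materializing
        # a full String copy of the column before the map
        return phonenumbers.format_number(
            phonenumbers.parse(str(phone_number), "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None

# note: no .cast(pl.String) on the column itself any more
expr = pl.col("phone").map_elements(parse_phone_number, return_dtype=pl.String)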