Here's what I'm trying to do: read a CSV, normalize a phone number column to E.164, add a list_id column, drop rows that fail to parse, and write the result to a new CSV.
Here is my code:
import sys
import json
import polars as pl
import phonenumbers
# parse the JSON-encoded argument passed on the command line
args = json.loads(sys.argv[1])
#format phone number as E164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(phone_number, "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None
# scan the CSV, transform and filter the data, then write the output to a new CSV file
pl.scan_csv(args['path'], separator=args['delimiter']).select(
    [args['column']]
).with_columns(
    # cast the integer phone number to string and apply parse_phone_number
    pl.col(args['column'])
        .cast(pl.String)
        .map_elements(parse_phone_number, return_dtype=pl.String)
        .alias(args['column']),
    # add a list_id column with a constant value
    pl.lit(args['list_id']).alias("list_id")
).filter(
    # drop rows where the phone number failed to parse
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], separator=",")
I tested it on a file with 800k rows and 23 columns (about 150 MB); it takes around 20 seconds and more than 500 MB of RAM to complete.
Is this normal? Can I optimize the performance (at least the memory usage)?
I'm really new to Polars, I normally work with PHP, and I'm pretty new to Python too, so sorry if my code looks a bit dumb, haha.
You are using map_elements, which means you are effectively writing a Python for loop. That is often 10-100x slower than using native expressions. Try to avoid map_elements, and if you do use it, don't expect it to be fast.
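
For illustration, here is a rough expression-only sketch of the E.164 normalization. The column name "phone" is a placeholder for your args['column'], it assumes every input is a US number (10 digits, or 11 with a leading 1), and it skips the real validation that phonenumbers performs, so treat it as a starting point rather than a drop-in replacement:

import polars as pl

# keep only the digits, then build the E.164 string with native expressions
digits = pl.col("phone").cast(pl.String).str.replace_all(r"\D", "")

e164 = (
    pl.when(digits.str.len_chars() == 10)  # bare 10-digit US number
    .then(pl.lit("+1") + digits)
    .when((digits.str.len_chars() == 11) & digits.str.starts_with("1"))
    .then(pl.lit("+") + digits)  # 11 digits with a leading 1
    .otherwise(None)  # anything else is treated as unparseable
    .alias("phone")
)

Everything above stays inside the native engine, so it runs in parallel instead of calling back into Python once per row; you can pass e164 straight into with_columns in place of the map_elements expression. (On older Polars versions str.len_chars was named str.n_chars.)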
P.S. You can reduce memory usage by not casting the whole column to String up front, but instead casting inside your map_elements function. Though I don't think using 500 MB is that high; ideally Polars uses as much RAM as is available without going OOM. Unused RAM might be wasted potential.
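
A minimal sketch of that idea, reusing the question's parse_phone_number (again, "phone" stands in for args['column']):

import phonenumbers
import polars as pl

def parse_phone_number(phone_number):
    try:
        # cast one value at a time here, instead of materializing
        # a full String copy of the column before the map
        return phonenumbers.format_number(
            phonenumbers.parse(str(phone_number), "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None

# note: no .cast(pl.String) on the column itself any more
expr = pl.col("phone").map_elements(parse_phone_number, return_dtype=pl.String)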