
Python Polars consuming high memory and taking a long time


This is what I'm trying to do:

  1. Scan the CSV using a Polars lazy DataFrame
  2. Format the phone number using a function
  3. Remove nulls and duplicates
  4. Write the output to a new CSV file

Here is my code:

import sys
import json
import polars as pl
import phonenumbers

#define the variable and parse the encoded json
args = json.loads(sys.argv[1])

#format phone number as E164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(phonenumbers.parse(phone_number, "US"), phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        return None

#scan the csv file, filter and modify the data, then write the output to a new csv file
pl.scan_csv(args['path'], separator=args['delimiter']).select(
    [args['column']]
).with_columns(
    #convert the int phone number to a string and apply the parse_phone_number function
    pl.col(args['column']).cast(pl.String).map_elements(parse_phone_number).alias(args['column']),
    #add another column list_id with value 100
    pl.lit(args['list_id']).alias("list_id")
    
).filter(
    #filter nulls
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], separator=",")

I tested it with a file of 800k rows and 23 columns (about 150 MB); it takes around 20 seconds and uses more than 500 MB of RAM before it completes.

Is this normal? Can I optimize the performance (the memory usage at least)?

I'm really new to Polars, I work with PHP, and I'm still a noob at Python too, so sorry if my code looks a bit dumb haha.


Solution

  • You are using map_elements, which means you are effectively writing a Python for loop. This is often 10-100x slower than using expressions.

    Try to avoid map_elements, and if you do use it, don't expect it to be fast. (A rough expression-only sketch is included at the end of this answer.)

    P.S. You can reduce memory usage by not casting the whole column to String, but instead casting inside your map_elements function, as in the first sketch below. Though I don't think using 500 MB is that high. Ideally Polars uses as much RAM as is available without going OOM; unused RAM is wasted potential.
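
    Here is a minimal sketch of that P.S. suggestion, assuming the same args dict and column as in the question: drop the column-wide .cast(pl.String) and convert each value to a string inside the function instead.

    import polars as pl
    import phonenumbers

    def parse_phone_number(phone_number):
        try:
            # convert to string per element here, instead of casting the whole column up front
            return phonenumbers.format_number(
                phonenumbers.parse(str(phone_number), "US"),
                phonenumbers.PhoneNumberFormat.E164,
            )
        except phonenumbers.NumberParseException:
            return None

    # the expression in the pipeline then becomes:
    # pl.col(args['column']).map_elements(parse_phone_number, return_dtype=pl.String).alias(args['column'])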
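
    And for completeness, a pure-expression version could look roughly like the sketch below. This is an assumption on my part: it only handles plain 10- and 11-digit US numbers by stripping non-digits, and it does not replicate the validation that the phonenumbers library performs.

    import polars as pl

    # keep digits only, e.g. "(555) 123-4567" -> "5551234567"
    digits = pl.col(args["column"]).cast(pl.String).str.replace_all(r"\D", "")

    normalized = (
        pl.when(digits.str.len_chars() == 10)
        .then(pl.concat_str([pl.lit("+1"), digits]))  # assume US country code
        .when((digits.str.len_chars() == 11) & digits.str.starts_with("1"))
        .then(pl.concat_str([pl.lit("+"), digits]))
        .otherwise(None)  # anything else becomes null and is filtered out
        .alias(args["column"])
    )

    (
        pl.scan_csv(args["path"], separator=args["delimiter"])
        .select([args["column"]])
        .with_columns(normalized, pl.lit(args["list_id"]).alias("list_id"))
        .filter(pl.col(args["column"]).is_not_null())
        .unique(keep="last")
        .collect()
        .write_csv(args["saved_path"], separator=",")
    )

    Whether that trade-off is acceptable depends on how dirty your input is; if you need real number validation, keep phonenumbers and accept the UDF cost.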