python-polars

Parsing data from Polars LazyFrame


Pre-requisites: I'm collecting large amounts of data in CSV files with two columns. For storage and speed I'm trying to convert them to Parquet.

What I'm trying to achieve:

  • Read a Parquet file into a LazyFrame.
  • Iterate through every cell of a column.
  • Extract the data from each cell and feed it to a function.
  • Save the output of the function (a dict) in a list.
  • Write the dicts to a new CSV file (I would prefer to do this in a streaming way too, because I can't hold all the results in memory).

import polars as pl

df = pl.scan_parquet('big_file.pq')
results = []
htmls = df.select(["html"]).collect(streaming=True)
counter = 0
for item in htmls:
    counter += 1
    if counter == 3:
        break
    result = parser(item)
    results.append(result)

With this code I end up with a Series in my htmls variable, and I don't know how to iterate through it. I searched the docs but unfortunately couldn't find a solution.

Demo CSV (the CSV that I'm converting to Parquet before parsing)


Solution

  • If I understand your need correctly, you can use the map_elements function for this.

    Example:

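    import polars as pl
    
    df = pl.scan_csv('test_5_lines.csv')
    
    def html_udf(html_string):
        # Return a dict; Polars stores it as a struct value
        return {
            'a': html_string[:5],
            'b': html_string[5:10],
            'c': html_string[4:12]
        }
    
    # Declare the struct schema so the lazy plan knows what unnest will produce
    html_struct = pl.Struct({'a': pl.String, 'b': pl.String, 'c': pl.String})
    
    (
        df.select(
            pl.col('html').map_elements(html_udf, return_dtype=html_struct)
        )
        .unnest('html')  # one output column per dict key
        .sink_csv('test_5_lines_result.csv')  # write the result straight to CSV
    )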
    
    # Here is the content of the resulting CSV file
    a,b,c
    CgoKC,goKIC,CgoKICA8
    CgoKC,goKIC,CgoKICA8
    CgoKC,goKIC,CgoKICA8
    CgoKC,goKIC,CgoKICA8
    

    Update: example with UDF returning a dict.