Search code examples
pythonpython-polars

polars python: split string in lazyframe column into new columns


I could not find any helpful info on that so if anyone has some input.... please....

I need to split all string in a column in a big (ca 30 GB) csv file. For that I tried out polars. Seems to work fine but I dont understand how I can map the values from the list I get after splitting the string into separate columns with lazyframes. Any input?

What I have:


import polars as pl

# Create a LazyFrame 
lf = pl.LazyFrame({"sample": ["ENST123456768.19", "ENST987654321.20", "ENST567890123.21"], "banananana": ["chiqita", "baaan", "banabana"]})
# Define a function to split the "sample" column and create new columns

lf = lf.with_columns(pl.struct(pl.col('sample').str.split('.'))).unnest('sample')

lf.sink_parquet('test.parquet')

test_Df = pl.read_parquet('test.parquet')
print(test_Df.head())

Output:

shape: (3, 2)
┌─────────────────────────┬────────────┐
│ sample                  ┆ banananana │
│ ---                     ┆ ---        │
│ list[str]               ┆ str        │
╞═════════════════════════╪════════════╡
│ ["ENST123456768", "19"] ┆ chiqita    │
│ ["ENST987654321", "20"] ┆ baaan      │
│ ["ENST567890123", "21"] ┆ banabana   │
└─────────────────────────┴────────────┘

What I try to do: the column "sample" should become 2 columns:

┌────────────────────┬────────────┬────────────┐
│ sample             ┆  version   ┆ banananana │
│ ---                ┆  ---       ┆ ---        │
│ str                ┆  str       ┆ str        │
╞════════════════════╪════════════╪════════════╡
│ ENST123456768      ┆ 19         │ chiqita    │
│ ENST987654321      ┆ 20         │ baaan      │
│ ENST567890123      ┆ 21         │ banabana   │
└────────────────────┴────────────┴────────────┘

I thought how hard can it be to achieve with lazyframes. Yeah I already wasted more than 8 hours... So, its pretty complicated I guess? Any help is highly apreciated! Best regards


Solution

  • You can achieve what you want using str.split_exact and then unnest:

    import polars as pl
    
    lf = pl.LazyFrame({"sample": ["ENST123456768.19", "ENST987654321.20", "ENST567890123.21"], "banananana": ["chiqita", "baaan", "banabana"]})
    
    lf = lf.with_columns(pl.col('sample').str.split_exact('.', 1)).unnest("sample")
    lf.sink_parquet('test.parquet')
    
    test_Df = pl.read_parquet('test.parquet')
    print(test_Df.head())
    

    Output:

    shape: (3, 3)
    ┌───────────────┬─────────┬────────────┐
    │ field_0       ┆ field_1 ┆ banananana │
    │ ---           ┆ ---     ┆ ---        │
    │ str           ┆ str     ┆ str        │
    ╞═══════════════╪═════════╪════════════╡
    │ ENST123456768 ┆ 19      ┆ chiqita    │
    │ ENST987654321 ┆ 20      ┆ baaan      │
    │ ENST567890123 ┆ 21      ┆ banabana   │
    └───────────────┴─────────┴────────────┘
    

    You can then easily rename the columns if you need

    Edit:

    As suggested in the comments here is the code to rename the columns and cast the version column to uint:

    ...
    
    lf.with_columns(pl.col('sample').str.split_exact('.', 1)).unnest("sample").rename({'field_0':'sample', 'field_1': 'version'})  # rename the columns
    lf = lf.with_columns(pl.col('version').cast(pl.UInt8))  # cast version to uint
    lf.sink_parquet('test.parquet')
    
    ...