Search code examples
pythonpython-polars

Polars chain multiple operations on select() with values_counts()


I'm working with a Polars dataframe and I want to perform a series of operations using the .select() method. However, I'm facing problems when I try to apply value_counts() followed by unnest() to get separate columns instead of a struct column.

If I just use the method alone, then I don't have any issues:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "State"]).first().over("CustomerID")).unnest("Country")
    .unique(maintain_order=True)

)

But, since I'm doing a series of operations like this:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "Count"]).first().over("CustomerID").unnest("Country"),
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unique(maintain_order=True)
)

I'm facing the following error:

AttributeError: 'Expr' object has no attribute 'unnest'

Example Data :

import datetime
import polars as pl

data = {'InvoiceNo': 
  ['541431',
  'C541433',
  '537626',
  '537626',
  '537626',
  '537626',
  '537626',
  '537626',
  '537626',
  '537626'],
 'StockCode': ['23166',
  '23166',
  '84997D',
  '22729',
  '22492',
  '22727',
  '22774',
  '22195',
  '22805',
  '22771'],
 'Description': ['MEDIUM CERAMIC TOP STORAGE JAR',
  'MEDIUM CERAMIC TOP STORAGE JAR',
  'PINK 3 PIECE POLKADOT CUTLERY SET',
  'ALARM CLOCK BAKELIKE ORANGE',
  'MINI PAINT SET VINTAGE ',
  'ALARM CLOCK BAKELIKE RED ',
  'RED DRAWER KNOB ACRYLIC EDWARDIAN',
  'LARGE HEART MEASURING SPOONS',
  'BLUE DRAWER KNOB ACRYLIC EDWARDIAN',
  'CLEAR DRAWER KNOB ACRYLIC EDWARDIAN'],
 'Quantity': [74215, -74215, 6, 4, 36, 4, 12, 12, 12, 12],
 'InvoiceDate': [datetime.datetime(2011, 1, 18, 10, 1),
  datetime.datetime(2011, 1, 18, 10, 17),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57),
  datetime.datetime(2010, 12, 7, 14, 57)],
 'UnitPrice': [1.0399999618530273,
  1.0399999618530273,
  3.75,
  3.75,
  0.6499999761581421,
  3.75,
  1.25,
  1.649999976158142,
  1.25,
  1.25],
 'CustomerID': ['12346',
  '12346',
  '12347',
  '12347',
  '12347',
  '12347',
  '12347',
  '12347',
  '12347',
  '12347'],
 'Country': ['United Kingdom',
  'United Kingdom',
  'Iceland',
  'Iceland',
  'Iceland',
  'Iceland',
  'Iceland',
  'Iceland',
  'Iceland',
  'Iceland'],
 'Transaction_Status': ['Completed',
  'Cancelled',
  'Completed',
  'Completed',
  'Completed',
  'Completed',
  'Completed',
  'Completed',
  'Completed',
  'Completed']}

df = pl.DataFrame(data)
df

Solution

  • Note that in your first example, you didn't call .unnest() directly on the value_counts() expression, but on the select context.

    This can also be done if the select context contains multiple expressions.

    (
        df
        .select(
            pl.col("CustomerID"),
            pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "State"]).first().over("CustomerID"),
            Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
        )
        .unnest("Country")
        .unique(maintain_order=True)
    )
    
    shape: (2, 4)
    ┌────────────┬────────────────┬───────┬──────────────────────────┐
    │ CustomerID ┆ Country        ┆ State ┆ Days_Since_Last_Purchase │
    │ ---        ┆ ---            ┆ ---   ┆ ---                      │
    │ str        ┆ str            ┆ u32   ┆ duration[μs]             │
    ╞════════════╪════════════════╪═══════╪══════════════════════════╡
    │ 12346      ┆ United Kingdom ┆ 2     ┆ 0µs                      │
    │ 12347      ┆ Iceland        ┆ 8     ┆ 41d 19h 20m              │
    └────────────┴────────────────┴───────┴──────────────────────────┘