python-polars

What is the exact meaning of the `pl.col("")` expression with an empty string argument?


The example in the 'list context' section of the polars-book uses the pl.col("") expression, with an empty string "" as the argument:

# the percentage rank expression
rank_pct = pl.col("").rank(descending=True) / pl.col("").count()

From the context and the output I can guess what the pl.col("") expression does. But the API documentation does not seem to cover the case of an empty string as the argument to pl.col, and I would like to know its precise meaning in this use case. Any helpful answer is greatly appreciated!


Solution

  • The precise meaning is to act as a 'root' Expression to start a chain of Expressions inside a List context, i.e., inside list.eval(...). I'll need to take a step back to explain...

    'Root' Expressions

    In general, only certain types of Expressions are allowed to start (or be the 'root' of) an Expression. These 'root' Expressions work with a particular context (select, filter, with_columns, etc.) to identify what data is being addressed.

    Some examples of root Expressions are polars.col, polars.map_batches, polars.map_groups, polars.first, polars.last, polars.all_horizontal, and polars.any_horizontal. (There are others.)

    Once we declare a "root" Expression, we can then chain other, more-generic Expressions to perform work. For example, polars.col("my_col").sum().over('other_col').alias('name').

    The List context

    A List context is slightly different from most contexts. In a List context, there is no ambiguity as to what data is being addressed. There is only a list of data. As such, polars.col and polars.first were chosen as "root" Expressions to use within a List context.

    Normally, a polars.col root Expression contains information such as a string to denote a column name or a wildcard expression to denote multiple columns. However, this is not needed in a List context. There is only one option - the single list itself.

    As such, any string provided to polars.col is ignored in a List context. For example, taking the code from the Polars Guide, this variant also works:

    # Notice that I'm referring to columns that do not exist...
    rank_pct = pl.col("foo").rank(descending=True) / pl.col("bar").count()
    

    Since any string provided to a polars.col Expression will be ignored in a List context, a single empty string "" is often supplied, just to prevent unnecessary clutter.

    Edit: New polars.element expression

    Polars now has a polars.element expression designed for use in list evaluation contexts. Using polars.element is now considered idiomatic for list contexts, as it avoids the confusion associated with using col("").