
Add a new column into an existing Polars dataframe


I want to add a column new_column to an existing dataframe df. I know this looks like a duplicate of

Add new column to polars DataFrame

but the answers to that question, like the answers to many similar questions, don't really add a column to an existing dataframe. They create a new column as part of another dataframe. I think this can be fixed like this:

df = df.with_columns(
    new_column = pl.lit('some_text')
)

However, rewriting the whole dataframe just to add a few columns seems like a bit of a waste to me. Is this the right approach?


Solution

  • Your question suggests that you think that when you do

    df = df.with_columns(
        new_column = pl.lit('some_text')
    )
    

    you're copying everything over to some new df, which would be really inefficient.

    You're right that that would be really inefficient, but that isn't what happens. A DataFrame is just a way to organize pointers to the actual data. At the top of the hierarchy are DataFrames. Within a DataFrame are Series, which are how columns are represented. Even at the Series level, it's still just pointers, not data. Each Series is backed by a chunked array: one or more chunks that follow the Apache Arrow memory model, and those chunks hold the actual data.

    When you "make a new df" all you're doing is organizing pointers, not data. The data doesn't move or copy.

    Conversely, consider pandas's inplace parameter. It certainly makes it seem like you're modifying things in place and not making copies. However, as noted in the pandas issue tracker:

    "inplace does not generally do anything inplace but makes a copy and reassigns the pointer"

    https://github.com/pandas-dev/pandas/issues/16529#issuecomment-323890422

    The crux of the issue is that in pandas nearly everything you do makes a copy (or several). In Polars, that isn't the case, so even when you assign a new df, that new df is just an outer layer that points to the data. The data doesn't move, nor is it copied, unless you specifically execute an operation that does so.

    That said, there are methods that will insert columns without requiring the df = df... syntax, but under the hood they don't do anything different from the preferred assignment syntax.