How is Python Polars treating the index?


I am trying out Polars in Python and want to concatenate several DataFrames that are read from JSON files. When I set the index to date and look at lala1.head(), the date column is gone, so I effectively lose the index. Is there a better solution, or do I need to sort by date, which basically does the same as setting the index to date?

import polars as pl

quarterly_balance_df = pl.read_json('../AAPL/single_statements/1985-09-30-quarterly_balance.json')


# Parse the three date-like string columns in a single pass
quarterly_balance_df = quarterly_balance_df.with_columns(
    pl.col(["date", "fillingDate", "acceptedDate"]).str.to_date()
)

quarterly_balance_df2 = pl.read_json('../AAPL/single_statements/1986-09-30-quarterly_balance.json')

# Same date parsing for the second statement
quarterly_balance_df2 = quarterly_balance_df2.with_columns(
    pl.col(["date", "fillingDate", "acceptedDate"]).str.to_date()
)

lala1 = pl.from_pandas(quarterly_balance_df.to_pandas().set_index('date'))
lala2 = pl.from_pandas(quarterly_balance_df2.to_pandas().set_index('date'))

test = pl.concat([lala1,lala2])

Solution

  • Polars intentionally eliminates the concept of an index.

    From the "Coming from Pandas" section in the User Guide:

    Polars aims to have predictable results and readable queries, as such we think an index does not help us reach that objective.

    Indeed, the from_pandas method drops the pandas index by default. For example, if we start with this data:

    import polars as pl
    
    df = pl.DataFrame(
        {
            "key": [1, 2],
            "var1": ["a", "b"],
            "var2": ["r", "s"],
        }
    )
    print(df)
    
    shape: (2, 3)
    ┌─────┬──────┬──────┐
    │ key ┆ var1 ┆ var2 │
    │ --- ┆ ---  ┆ ---  │
    │ i64 ┆ str  ┆ str  │
    ╞═════╪══════╪══════╡
    │ 1   ┆ a    ┆ r    │
    │ 2   ┆ b    ┆ s    │
    └─────┴──────┴──────┘
    

    Now, if we export this Polars DataFrame to pandas, set key as the index there, and re-import it into Polars, the key column disappears.

    pl.from_pandas(df.to_pandas().set_index("key"))
    
    shape: (2, 2)
    ┌──────┬──────┐
    │ var1 ┆ var2 │
    │ ---  ┆ ---  │
    │ str  ┆ str  │
    ╞══════╪══════╡
    │ a    ┆ r    │
    │ b    ┆ s    │
    └──────┴──────┘
    

    This is why your date column disappeared.

    In Polars, you can sort, summarize, or join by any set of columns in a DataFrame. No need to declare an index.

    I recommend looking through the Polars User Guide. It is a great place to start, and it has a section for those coming from Pandas.