I want to try out Polars in Python, so I am trying to concatenate several DataFrames that are read from JSON files. When I set the index to date and look at lala1.head(), the date column is gone, so I basically lose the index. Is there a better solution, or do I need to sort by date, which basically does the same as setting the index to date?
import polars as pl
quarterly_balance_df = pl.read_json('../AAPL/single_statements/1985-09-30-quarterly_balance.json')
# Parse the three string columns into dates in a single step.
quarterly_balance_df = quarterly_balance_df.with_columns(
    pl.col("date", "fillingDate", "acceptedDate").str.to_date()
)
quarterly_balance_df2 = pl.read_json('../AAPL/single_statements/1986-09-30-quarterly_balance.json')
# Same date parsing for the second statement.
quarterly_balance_df2 = quarterly_balance_df2.with_columns(
    pl.col("date", "fillingDate", "acceptedDate").str.to_date()
)
lala1 = pl.from_pandas(quarterly_balance_df.to_pandas().set_index('date'))
lala2 = pl.from_pandas(quarterly_balance_df2.to_pandas().set_index('date'))
test = pl.concat([lala1,lala2])
Polars intentionally eliminates the concept of an index.
From the "Coming from Pandas" section in the User Guide:
Polars aims to have predictable results and readable queries, as such we think an index does not help us reach that objective.
Indeed, the from_pandas function drops any pandas index by default (recent Polars releases accept include_index=True to keep it). For example, if we start with this data:
import polars as pl
df = pl.DataFrame(
{
"key": [1, 2],
"var1": ["a", "b"],
"var2": ["r", "s"],
}
)
print(df)
shape: (2, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ r    │
│ 2   ┆ b    ┆ s    │
└─────┴──────┴──────┘
Now, if we export this Polars DataFrame to pandas, set key as the index in pandas, and then re-import it into Polars, the key column disappears.
pl.from_pandas(df.to_pandas().set_index("key"))
shape: (2, 2)
┌──────┬──────┐
│ var1 ┆ var2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ r    │
│ b    ┆ s    │
└──────┴──────┘
This is why your date column disappeared: once it became the pandas index, from_pandas no longer treated it as a column.
In Polars, you can sort, summarize, or join by any set of columns in a DataFrame. No need to declare an index.
I recommend looking through the Polars User Guide. It's a great place to start. And there's a section for those coming from Pandas.