I am trying to use python polars over pandas sql for a large dataframe as I am running into memory errors. There are two where conditions that are utilized in this dataframe but can't get the syntax right.
Here is what the data looks like:
Key | Field | DateColumn |
---|---|---|
1234 | Plumb | 2020-02-01 |
1234 | Plumb | 2020-03-01 |
1234 | Pear | 2020-04-01 |
import pandas as pd
import datetime as dt
import pandasql as ps
d = {'Key':
[1234, 1234, 1234, 1234, 1234, 1234, 1234,
1234, 1234, 1234, 1234, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754],
'Field':[
"Plumb", "Plumb", "Pear", "Plumb", "Orange", "Pear", "Plumb", "Plumb", "Pear", "Apple", "Plumb", "Orange", "Orange", "Apple", "Apple", "Pear", "Apple", "Plumb", "Plumb", "Orange", "Orange", "Pear", "Plumb", "Pear", "Plumb", "Pear", "Apple", "Plumb", "Orange", "Pear", "Apple", "Pear", "Apple"],
'DateColumn':[
'2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01', '2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01', '2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01'
]}
df = pd.DataFrame(data=d)
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df['PreviousMonth'] = df['DateColumn'] - pd.DateOffset(months=1)
df_output = ps.sqldf("""
select
a.Key
,a.Field
,b.Field as PreviousField
,a.DateColumn
,b.DateColumn as PreviousDate
from df as a, df as b
where a.Key = b.Key
and b.DateColumn = a.PreviousMonth
""")
print(df_output.head())
Key Field DateColumn PreviousDate
0 1234 Plumb 2020-03-01 00:00:00.000000 2020-02-01 00:00:00.000000
1 1234 Pear 2020-04-01 00:00:00.000000 2020-03-01 00:00:00.000000
2 1234 Plumb 2020-05-01 00:00:00.000000 2020-04-01 00:00:00.000000
3 1234 Orange 2020-06-01 00:00:00.000000 2020-05-01 00:00:00.000000
4 1234 Pear 2020-07-01 00:00:00.000000 2020-06-01 00:00:00.000000
I have tried to do
data_output = df.join(df, left_on='Key', right_on='Key')
But unable to find a good example on how to put the two conditions on the join condition.
Let's accomplish everything using Polars.
import polars as pl
df = (
pl.DataFrame(d)
.with_columns(
pl.col('DateColumn').str.to_date()
)
)
(
df
.join(
df
.with_columns(pl.col('DateColumn').alias('PreviousDate'))
.rename({'Field': 'PreviousField'}),
left_on=['Key', 'DateColumn'],
right_on=['Key', pl.col('DateColumn').dt.offset_by('1mo')],
how="inner"
)
)
shape: (30, 5)
┌──────┬────────┬────────────┬───────────────┬──────────────┐
│ Key ┆ Field ┆ DateColumn ┆ PreviousField ┆ PreviousDate │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ date ┆ str ┆ date │
╞══════╪════════╪════════════╪═══════════════╪══════════════╡
│ 1234 ┆ Plumb ┆ 2020-03-01 ┆ Plumb ┆ 2020-02-01 │
│ 1234 ┆ Pear ┆ 2020-04-01 ┆ Plumb ┆ 2020-03-01 │
│ 1234 ┆ Plumb ┆ 2020-05-01 ┆ Pear ┆ 2020-04-01 │
│ 1234 ┆ Orange ┆ 2020-06-01 ┆ Plumb ┆ 2020-05-01 │
│ 1234 ┆ Pear ┆ 2020-07-01 ┆ Orange ┆ 2020-06-01 │
│ 1234 ┆ Plumb ┆ 2020-08-01 ┆ Pear ┆ 2020-07-01 │
│ 1234 ┆ Plumb ┆ 2020-09-01 ┆ Plumb ┆ 2020-08-01 │
│ 1234 ┆ Pear ┆ 2020-10-01 ┆ Plumb ┆ 2020-09-01 │
│ 1234 ┆ Apple ┆ 2020-11-01 ┆ Pear ┆ 2020-10-01 │
│ 1234 ┆ Plumb ┆ 2020-12-01 ┆ Apple ┆ 2020-11-01 │
│ 2456 ┆ Orange ┆ 2020-03-01 ┆ Orange ┆ 2020-02-01 │
│ 2456 ┆ Apple ┆ 2020-04-01 ┆ Orange ┆ 2020-03-01 │
│ 2456 ┆ Apple ┆ 2020-05-01 ┆ Apple ┆ 2020-04-01 │
│ 2456 ┆ Pear ┆ 2020-06-01 ┆ Apple ┆ 2020-05-01 │
│ 2456 ┆ Apple ┆ 2020-07-01 ┆ Pear ┆ 2020-06-01 │
│ 2456 ┆ Plumb ┆ 2020-08-01 ┆ Apple ┆ 2020-07-01 │
│ 2456 ┆ Plumb ┆ 2020-09-01 ┆ Plumb ┆ 2020-08-01 │
│ 2456 ┆ Orange ┆ 2020-10-01 ┆ Plumb ┆ 2020-09-01 │
│ 2456 ┆ Orange ┆ 2020-11-01 ┆ Orange ┆ 2020-10-01 │
│ 2456 ┆ Pear ┆ 2020-12-01 ┆ Orange ┆ 2020-11-01 │
│ 3754 ┆ Pear ┆ 2020-03-01 ┆ Plumb ┆ 2020-02-01 │
│ 3754 ┆ Plumb ┆ 2020-04-01 ┆ Pear ┆ 2020-03-01 │
│ 3754 ┆ Pear ┆ 2020-05-01 ┆ Plumb ┆ 2020-04-01 │
│ 3754 ┆ Apple ┆ 2020-06-01 ┆ Pear ┆ 2020-05-01 │
│ 3754 ┆ Plumb ┆ 2020-07-01 ┆ Apple ┆ 2020-06-01 │
│ 3754 ┆ Orange ┆ 2020-08-01 ┆ Plumb ┆ 2020-07-01 │
│ 3754 ┆ Pear ┆ 2020-09-01 ┆ Orange ┆ 2020-08-01 │
│ 3754 ┆ Apple ┆ 2020-10-01 ┆ Pear ┆ 2020-09-01 │
│ 3754 ┆ Pear ┆ 2020-11-01 ┆ Apple ┆ 2020-10-01 │
│ 3754 ┆ Apple ┆ 2020-12-01 ┆ Pear ┆ 2020-11-01 │
└──────┴────────┴────────────┴───────────────┴──────────────┘
Note that we use an Expression in the right_on
columns to generate our offset date column on-the-fly.