One of the key data analytics outputs is tables. And to analyse large research databases, we frequently create nested tables with two or more levels of nesting in rows and / or columns. I could create the nested tables in Pandas, but don't know if they can be created in Polars.
I am using the database from Kaggle called 'Home Loan Approval'. It's URL is https://www.kaggle.com/datasets/rishikeshkonapure/home-loan-approval.
Sample:
Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
LP001015,Male,Yes,0,Graduate,No,5720,0,110,360,1,Urban
LP001022,Male,Yes,1,Graduate,No,3076,1500,126,360,1,Urban
LP001031,Male,Yes,2,Graduate,No,5000,1800,208,360,1,Urban
LP001035,Male,Yes,2,Graduate,No,2340,2546,100,360,,Urban
LP001051,Male,No,0,Not Graduate,No,3276,0,78,360,1,Urban
LP001054,Male,Yes,0,Not Graduate,Yes,2165,3422,152,360,1,Urban
LP001055,Female,No,1,Not Graduate,No,2226,0,59,360,1,Semiurban
LP001056,Male,Yes,2,Not Graduate,No,3881,0,147,360,0,Rural
LP001059,Male,Yes,2,Graduate,,13633,0,280,240,1,Urban
LP001067,Male,No,0,Not Graduate,No,2400,2400,123,360,1,Semiurban
LP001078,Male,No,0,Not Graduate,No,3091,0,90,360,1,Urban
LP001082,Male,Yes,1,Graduate,,2185,1516,162,360,1,Semiurban
LP001083,Male,No,3+,Graduate,No,4166,0,40,180,,Urban
LP001094,Male,Yes,2,Graduate,,12173,0,166,360,0,Semiurban
LP001096,Female,No,0,Graduate,No,4666,0,124,360,1,Semiurban
LP001099,Male,No,1,Graduate,No,5667,0,131,360,1,Urban
I use the following code in Pandas to create a nested table with 'Gender' and 'Married' in rows and 'Property_Area' and 'Self_Employed' in columns
(
pd.read_csv('loan_sanction_test.csv')
.groupby(['Gender', 'Married', 'Property_Area', 'Self_Employed'])['LoanAmount'].sum()
.unstack([2,3])
)
Property_Area Rural Semiurban Urban
Self_Employed No Yes No Yes No Yes
Gender Married
Female No 1062.0 389.0 1511.0 295.0 1716.0 NaN
Yes 596.0 NaN 1195.0 NaN 1610.0 NaN
Male No 2398.0 534.0 3110.0 125.0 3219.0 232.0
Yes 7178.0 1692.0 6455.0 1073.0 9216.0 1011.0
In Polars, we don't have multi-indexes like Pandas. Polars pivot
function only pivots to a single level. For e.g. if I use ['Property_Area' and 'Self_Employed'] in columns, pivot will reproduce these variables one after another and not nest them. Here's the code which illustrates this (using Polars version 0.16):
(
pl.read_csv('loan_sanction_test.csv')
.group_by('Gender', 'Married', 'Property_Area', 'Self_Employed').agg(pl.col('LoanAmount').sum())
.pivot(on=['Property_Area', 'Self_Employed'], index=['Gender', 'Married'], values='LoanAmount')
)
We frequently use three level deep nesting in rows as well as columns. Is there a way to generate nested tables in Polars
like Pandas
example above?
One thing you can try when nesting columns is concat_str
. For example: (My firewall blocks Kaggle downloads so I've simulated data.)
(
df
.sort('Property_Area', 'Self_Employed')
.with_columns(
pl.concat_str('Property_Area', 'Self_Employed', separator="|").alias('PA_SE')
)
.pivot(
on=["PA_SE"],
index=["Gender", "Married"],
values="LoanAmount",
aggregate_function=pl.len()
)
)
shape: (4, 8)
+--------+---------+----------+-----------+--------------+---------------+----------+-----------+
| Gender | Married | Rural|No | Rural|Yes | Semiurban|No | Semiurban|Yes | Urban|No | Urban|Yes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | u32 | u32 | u32 | u32 | u32 | u32 |
+===============================================================================================+
| Female | No | 38 | 35 | 54 | 37 | 40 | 41 |
| Female | Yes | 46 | 45 | 38 | 45 | 45 | 38 |
| Male | Yes | 40 | 39 | 47 | 44 | 39 | 40 |
| Male | No | 34 | 48 | 44 | 48 | 45 | 30 |
+--------+---------+----------+-----------+--------------+---------------+----------+-----------+
It's not as slick as Pandas, but it does get you to an answer.