Search code examples
python-polars

Nested Tables in Python Polars


One of the key data analytics outputs is tables. And to analyse large research databases, we frequently create nested tables with two or more levels of nesting in rows and / or columns. I could create the nested tables in Pandas, but don't know if they can be created in Polars.

I am using the database from Kaggle called 'Home Loan Approval'. It's URL is https://www.kaggle.com/datasets/rishikeshkonapure/home-loan-approval.

Sample:

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
LP001015,Male,Yes,0,Graduate,No,5720,0,110,360,1,Urban
LP001022,Male,Yes,1,Graduate,No,3076,1500,126,360,1,Urban
LP001031,Male,Yes,2,Graduate,No,5000,1800,208,360,1,Urban
LP001035,Male,Yes,2,Graduate,No,2340,2546,100,360,,Urban
LP001051,Male,No,0,Not Graduate,No,3276,0,78,360,1,Urban
LP001054,Male,Yes,0,Not Graduate,Yes,2165,3422,152,360,1,Urban
LP001055,Female,No,1,Not Graduate,No,2226,0,59,360,1,Semiurban
LP001056,Male,Yes,2,Not Graduate,No,3881,0,147,360,0,Rural
LP001059,Male,Yes,2,Graduate,,13633,0,280,240,1,Urban
LP001067,Male,No,0,Not Graduate,No,2400,2400,123,360,1,Semiurban
LP001078,Male,No,0,Not Graduate,No,3091,0,90,360,1,Urban
LP001082,Male,Yes,1,Graduate,,2185,1516,162,360,1,Semiurban
LP001083,Male,No,3+,Graduate,No,4166,0,40,180,,Urban
LP001094,Male,Yes,2,Graduate,,12173,0,166,360,0,Semiurban
LP001096,Female,No,0,Graduate,No,4666,0,124,360,1,Semiurban
LP001099,Male,No,1,Graduate,No,5667,0,131,360,1,Urban

I use the following code in Pandas to create a nested table with 'Gender' and 'Married' in rows and 'Property_Area' and 'Self_Employed' in columns

(
    pd.read_csv('loan_sanction_test.csv')
    .groupby(['Gender', 'Married', 'Property_Area', 'Self_Employed'])['LoanAmount'].sum()
    .unstack([2,3])
)
Property_Area    Rural         Semiurban           Urban        
Self_Employed       No     Yes        No     Yes      No     Yes
Gender Married                                                  
Female No       1062.0   389.0    1511.0   295.0  1716.0     NaN
       Yes       596.0     NaN    1195.0     NaN  1610.0     NaN
Male   No       2398.0   534.0    3110.0   125.0  3219.0   232.0
       Yes      7178.0  1692.0    6455.0  1073.0  9216.0  1011.0

In Polars, we don't have multi-indexes like Pandas. Polars pivot function only pivots to a single level. For e.g. if I use ['Property_Area' and 'Self_Employed'] in columns, pivot will reproduce these variables one after another and not nest them. Here's the code which illustrates this (using Polars version 0.16):

(
    pl.read_csv('loan_sanction_test.csv')
    .group_by('Gender', 'Married', 'Property_Area', 'Self_Employed').agg(pl.col('LoanAmount').sum())
    .pivot(on=['Property_Area', 'Self_Employed'], index=['Gender', 'Married'], values='LoanAmount')
)

We frequently use three level deep nesting in rows as well as columns. Is there a way to generate nested tables in Polars like Pandas example above?


Solution

  • One thing you can try when nesting columns is concat_str. For example: (My firewall blocks Kaggle downloads so I've simulated data.)

    (
        df
        .sort('Property_Area', 'Self_Employed')
        .with_columns(
            pl.concat_str('Property_Area', 'Self_Employed', separator="|").alias('PA_SE')
        )
        .pivot(
            on=["PA_SE"],
            index=["Gender", "Married"],
            values="LoanAmount",
            aggregate_function=pl.len()
        )
    )
    
    shape: (4, 8)
    +--------+---------+----------+-----------+--------------+---------------+----------+-----------+
    | Gender | Married | Rural|No | Rural|Yes | Semiurban|No | Semiurban|Yes | Urban|No | Urban|Yes |
    | ---    | ---     | ---      | ---       | ---          | ---           | ---      | ---       |
    | str    | str     | u32      | u32       | u32          | u32           | u32      | u32       |
    +===============================================================================================+
    | Female | No      | 38       | 35        | 54           | 37            | 40       | 41        |
    | Female | Yes     | 46       | 45        | 38           | 45            | 45       | 38        |
    | Male   | Yes     | 40       | 39        | 47           | 44            | 39       | 40        |
    | Male   | No      | 34       | 48        | 44           | 48            | 45       | 30        |
    +--------+---------+----------+-----------+--------------+---------------+----------+-----------+
    

    It's not as slick as Pandas, but it does get you to an answer.