Search code examples
pythondataframepysparkfilteringdata-manipulation

Create subsets in Python based on special conditions


I am currently doing some data manipulation procedures and have run into a problem of how to make subsets based on special conditions.

My example (dataframe) is like this:

Name    ID     ContractDate LoanSum DurationOfDelay
A       ID1    2023-01-01   10      0 
A       ID1    2023-01-03   15      0
A       ID1    2022-12-29   20      35
A       ID1    2022-12-28   40      91

B       ID2    2023-01-05   15      0
B       ID2    2023-01-10   30      100
B       ID2    2023-01-07   35      40
B       ID2    2023-01-06   35      0

C       ID3    2023-01-09   20      0
C       ID3    2023-01-07   30      0
C       ID3    2023-01-11   35      0

My goal is to create two different subsets (two new dataframes):

  1. Create a table that includes the loans only issued last

Expected result:

Name    ID     ContractDate LoanSum DurationOfDelay
A       ID1    2023-01-03   15      0
B       ID2    2023-01-10   30      100
C       ID3    2023-01-11   35      0
  1. Group the data in such a way that for each borrower only the loan issued first with DurationOfDelay > 0 is returned

Expected result:

Name    ID     ContractDate LoanSum DurationOfDelay
A       ID1    2022-12-28   40      91
B       ID2    2023-01-07   35      40

Would you be so kind to help me achieve these results? Any kind of help is highly appreciated!


Solution

  • window_spec = Window.partitionBy('Name', 'ID').orderBy(F.desc('ContractDate'))
    subset1 = df.withColumn('row_num', F.row_number().over(window_spec)) \
                .filter(F.col('row_num') == 1) \
                .drop('row_num')
    subset1.show()
    window_spec_subset2 = Window.partitionBy('Name', 'ID').orderBy('ContractDate')
    subset2 = df.filter(F.col('DurationOfDelay') > 0) \
                .withColumn('row_num', F.row_number().over(window_spec_subset2)) \
                .filter(F.col('row_num') == 1) \
                .drop('row_num')
    subset2.show()
    
    +----+---+------------+-------+---------------+
    |Name| ID|ContractDate|LoanSum|DurationOfDelay|
    +----+---+------------+-------+---------------+
    |   A|ID1|  2023-01-03|     15|              0|
    |   B|ID2|  2023-01-10|     30|            100|
    |   C|ID3|  2023-01-11|     35|              0|
    +----+---+------------+-------+---------------+
    
    +----+---+------------+-------+---------------+
    |Name| ID|ContractDate|LoanSum|DurationOfDelay|
    +----+---+------------+-------+---------------+
    |   A|ID1|  2022-12-28|     40|             91|
    |   B|ID2|  2023-01-07|     35|             40|
    +----+---+------------+-------+---------------+