I am currently doing some data manipulation procedures and have run into a problem of how to make subsets based on special conditions.
My example (dataframe) is like this:
Name ID ContractDate LoanSum DurationOfDelay
A ID1 2023-01-01 10 0
A ID1 2023-01-03 15 0
A ID1 2022-12-29 20 35
A ID1 2022-12-28 40 91
B ID2 2023-01-05 15 0
B ID2 2023-01-10 30 100
B ID2 2023-01-07 35 40
B ID2 2023-01-06 35 0
C ID3 2023-01-09 20 0
C ID3 2023-01-07 30 0
C ID3 2023-01-11 35 0
My goal is to create two different subsets (two new dataframes):
Expected result:
Name ID ContractDate LoanSum DurationOfDelay
A ID1 2023-01-03 15 0
B ID2 2023-01-10 30 100
C ID3 2023-01-11 35 0
Expected result:
Name ID ContractDate LoanSum DurationOfDelay
A ID1 2022-12-28 40 91
B ID2 2023-01-07 35 40
Would you be so kind to help me achieve these results? Any kind of help is highly appreciated!
window_spec = Window.partitionBy('Name', 'ID').orderBy(F.desc('ContractDate'))
subset1 = df.withColumn('row_num', F.row_number().over(window_spec)) \
.filter(F.col('row_num') == 1) \
.drop('row_num')
subset1.show()
window_spec_subset2 = Window.partitionBy('Name', 'ID').orderBy('ContractDate')
subset2 = df.filter(F.col('DurationOfDelay') > 0) \
.withColumn('row_num', F.row_number().over(window_spec_subset2)) \
.filter(F.col('row_num') == 1) \
.drop('row_num')
subset2.show()
+----+---+------------+-------+---------------+
|Name| ID|ContractDate|LoanSum|DurationOfDelay|
+----+---+------------+-------+---------------+
| A|ID1| 2023-01-03| 15| 0|
| B|ID2| 2023-01-10| 30| 100|
| C|ID3| 2023-01-11| 35| 0|
+----+---+------------+-------+---------------+
+----+---+------------+-------+---------------+
|Name| ID|ContractDate|LoanSum|DurationOfDelay|
+----+---+------------+-------+---------------+
| A|ID1| 2022-12-28| 40| 91|
| B|ID2| 2023-01-07| 35| 40|
+----+---+------------+-------+---------------+