Search code examples
pythonpysparksubsetdata-manipulationrow-number

Display the row index number given conditions


I am currently trying to practice with some data manipulation procedures and have faced with the problem of how to make subset based on special condition.

Let's assume that the dataframe looks like this:

Name    ID     ContractDate LoanSum DurationOfDelay
A       ID1    2023-01-01   10      10 
A       ID1    2023-01-03   15      15
A       ID1    2022-12-29   20      0
A       ID1    2022-12-28   40      0
B       ID2    2023-01-05   15      19
B       ID2    2023-01-10   30      0
B       ID2    2023-01-07   35      25
B       ID2    2023-01-06   35      0

My goal is to display for each unique ID (or Name) the index number of the loan issued first with DurationOfDelay > 0

Expected result:

Name    ID     IndexNum
A       ID1    3 
B       ID2    1

Explanation: For ID1 four loans were issued: on 2022-12-28, 2022-12-29, 2023-01-01 and 2023-01-03. We can identify the existence of DurationOfDelay > 0 first on 2023-01-01, and this is the third loan issued to the borrower.

For ID2 also four loans were issued: on 2023-01-05, 2023-01-06, 2023-01-07 and 2023-01-10. We can identify the existence of DurationOfDelay > 0 first on 2023-01-05, and this is the firstloan issued to the borrower.

What I have done so far:

window_spec_subset = Window.partitionBy('ID').orderBy('ContractDate')
subset = df.filter(F.col('DurationOfDelay') > 0) \
                .withColumn('row_num', F.row_number().over(window_spec_subset)) \
                .filter(F.col('row_num') == 1) \
                .drop('row_num')
subset.show()

+----+---+------------+-------+---------------+
|Name| ID|ContractDate|LoanSum|DurationOfDelay|
+----+---+------------+-------+---------------+
|   A|ID1|  2023-01-01|     10|             10|
|   B|ID2|  2023-01-05|     15|             19|
+----+---+------------+-------+---------------+

This code allows me to group the data in such a way that for each borrower only the loan issued first with DurationOfDelay > 0 is returned.

But I'm stacked to display the index number of the loan issued first with DurationOfDelay > 0 instead.

Would you be so kind to help me achieve these results? Any kind of help is highly appreciated!


Solution

  • Probably not the most efficient but it should work:

    subset = (df.withColumn('IndexNum', F.row_number().over(window_spec_subset))
                .filter(F.col('DurationOfDelay') > 0)
                .withColumn('row_num', F.row_number().over(window_spec_subset))
                .filter(F.col('row_num') == 1)
                .drop('row_num'))
    

    Output:

    >>> subset.show()
    +----+---+------------+-------+---------------+--------+
    |Name| ID|ContractDate|LoanSum|DurationOfDelay|IndexNum|
    +----+---+------------+-------+---------------+--------+
    |   A|ID1|  2023-01-01|     10|             10|       3|
    |   B|ID2|  2023-01-05|     15|             19|       1|
    +----+---+------------+-------+---------------+--------+