
Calculate by how much a row has shifted horizontally in pandas dataframe


I have a dataframe whose rows have been shifted horizontally by an unknown amount. Each row has shifted by a different amount, as shown below:

Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
      NaN         34          24         NaN
        5        NaN         NaN         NaN
      NaN        NaN          13          77
      NaN        NaN         NaN          18

In the above dataframe there are only 2 original columns (Heading 1 and Heading 2), but due to the row shift (in rows 1 and 3), extra columns with the default names Unnamed: 1 and Unnamed: 2 have been created.

Now for each row, I want to calculate:

1.) The spillover. Spillover is the number of non-NaN values in the extra (Unnamed) columns. For example, in row 1 there is one non-NaN value in the extra columns (Unnamed: 1), so the spillover is 1. In row 2 there are no non-NaN values in the extra columns, so the spillover is 0. In row 3 there are 2 non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2), so the spillover is 2, and in row 4 there is 1 non-NaN value in the extra columns, so the spillover is 1.

2.) The number of NaN values in the original columns (Heading 1 and Heading 2). For example, in row 1 the number of NaN values in the original columns is 1, in row 2 it is 1 (Heading 2 is NaN), in row 3 it is 2, and in row 4 it is 2.

So basically, for each row I have to calculate the number of NaN values in the original columns (Heading 1 and Heading 2) and the number of non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2).
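The two counts described above can be sketched directly with `notna()` / `isnull()` row sums. A minimal illustration, reconstructing the sample table from the question (column values assumed from the table shown above):

```python
import numpy as np
import pandas as pd

# sample dataframe reproducing the table above
df = pd.DataFrame({
    "Heading 1":  [np.nan, 5, np.nan, np.nan],
    "Heading 2":  [34, np.nan, np.nan, np.nan],
    "Unnamed: 1": [24, np.nan, 13, np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77, 18],
})

extra = df.filter(regex="Unnamed:.*")       # the extra (Unnamed) columns
original = df.drop(columns=extra.columns)   # the original columns

# spillover: non-NaN values per row in the extra columns
spillover = extra.notna().sum(axis=1).tolist()
# NaN values per row in the original columns
nan_in_original = original.isnull().sum(axis=1).tolist()

print(spillover)        # [1, 0, 2, 1]
print(nan_in_original)  # [1, 1, 2, 2]
```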

I can get the number of extra columns (Unnamed: 1 and so on) present in a dataframe with:

len(df.filter(regex="Unnamed:.*").columns)

Thank you!


Solution

  • Updated Answer

The logic that @mozway gave was an elegant one-liner which I liked a lot, but for some reason it does not always work. It also does not give the non-NaN values in the extra columns.

I managed to get it working with slightly longer but relatively easy-to-understand logic. Here goes:

    import pandas as pd

    # read the Excel file
    df = pd.read_excel('df.xlsx')

    # split the df into the extra (Unnamed) columns and the original columns
    extra = df.filter(regex="Unnamed:.*")
    original = df.drop(columns=extra.columns)

    # ori contains a list of the count of NaN values in the original columns, as asked
    ori = original.isnull().sum(axis=1).tolist()  # or .to_dict() if you want a dict

    # ext1 contains a list of the count of non-NaN values in the extra columns, as asked
    ext = len(extra.columns) - extra.isnull().sum(axis=1)
    ext1 = ext.tolist()  # or .to_dict() if you want a dict
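As a side note, the subtraction in the last step can be collapsed: `len(extra.columns) - extra.isnull().sum(axis=1)` counts the non-NaN cells per row, which is exactly what `notna().sum(axis=1)` does in one call. A small equivalent sketch, using the extra columns assumed from the sample table:

```python
import numpy as np
import pandas as pd

# extra (Unnamed) columns, values taken from the sample table
extra = pd.DataFrame({
    "Unnamed: 1": [24, np.nan, 13, np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77, 18],
})

# non-NaN values per row, in one step
ext1 = extra.notna().sum(axis=1).tolist()
print(ext1)  # [1, 0, 2, 1]
```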
    

    Original comment/answer

@mozway As mentioned in the comments, I am adding the code where I tried to apply your logic to only a subset of the dataframe:

extra = df.filter(regex="Unnamed:.*")
    y = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
    

According to the dataframe, the output should be [1, 2, 0, 1] (there is 1 NaN value in the extra columns in row 1, 2 in row 2, 0 in row 3 and 1 in row 4), but the above code gives [0, 2, 0, 1].
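The gap between the two outputs comes from `cummin`: `isna().cummin(axis=1)` stays True only while every value seen so far in the row is NaN, so the sum counts the leading run of NaNs, not the total. A minimal reproduction, with the extra-column values assumed from the sample table:

```python
import numpy as np
import pandas as pd

# extra (Unnamed) columns, values taken from the sample table
extra = pd.DataFrame({
    "Unnamed: 1": [24, np.nan, 13, np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77, 18],
})

# counts only the run of NaNs at the start of each row:
# row 1 is [24, NaN], so its trailing NaN is not counted
leading_nans = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
# counts every NaN in the row
total_nans = extra.isna().sum(axis=1).tolist()

print(leading_nans)  # [0, 2, 0, 1]
print(total_nans)    # [1, 2, 0, 1]
```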