
Calculate by how much a row has shifted horizontally in pandas dataframe


I have a dataframe whose rows have been shifted horizontally by an unknown amount. Each row has shifted by a different amount, as shown below:

Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
      NaN         34          24         NaN
        5        NaN         NaN         NaN
      NaN        NaN          13          77
      NaN        NaN         NaN          18

In the above dataframe there are only 2 original columns (Heading 1 and Heading 2), but due to the row shift (in rows 1 and 3), extra columns with the default names Unnamed: 1 and Unnamed: 2 have been created.

Now for each row, I want to calculate:

1.) The spillover. Spillover is the number of non-NaN values in the extra (Unnamed) columns. For example, in row 1 there is one non-NaN value in the extra columns (Unnamed: 1), so the spillover is 1. In row 2 there are no non-NaN values in the extra columns, so the spillover is 0. In row 3 there are 2 non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2), so the spillover is 2, and in row 4 there is 1 non-NaN value in the extra columns, so the spillover is 1.

2.) The number of NaN values in the original columns (Heading 1 and Heading 2). For example, in row 1 the number of NaN values in the original columns is 1, in row 2 it is 1 (Heading 2 is NaN), in row 3 it is 2, and in row 4 it is 2.

So basically, for each row I have to calculate the number of NaN values in the original columns (Heading 1 and Heading 2) and the number of non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2).
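The two counts described above can be sketched directly with `notna()` / `isnull()` row sums. A minimal illustration, reconstructing the sample table from the question (column values assumed from the table shown above):

```python
import numpy as np
import pandas as pd

# sample dataframe reproducing the table above
df = pd.DataFrame({
    "Heading 1":  [np.nan, 5, np.nan, np.nan],
    "Heading 2":  [34, np.nan, np.nan, np.nan],
    "Unnamed: 1": [24, np.nan, 13, np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77, 18],
})

extra = df.filter(regex="Unnamed:.*")       # the extra (Unnamed) columns
original = df.drop(columns=extra.columns)   # the original columns

# spillover: non-NaN values per row in the extra columns
spillover = extra.notna().sum(axis=1).tolist()
# NaN values per row in the original columns
nan_in_original = original.isnull().sum(axis=1).tolist()

print(spillover)        # [1, 0, 2, 1]
print(nan_in_original)  # [1, 1, 2, 2]
```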

I can get the number of extra columns (Unnamed: 1 and so on) present in a dataframe with:

len(df.filter(regex="Unnamed:.*").columns)

Thank you!


Solution

  • Updated Answer

The logic that @mozway gave was an elegant one-liner which I liked a lot, but for some reason it does not always work. It also does not give the non-NaN values in the extra columns.

I managed to get it working with slightly longer but relatively easy-to-understand logic. Here goes:

    import pandas as pd

    # read the Excel file
    df = pd.read_excel('df.xlsx')

    # split the df into the extra (Unnamed) columns and the original columns
    extra = df.filter(regex="Unnamed:.*")
    original = df.drop(columns=extra.columns)

    # ori contains a list of the count of NaN values in the original columns, as asked
    ori = original.isnull().sum(axis=1).tolist()  # or .to_dict() if you want a dict

    # ext1 contains a list of the count of non-NaN values in the extra columns, as asked
    ext = len(extra.columns) - extra.isnull().sum(axis=1)
    ext1 = ext.tolist()  # or .to_dict() if you want a dict
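As a side note, the subtraction in the last step can be collapsed: `len(extra.columns) - extra.isnull().sum(axis=1)` counts the non-NaN cells per row, which is exactly what `notna().sum(axis=1)` does in one call. A small equivalent sketch, using the extra columns assumed from the sample table:

```python
import numpy as np
import pandas as pd

# extra (Unnamed) columns, values taken from the sample table
extra = pd.DataFrame({
    "Unnamed: 1": [24, np.nan, 13, np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77, 18],
})

# non-NaN values per row, in one step
ext1 = extra.notna().sum(axis=1).tolist()
print(ext1)  # [1, 0, 2, 1]
```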
    

    Original comment/answer

@mozway As mentioned in the comments, I am adding the code where I tried to apply your logic to only a subset of the dataframe:

extra = df.filter(regex="Unnamed:.*")
    y = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
    

According to the dataframe, the output should be [1, 2, 0, 1] (there is 1 NaN value in the extra columns in row 1, 2 in row 2, 0 in row 3 and 1 in row 4), but the above code gives [0, 2, 0, 1].
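The gap between the two outputs comes from `cummin`: `isna().cummin(axis=1)` stays True only while every value seen so far in the row is NaN, so the sum counts the leading run of NaNs, not the total. A minimal reproduction, with the extra-column values assumed from the sample table:

```python
import numpy as np
import pandas as pd

# extra (Unnamed) columns, values taken from the sample table
extra = pd.DataFrame({
    "Unnamed: 1": [24, np.nan, 13, np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77, 18],
})

# counts only the run of NaNs at the start of each row:
# row 1 is [24, NaN], so its trailing NaN is not counted
leading_nans = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
# counts every NaN in the row
total_nans = extra.isna().sum(axis=1).tolist()

print(leading_nans)  # [0, 2, 0, 1]
print(total_nans)    # [1, 2, 0, 1]
```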