I have a dataframe where the rows have been shifted horizontally by an unknown amount. Each row has been shifted by a different amount, as shown below:
Heading 1 | Heading 2 | Unnamed: 1 | Unnamed: 2 |
---|---|---|---|
NaN | 34 | 24 | NaN |
5 | NaN | NaN | NaN |
NaN | NaN | 13 | 77 |
NaN | NaN | NaN | 18 |
In the above dataframe, there are only 2 original columns (Heading 1 and Heading 2), but due to the row shifts (in rows 1, 3 and 4) extra columns have been created with the default names Unnamed: 1 and Unnamed: 2.
Now for each row, I want to calculate:
1.) The spill over. Spill over is basically the number of non-NaN values in the extra columns (the Unnamed columns). For example, in row 1 there is one non-NaN value in the extra columns (Unnamed: 1), hence the spill over is 1. In row 2 there are no non-NaN values in the extra columns, so the spill over is 0. In row 3 there are 2 non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2), hence the spill over is 2, and in row 4 there is 1 non-NaN value in the extra columns, so the spill over is 1.
2.) The number of NaN values in the original columns (Heading 1 and Heading 2). For example, in row 1 the number of NaN values in the original columns is 1, in row 2 it is 1, in row 3 it is 2 and in row 4 it is 2.
So basically, for each row I have to calculate the number of NaN values in the original columns (Heading 1 and Heading 2) and the number of non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2).
I can get the number of extra columns (Unnamed: 1 and so on) present in a dataframe by:
len(df.filter(regex="Unnamed:.*").columns)
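For anyone who wants to reproduce this, a minimal sketch of the sample dataframe above (column names assumed to match the table) and the extra-column count:

```python
import numpy as np
import pandas as pd

# Sample dataframe from the question, with the shifted rows
df = pd.DataFrame({
    "Heading 1":  [np.nan, 5,      np.nan, np.nan],
    "Heading 2":  [34,     np.nan, np.nan, np.nan],
    "Unnamed: 1": [24,     np.nan, 13,     np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77,     18],
})

# filter() already returns a dataframe, so len() of its columns is enough
n_extra = len(df.filter(regex="Unnamed:.*").columns)
print(n_extra)  # 2
```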
Thank you!
The logic that @mozway gave was an elegant one-liner which I liked a lot, but for some reason it does not always work. It also does not give the non-NaN values in the extra columns.
I managed to get it working with a slightly longer but relatively easy-to-understand piece of logic. Here goes:
# read the Excel file
df = pd.read_excel('df.xlsx')

# subset the df into original and extra dataframes
# (drop expects labels, so pass the extra column names, not the dataframe)
extra = df.filter(regex="Unnamed:.*")
original = df.drop(extra.columns, axis=1)

# ori contains a list of the count of NaN values in the original columns, as asked
ori = original.isnull().sum(axis=1).tolist()  # or to_dict() if you want a dict

# ext1 contains a list of the count of non-NaN values in the extra columns, as asked
ext = len(extra.columns) - extra.isnull().sum(axis=1)
ext1 = ext.tolist()  # or to_dict() if you want a dict
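Running the same logic on the sample dataframe from the question (reconstructed here, as a sketch) gives the counts described above; note that `extra.notna().sum(axis=1)` is an equivalent, slightly shorter way to count the non-NaN values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Heading 1":  [np.nan, 5,      np.nan, np.nan],
    "Heading 2":  [34,     np.nan, np.nan, np.nan],
    "Unnamed: 1": [24,     np.nan, 13,     np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77,     18],
})

extra = df.filter(regex="Unnamed:.*")
original = df.drop(extra.columns, axis=1)

# NaN count per row in the original columns
ori = original.isnull().sum(axis=1).tolist()

# non-NaN count per row in the extra columns (the spill over)
ext1 = extra.notna().sum(axis=1).tolist()

print(ori)   # [1, 1, 2, 2]
print(ext1)  # [1, 0, 2, 1]
```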
@mozway As mentioned in the comments, I am adding the code where I tried to apply your logic to only a subset of the dataframe:
extra = df.filter(regex=("Unnamed:.*"))
y = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
According to the dataframe, the output should be [1, 2, 0, 1] (as there is 1 NaN value in row 1, 2 in row 2, 0 in row 3 and 1 in row 4), but the above code gives [0, 2, 0, 1].
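A sketch of why the two outputs differ (assuming the sample dataframe from the question): `cummin(axis=1)` keeps only the run of leading `True` values in each row, so the one-liner counts *leading* NaNs per row rather than all NaNs. A plain `isna().sum(axis=1)` gives the count the question asks for:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Heading 1":  [np.nan, 5,      np.nan, np.nan],
    "Heading 2":  [34,     np.nan, np.nan, np.nan],
    "Unnamed: 1": [24,     np.nan, 13,     np.nan],
    "Unnamed: 2": [np.nan, np.nan, 77,     18],
})
extra = df.filter(regex="Unnamed:.*")

# cummin(axis=1) truncates each row at its first non-NaN,
# so this counts only the leading NaNs in the extra columns
leading = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
print(leading)  # [0, 2, 0, 1]

# a plain per-row count gives the expected result
total = extra.isna().sum(axis=1).tolist()
print(total)  # [1, 2, 0, 1]
```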