I have this dataset where I have some columns (not important to the calculations) and then many columns with same starting name. I want to calculate the sum of those columns per one row which contains else than NaN-value. The set looks something like this:
id | something | number1 | number2 | number3 | number4 |
---|---|---|---|---|---|
1 | 105 | 200 | NaN | NaN | 50 |
2 | 300 | 2 | 1 | 1 | 33 |
3 | 20 | 1 | NaN | NaN | NaN |
So I want to create new column that contains the length of the number columns that have a value. So the final dataset would look like this:
id | something | number1 | number2 | number3 | number4 | sum_columns |
---|---|---|---|---|---|---|
1 | 105 | 200 | NaN | NaN | 50 | 2 |
2 | 300 | 2 | 1 | 1 | 33 | 4 |
3 | 20 | 1 | NaN | NaN | NaN | 1 |
I know I can calculate the length of columns that start by specific name something like this:
df[df.columns[pd.Series(df.columns).str.startswith('number')]]
but I cant figure out, how can I add condition that there has to be other than NaN value and also how to apply it to every row. I think it could be done with lambda? but haven't succeeded yet.
# filter column on 'number' and count
df['sum_columns']=df.filter(like='number').count(axis=1)
df
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
PS: Your first DF and second DF, the NaN count don't match. I used the second DF in the solution