I have a dataframe which is being generated using pd.get_dummies as below:
df_target = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
where column is a column name and df_column is the dataframe from which each column is being pulled to do some operations.
rev_grp_m2_> 225 rev_grp_m2_nan rev_grp_m2_nan
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
Now I do a check of variance for each column generated and skip those with zero variance.
for target_column in list(df_target.columns):
    # If variance of the dummy created is zero : append it to a list and print to log file.
    if ((np.var(df_target_attribute[[target_column]])[0] != 0)==True):
        df_final[target_column] = df_target[target_column]
Here, because two columns have the same name, I get a KeyError on the np.var line. There are two variance values for the nan column:
rev_grp_m2_nan 0.000819
rev_grp_m2_nan 0.000000
Ideally I would like to take the one with non-zero variance and drop/skip the one with 0 var.
Can someone please help me do this?
Use DataFrame.var to get the variance of each column:
print (df.var())
rev_grp_m2_> 225 0.083333
rev_grp_m2_nan 0.000000
rev_grp_m2_nan 0.000000
Then filter the columns with boolean indexing:
out = df.loc[:, df.var()!= 0]
print (out)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
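To make the above reproducible, here is a minimal sketch that rebuilds a frame like the one in the question and applies the same variance filter. The data is hypothetical (12 rows, a single 1 in the first column, two columns sharing one name), and converting the mask with to_numpy() is my own precaution so that the selection is purely positional and never has to align on the duplicated labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the question: 12 rows, two columns share a name.
data = np.zeros((12, 3), dtype=int)
data[7, 0] = 1  # the single 1 in the first dummy column
df = pd.DataFrame(
    data,
    columns=["rev_grp_m2_> 225", "rev_grp_m2_nan", "rev_grp_m2_nan"],
)

# Per-column variance; the result is indexed by the (duplicated) column names.
variances = df.var()

# Turn the boolean Series into a plain NumPy array so the selection below
# is positional and does not depend on the duplicated labels.
mask = (variances != 0).to_numpy()

# Keep only the columns whose variance is non-zero.
out = df.loc[:, mask]
print(out.columns.tolist())
```

Note that DataFrame.var uses the sample variance (ddof=1) by default while np.var uses ddof=0. The values differ, but both are zero exactly when the column is constant, so the filter gives the same result either way.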
EDIT: You can get the positions of the columns with non-zero variance and then select them with iloc:
cols = [i for i in np.arange(len(df.columns)) if np.var(df.iloc[:, i]) != 0]
print (cols)
[0]
df = df.iloc[:, cols]
print (df)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
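If you want to keep the shape of your original loop (including the list of skipped dummies for the log), a positional version could look like the sketch below. The frame is hypothetical and only meant to mimic the case from the question where the two rev_grp_m2_nan columns have different variances. The names kept and skipped are my own, not from your code.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the two "rev_grp_m2_nan" columns differ, so only one
# of them has zero variance.
df_target = pd.DataFrame(
    [[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]],
    columns=["rev_grp_m2_> 225", "rev_grp_m2_nan", "rev_grp_m2_nan"],
)

kept = []     # positions of columns with non-zero variance
skipped = []  # names of zero-variance columns, e.g. for a log file

for pos, name in enumerate(df_target.columns):
    # iloc returns exactly one column even when the name is duplicated.
    if np.var(df_target.iloc[:, pos].to_numpy()) != 0:
        kept.append(pos)
    else:
        skipped.append(name)

# Select by position, so the surviving duplicate name causes no ambiguity.
df_final = df_target.iloc[:, kept]
print(df_final.columns.tolist())
print(skipped)
```

Because everything is addressed by position, the copy of rev_grp_m2_nan with non-zero variance survives and the zero-variance copy is skipped, which is what the question asks for.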
Another idea is to filter out the columns in which all values are 0:
cols = [i for i in np.arange(len(df.columns)) if (df.iloc[:, i] != 0).any()]
out = df.iloc[:, cols]
Or:
out = df.loc[:, (df != 0).any()]
print (out)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
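One caveat with the last approach: "has any non-zero value" is not quite the same as "has non-zero variance". A dummy column that is 1 in every row is constant, yet it passes the non-zero test. The sketch below illustrates the difference on a small hypothetical frame. Comparing each row with the first row is just one way I would check for constant columns, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one varying column, one all-zero, one all-one.
df = pd.DataFrame(
    {"varies": [0, 1, 0, 0], "all_zero": [0, 0, 0, 0], "all_one": [1, 1, 1, 1]}
)

# "Any non-zero value" keeps the all-one column although its variance is zero.
nonzero_mask = (df != 0).any().to_numpy()

# Comparing every row with the first row flags only columns that really vary.
values = df.to_numpy()
varying_mask = (values != values[0]).any(axis=0)

out_nonzero = df.loc[:, nonzero_mask]
out_varying = df.loc[:, varying_mask]
print(out_nonzero.columns.tolist())
print(out_varying.columns.tolist())
```

For get_dummies output this only matters if some category covers every row, but in that case the variance-based filter is the safer choice.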