python, pandas, dataframe

Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)


I think I know why I am getting this error: the boolean mask built from the inner slice does not match the columns of the full data frame. It just has not clicked how to deal with this problem.

The code is pretty simple. I have a data frame df with many columns. I want to drop every column that is more than 70% zeros, but the rule should apply only to the columns from position 22 onward.

df = df.loc[:, (df.iloc[:, 22:]==0).mean() < 0.7]

Solution

  • You get the error because the second argument you pass to df.loc is a boolean Series computed from the slice df.iloc[:, 22:], so its index covers only those columns and is shorter than the full column index of df. When df.loc tries to align this shorter Series with all of df's columns, the first 22 columns have no matching entry, so the boolean indexing fails.

    You can avoid this by applying the mask to the sliced frame instead. Here (df == 0).mean() is the fraction of zeros in each column:

    df.iloc[:, 22:].loc[:, (df == 0).mean() < 0.7]
    

    This works because .loc can align a boolean Series that covers more labels than the object being indexed (the extra entries are ignored), but not one that covers fewer.

    If you only need the columns from position 22 onward, you can assign the result back to your original dataframe name, as follows:

    df = df.iloc[:, 22:].loc[:, (df == 0).mean() < 0.7]
    

    However, if you want the final dataframe to also contain the first 22 columns (df.iloc[:, :22]), you can .join() those leading columns with the filtered ones, as follows:

    df1 = df.iloc[:, 22:].loc[:, (df == 0).mean() < 0.7]
    df = df.iloc[:, :22].join(df1)
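
To see the failure and the fix on something small, here is a minimal sketch on a toy frame. The column names, the data, and N_KEEP = 3 (standing in for 22) are made up for illustration. The reindex variant at the end is an alternative to the join, assuming unique column names: it pads the short mask to full length and marks the leading columns as always kept.

```python
import pandas as pd

# Hypothetical toy data: the first N_KEEP columns are exempt from the rule
# (N_KEEP plays the role of 22 in the question).
N_KEEP = 3
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [0, 0, 0, 0],   # all zeros, but exempt because it is a leading column
    "b": [5, 6, 7, 8],
    "c": [0, 0, 0, 1],   # 75% zeros -> should be dropped
    "d": [0, 0, 3, 4],   # 50% zeros -> should be kept
    "e": [0, 0, 0, 0],   # 100% zeros -> should be dropped
})

tail = df.iloc[:, N_KEEP:]
# Fraction of zeros per column; the mask is indexed by c, d, e only.
mask = (tail == 0).mean() < 0.7

# The original attempt: the mask has no entries for id, a, b,
# so .loc cannot align it with the full column index.
error_name = None
try:
    df.loc[:, mask]
except Exception as exc:
    error_name = type(exc).__name__

# Fix: filter the slice the mask was computed from, then join it
# back onto the leading columns.
result = df.iloc[:, :N_KEEP].join(tail.loc[:, mask])

# Alternative: pad the mask to full length, keeping the leading columns.
full_mask = mask.reindex(df.columns, fill_value=True)
result_alt = df.loc[:, full_mask]

print(error_name)
print(list(result.columns))
```

Both result and result_alt keep the exempt leading columns and drop only the trailing columns that are at least 70% zeros.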