python-2.7 pandas impyla

Not calculating sum for all columns in pandas dataframe


I'm pulling data from Impala using impyla and converting it to a DataFrame with as_pandas. I'm using pandas 0.18.0 and Python 2.7.9.

I'm trying to calculate the sum of every column in the DataFrame and keep only the columns whose sum is greater than a threshold.

self.data = self.data.loc[:,self.data.sum(axis=0) > 15]

But when I run this, I get the following error:

pandas.core.indexing.IndexingError: Unalignable boolean Series key provided

Then I tried the following:

print 'length : ',len(self.data.sum(axis = 0)),' all columns : ',len(self.data.columns)

The two lengths are different:

length : 78 all columns : 83

I also get the following warning:

C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception

To achieve my goal, I tried another way:

for column in self.data.columns:
    column_sum = self.data[column].sum()
    if column_sum < 15:
        self.data = self.data.drop(column, axis=1)

Now I get a different error and the same warning:

TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception

Then I checked the data type of each column:

print 'dtypes : ', self.data.dtypes

Every column is one of int64, object, or float64. So I tried converting the object columns to numeric:

self.data.convert_objects(convert_numeric=True)

I still get the same errors. Please help me solve this.

Note: none of the columns contain strings (characters), missing values, or empty values. I checked this using self.data.to_csv.

As I'm new to pandas and Python, please don't mind if this is a silly question. I just want to learn.


Solution

  • Please review the simple code below; it should help you understand the cause of the error.

    import pandas as pd
    import numpy as np
    
    
    df = pd.DataFrame(np.random.random([3,3]))
    df.iloc[0,0] = np.nan
    
    print df
    print df.sum(axis=0) > 1.5
    print df.loc[:, df.sum(axis=0) > 1.5]
    
    df.iloc[0,0] = 'string'
    
    print df
    print df.sum(axis=0) > 1.5
    print df.loc[:, df.sum(axis=0) > 1.5]
    
              0         1         2
    0       NaN  0.336250  0.801349
    1  0.930947  0.803907  0.139484
    2  0.826946  0.229269  0.367627
    
    0     True
    1    False
    2    False
    dtype: bool
    
              0
    0       NaN
    1  0.930947
    2  0.826946
    
              0         1         2
    0    string  0.336250  0.801349
    1  0.930947  0.803907  0.139484
    2  0.826946  0.229269  0.367627
    
    1    False
    2    False
    dtype: bool
    
    Traceback (most recent call last):
    ...
    pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
    

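    In short, your data needs additional preprocessing. As the second output shows, sum(axis=0) leaves out column 0 once it contains a string, so the boolean Series has fewer entries than the DataFrame has columns and .loc cannot align it. To find the affected columns, select the ones with object dtype: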

    df.select_dtypes(include=['object'])
    

    If they contain numeric strings that can be converted, convert them with df.astype(); otherwise, remove them.
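
The preprocessing described above could look roughly like the sketch below. It assumes that the object columns hold decimal.Decimal values, which the TypeError in the question hints at but does not prove. The DataFrame, its column names "a", "b" and "c", and the variable names are hypothetical stand-ins for the Impala result. Note that converting with astype(float) gives up exact decimal precision, so this may not suit every use case.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical stand-in for the Impala result: "b" and "c" hold Decimal
# objects (so their dtype is object), while "a" is a plain float column.
df = pd.DataFrame({
    "a": [1.5, 2.5, 3.0],
    "b": [Decimal("10.0"), Decimal("4.5"), Decimal("2.5")],
    "c": [Decimal("0.5"), Decimal("1.0"), Decimal("2.0")],
})

# Step 1: see which Python types are stored inside each object column.
object_columns = df.select_dtypes(include=["object"]).columns
for name in object_columns:
    type_names = sorted(set(type(value).__name__ for value in df[name]))
    print("%s: %s" % (name, type_names))

# Step 2: convert the object columns to float64 so that sum() covers them.
converted = df.copy()
for name in object_columns:
    converted[name] = converted[name].astype(float)

# Step 3: the mask now has one entry per column, so .loc can align it.
threshold = 15
column_sums = converted.sum(axis=0)
filtered = converted.loc[:, column_sums > threshold]
print(list(filtered.columns))
```

With this sample data only column "b" (sum 17.0) exceeds the threshold, so it is the only column kept. If step 1 reveals real strings rather than Decimal values, astype(float) raises an error for values it cannot parse, and those columns would have to be cleaned or dropped instead.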