
Error calculating entropy over pandas series


I'm trying to calculate the entropy over a pandas series. Specifically, I group consecutive identical strings in Direction into runs, using this expression:

diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()

This labels each run of consecutive identical Direction strings with an integer that increases by one every time the string changes. For each run of the same Direction string, I want to calculate the entropy of X and Y.

For the example data below, the run labels produced by this expression are:

0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    3
9    3
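
As a minimal sketch of how those labels come about, the snippet below applies the same ne/shift/cumsum idiom to the Direction values from the example and also counts the rows in each run. The names `changed`, `run_id` and `run_lengths` are illustrative and not part of the original code.

```python
import pandas as pd

direction = pd.Series(
    ['Left', 'Left', 'Left', 'Left', 'Left',
     'Right', 'Right', 'Right', 'Left', 'Left']
)

# True wherever a value differs from the previous row. The first row is
# always True because it is compared against the NaN produced by shift().
changed = direction.ne(direction.shift())

# The running total of those flags gives each consecutive run its own label.
run_id = changed.cumsum()

# Number of rows in each run, keyed by run label.
run_lengths = run_id.value_counts().sort_index()

print(run_id.tolist())        # [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
print(run_lengths.to_dict())  # {1: 5, 2: 3, 3: 2}
```

Note that the third run contains only two rows, which matters for the error described further down.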

This code used to work but it's now returning an error. I'm not sure if this was after an upgrade.

import pandas as pd
import numpy as np

def ApEn(U, m = 2, r = 0.2):

    '''
    Approximate Entropy 

    Quantify the amount of regularity over time-series data.

    Input parameters:
    
    U = Time series
    m = Length of compared run of data (subseries length)
    r = Filtering level (tolerance). A positive number

    '''

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)] 
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):

    '''
    Calculate entropy for individual direction
    '''

    df = df[['Time','Direction','X','Y']]
                                    
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()

    # Calculate ApEn grouped by direction. 
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)                 

    return df


df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)

direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction


# Calculate defensive regularity
entropy = Entropy(df)

Error:

return (N - m + 1.0)**(-1) * sum(np.log(C))
ZeroDivisionError: 0.0 cannot be raised to a negative power

Solution

  • The error comes from the expression below:

    (N - m + 1.0)**(-1)
    

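    Since N = len(U), N is the size of a group produced by groupby. ApEn evaluates _phi(m + 1) and then _phi(m), so with the default m = 2 this expression runs with m = 3 and with m = 2. For a group of size 2 (such as the last two 'Left' rows in the example), the m = 3 call becomes

    (2 - 3 + 1.0)**(-1)  ->  0.0**(-1)
    

    and for a group of size 1 the m = 2 call becomes (1 - 2 + 1.0)**(-1), which is also 0.0**(-1). Raising 0.0 to a negative power is undefined, hence the ZeroDivisionError.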

    From a theoretical point of view, how would you define the approximate entropy of a time series with only one or two values? One could argue it is highly unpredictable, so the value should be as high as possible. Here we instead return np.nan to mark it as undefined (entropy is always greater than or equal to 0, so NaN cannot be mistaken for a real result).
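
To see which group sizes run into this, the sketch below tabulates the window count N - m + 1 for both calls that ApEn makes, _phi(m + 1) and _phi(m). The helper names `window_count` and `sizes_hitting_zero` are hypothetical and used only for illustration.

```python
def window_count(n, m):
    """Number of length-m windows in a series of length n, i.e. N - m + 1."""
    return n - m + 1


def sizes_hitting_zero(max_n, m=2):
    """Group sizes for which _phi(m + 1) or _phi(m) sees a window count of 0."""
    return [
        n for n in range(1, max_n + 1)
        if 0 in (window_count(n, m + 1), window_count(n, m))
    ]


print(sizes_hitting_zero(6))  # [1, 2] for the default m = 2

# Python refuses to raise a float zero to a negative power.
try:
    (2 - 3 + 1.0) ** (-1)
except ZeroDivisionError as exc:
    print(type(exc).__name__)  # ZeroDivisionError
```

With the default m = 2, groups of one or two rows are the ones affected; groups of three or more rows have at least one window in both calls.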

    Updated code:

    import pandas as pd
    import numpy as np
    
    def ApEn(U, m = 2, r = 0.2):
    
        '''
        Approximate Entropy 
    
        Quantify the amount of regularity over time-series data.
    
        Input parameters:
        
        U = Time series
        m = Length of compared run of data (subseries length)
        r = Filtering level (tolerance). A positive number
    
        '''
    
        def _maxdist(x_i, x_j):
            return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    
        def _phi(m):
            x = [U.tolist()[i:i + m] for i in range(N - m + 1)] 
            C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
            # No window of length m fits in this group, so ApEn is undefined
            if (N - m + 1) == 0:
                return np.nan
            return (N - m + 1)**(-1) * sum(np.log(C))
    
        N = len(U)
    
        return abs(_phi(m + 1) - _phi(m))
    
    def Entropy(df):
    
        '''
        Calculate entropy for individual direction
        '''
    
        # Copy the selection so the new columns are not assigned to a slice
        df = df[['Time','Direction','X','Y']].copy()
                                        
        diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    
        # Calculate ApEn grouped by direction. 
        df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
        df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    
        return df
    
    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
    df['Time'] = range(1, len(df) + 1)
    
    direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
    df['Direction'] = direction
    
    # Calculate defensive regularity
    print(Entropy(df))
    

    Output:

       Time Direction   X   Y    ApEn_X    ApEn_Y
    0     1      Left   6  16  0.287682  0.287682
    1     2      Left  22   6  0.287682  0.287682
    2     3      Left  16   5  0.287682  0.287682
    3     4      Left   5  48  0.287682  0.287682
    4     5      Left  11  21  0.287682  0.287682
    5     6     Right  44  25  0.693147  0.693147
    6     7     Right  14  12  0.693147  0.693147
    7     8     Right  43  40  0.693147  0.693147
    8     9      Left  46  44       NaN       NaN
    9    10      Left  49   2       NaN       NaN
    

    Larger sample (where short runs would otherwise hit the 0**-1 issue):

    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(0,50, size = (100, 2)), columns=list('XY'))
    df['Time'] = range(1, len(df) + 1)
    direction = ['Left','Right','Up','Down']
    df['Direction'] = np.random.choice((direction), len(df))
    print(Entropy(df))
    

    Output:

        Time Direction   X   Y  ApEn_X  ApEn_Y
    0      1      Left  44  47     NaN     NaN
    1      2      Left   0   3     NaN     NaN
    2      3      Down   3  39     NaN     NaN
    3      4     Right   9  19     NaN     NaN
    4      5        Up  21  36     NaN     NaN
    ..   ...       ...  ..  ..     ...     ...
    95    96        Up  19  33     NaN     NaN
    96    97      Left  40  32     NaN     NaN
    97    98        Up  36   6     NaN     NaN
    98    99      Left  21  31     NaN     NaN
    99   100     Right  13   7     NaN     NaN
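
With the guard in place, a row comes back as NaN when its run is too short for both _phi calls, that is, when the run has fewer than m + 1 rows. A minimal sketch for flagging such rows up front, reusing the ne/shift/cumsum labels from the question, could look like this; `short_run_mask` is a hypothetical helper and not part of the original code.

```python
import pandas as pd


def short_run_mask(direction, m=2):
    """Boolean Series: True where the row's run has fewer than m + 1 rows."""
    # Label each run of consecutive identical values.
    run_id = direction.ne(direction.shift()).cumsum()
    # Map every row to the length of the run it belongs to.
    run_length = run_id.map(run_id.value_counts())
    return run_length < m + 1


direction = pd.Series(
    ['Left', 'Left', 'Left', 'Left', 'Left',
     'Right', 'Right', 'Right', 'Left', 'Left']
)
mask = short_run_mask(direction)
print(mask.tolist())
```

For the 10-row example this flags only the last two rows, matching the NaN rows in the first output. In the 100-row sample, four randomly chosen directions are likely to produce mostly short runs, which would be consistent with the NaN rows shown.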