I'm trying to calculate the entropy over a pandas Series. Specifically, I group consecutive identical strings in Direction
into runs, using this expression:
diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
It labels each run of identical consecutive Direction strings with the same integer, and the label increases by one at every change. For each such run of the same Direction
string, I want to calculate the entropy of X and Y.
Applied to the sample data, this expression gives the following run labels:
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
This code used to work but it's now returning an error. I'm not sure if this was after an upgrade.
import pandas as pd
import numpy as np
def ApEn(U, m = 2, r = 0.2):
    '''
    Approximate Entropy
    Quantify the amount of regularity over time-series data.
    Input parameters:
    U = Time series
    m = Length of compared run of data (subseries length)
    r = Filtering level (tolerance). A positive number
    '''
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    df = df[['Time','Direction','X','Y']]
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    return df
df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction
# Calculate defensive regularity
entropy = Entropy(df)
Error:
return (N - m + 1.0)**(-1) * sum(np.log(C))
ZeroDivisionError: 0.0 cannot be raised to a negative power
The problem is this expression:
(N - m + 1.0)**(-1)
Since N = len(U), N is the size of a group produced by groupby. Consider a group of size 1, so N == 1. With m == 2 the base becomes
(1 - 2 + 1) == 0
and 0**-1 is undefined, hence the error. The same happens for a group of size 2 in the _phi(m + 1) call, because (2 - 3 + 1) == 0. In the sample data that is the final run of two Left rows.
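The arithmetic above can be checked directly. The sketch below is only an illustration: the helper names window_count and failing_sizes are made up here and are not part of the original code. It evaluates N - m + 1 for small group sizes and for the two values that ApEn passes to _phi, namely m and m + 1.

```python
def window_count(n, m):
    """Number of length-m windows in a series of n values (the N - m + 1 term)."""
    return n - m + 1


def failing_sizes(max_n, m=2):
    """Group sizes for which _phi(m) or _phi(m + 1) would evaluate 0 ** -1."""
    return [
        n for n in range(1, max_n + 1)
        if window_count(n, m) == 0 or window_count(n, m + 1) == 0
    ]


# Show N - m + 1 for m = 2 and m + 1 = 3 over small group sizes.
for n in range(1, 6):
    print(n, window_count(n, 2), window_count(n, 3))

print(failing_sizes(5))  # prints [1, 2]
```

With the default m = 2, only group sizes 1 and 2 produce a zero base.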
Theoretically, how would you define the approximate entropy of a time series with just one value? It is highly unpredictable, so the entropy should be as high as possible. For this case, let us set it to np.nan
to denote that it is not defined (entropy is always greater than or equal to 0).
import pandas as pd
import numpy as np
def ApEn(U, m = 2, r = 0.2):
    '''
    Approximate Entropy
    Quantify the amount of regularity over time-series data.
    Input parameters:
    U = Time series
    m = Length of compared run of data (subseries length)
    r = Filtering level (tolerance). A positive number
    '''
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        if (N - m + 1) == 0:
            return np.nan
        return (N - m + 1)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    df = df[['Time','Direction','X','Y']]
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    return df
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction
# Calculate defensive regularity
print (Entropy(df))
Output:
   Time Direction   X   Y    ApEn_X    ApEn_Y
0     1      Left   6  16  0.287682  0.287682
1     2      Left  22   6  0.287682  0.287682
2     3      Left  16   5  0.287682  0.287682
3     4      Left   5  48  0.287682  0.287682
4     5      Left  11  21  0.287682  0.287682
5     6     Right  44  25  0.693147  0.693147
6     7     Right  14  12  0.693147  0.693147
7     8     Right  43  40  0.693147  0.693147
8     9      Left  46  44       NaN       NaN
9    10      Left  49   2       NaN       NaN
Larger sample (which would hit the 0**-1 issue without the guard):
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50, size = (100, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Right','Up','Down']
df['Direction'] = np.random.choice((direction), len(df))
print (Entropy(df))
Output:
    Time Direction   X   Y  ApEn_X  ApEn_Y
 0     1      Left  44  47     NaN     NaN
 1     2      Left   0   3     NaN     NaN
 2     3      Down   3  39     NaN     NaN
 3     4     Right   9  19     NaN     NaN
 4     5        Up  21  36     NaN     NaN
..   ...       ...  ..  ..     ...     ...
95    96        Up  19  33     NaN     NaN
96    97      Left  40  32     NaN     NaN
97    98        Up  36   6     NaN     NaN
98    99      Left  21  31     NaN     NaN
99   100     Right  13   7     NaN     NaN
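To see which rows will come back as NaN without running ApEn at all, one can look at the run lengths. The following is a minimal sketch, assuming the same run-labelling expression as in the question and the guarded _phi shown earlier; the function name short_run_mask is hypothetical and not part of the original code.

```python
import pandas as pd


def short_run_mask(direction, m=2):
    """Flag rows whose run of identical consecutive values has m - 1 or m rows.

    Assumption: with the guarded _phi, exactly these run lengths make
    N - m + 1 == 0 in either _phi(m) or _phi(m + 1), so ApEn returns NaN.
    """
    # Label each run of identical consecutive values with its own integer.
    run_id = direction.ne(direction.shift()).cumsum()
    # Broadcast each run's length back onto its rows.
    run_len = direction.groupby(run_id).transform('size')
    return run_len.isin([m - 1, m])


direction = pd.Series(['Left'] * 5 + ['Right'] * 3 + ['Left'] * 2)
mask = short_run_mask(direction)
print(mask.tolist())
```

For the ten-row sample, only the last two rows are flagged, which matches the NaN rows in the first output. With randomly chosen directions, most runs are short, which is why the larger sample shows NaN in the rows printed.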