Pythonic code for conditional loop over array (Market Meanness Index)

I'm just starting out with Python; all my previous experience is with C++-style languages.

In an attempt to learn "good" Python I've been trying to convert this C-like function into Python.

var MMI(var *Data,int Length)
{
  var m = Median(Data,Length);
  int i, nh=0, nl=0;
  for(i=1; i<Length; i++) {
    if(Data[i] > m && Data[i] > Data[i-1])
      nl++;
    else if(Data[i] < m && Data[i] < Data[i-1])
      nh++;
  }
  return 100.*(nl+nh)/(Length-1);
}

I'm pretty sure I can do it easily with a for loop, but I've been trying to do it using a series of array operations, rather than an explicit loop. I came up with:

import numpy as np
import pandas as pd
from pandas import Series 

def MMI( buffer, mmi_length ):
    window = Series( buffer[:mmi_length] )
    m      = window.median()

    nh = np.logical_and( [window > m], [window > window.shift(1)] ) 
    nl = np.logical_and( [window < m], [window < window.shift(1)] ) 
    nl = np.logical_and( [not nh], [nl] )

    return 100 * ( nh.sum() + nl.sum() ) / mmi_length

The final np.logical_and( [not nh], [nl] ) gives a "truth value ambiguous" error, which I don't understand, but more importantly I'm not sure whether this approach will actually yield a valid result in Python.

Could someone provide a pointer on how I should code this elegantly, or slap me on the head and tell me to just use a loop?

Ian

Solution

Python is implicit, unlike C++ where you almost have to declare everything. Python, together with numpy/pandas and other modules, has a ton of optimized functionality built in, so that you can work without a lot of explicit loops or value-by-value comparisons. Under the hood the modules still loop over the values, but they do so in compiled code, which is usually much faster than a loop written in Python. So the vectorized form is not magic, just a loop behind a pretty cover.

Now, let's look at your code:

import numpy as np # no need for pandas here


def MMI( buffer, mmi_length ):
    # shift(n) does not do what you want, so we define
    # two aligned arrays explicitly instead
    data = np.asarray(buffer[:mmi_length])
    window = data[1:]             # Data[i]   for i = 1 .. Length-1
    window_shifted = data[:-1]    # Data[i-1] for i = 1 .. Length-1
    m = np.median(data)           # median over the full window, as in the C code

    # instead of using all these explicit functions, simply do:
    nh = (window > m) & (window > window_shifted)
    nl = (window < m) & (window < window_shifted)
    nl = ~nh & nl                 # ~ inverts a boolean array element-wise
                                  # (redundant here: nh and nl never overlap)

    # the C code divides by Length-1, the number of comparisons made
    return 100.0*(nh.sum()+nl.sum())/(mmi_length-1)

Now let's explain:

A Series is basically an array; in this context a Series seems like overkill. If you compare such an object with a scalar, you get an array of booleans expressing which values met the condition and which didn't (the same goes for comparing two arrays: the result is a boolean array holding the value-by-value comparison).

In the first step, you compare an array to a scalar (remember, this gives a boolean array) and an array to another array (we'll get to the shift part), and then want to combine the results of the two comparisons with a logical and. The good thing is that you are combining two boolean arrays, so the & operator does this element by element without any explicit function call. The second step is analogous and works the same way.
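To make the first two steps concrete, here is a small sketch with made-up sample values (the array contents and variable names are only for illustration):

```python
import numpy as np

data = np.array([1, 3, 2, 5, 4, 6])   # made-up sample values
m = np.median(data)                    # 3.5

# array vs. scalar: one boolean per element
above = data > m

# array vs. array: compared element by element
rising = data[1:] > data[:-1]

# & combines two boolean arrays of equal length element by element.
# The parentheses are required: & binds more tightly than > and <.
above_and_rising = (data[1:] > m) & (data[1:] > data[:-1])

print(above)             # [False False False  True  True  True]
print(above_and_rising)  # [False False  True False  True]
```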

In the third step, you want to invert a boolean array and combine it with another boolean array. Inversion is done by the ~ operator, which can be used in lots of other places too (e.g. for inverting subset selections). You cannot use the not operator here, since its purpose is to convert its argument into a single truth value (True/False) and return the opposite. But what is the truth value of an array? The logical and of all components? It's not defined, and therefore you get the "ambiguous" error.
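A minimal sketch of the difference between ~ and not on a boolean array (the mask values are made up for illustration):

```python
import numpy as np

mask = np.array([True, False, True])   # made-up boolean array

# ~ inverts a boolean array element by element
inverted = ~mask
print(inverted)   # [False  True False]

# not asks for ONE truth value for the whole array, which numpy refuses to guess
try:
    not mask
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

Note that this applies to boolean arrays; on a plain Python bool, ~ is the integer bitwise operator and does not give the logical opposite.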

The sum() of a boolean array is always the count of True values in the array, so it yields the right result.
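For instance, with a made-up mask (np.count_nonzero is shown only as an equivalent alternative):

```python
import numpy as np

mask = np.array([True, False, True, True])   # made-up boolean array

# True counts as 1 and False as 0, so the sum is the number of True values
count = mask.sum()
same_count = np.count_nonzero(mask)

print(count, same_count)   # 3 3
```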

The other problem with your code is shift(1): applied to this Series, it prepends a NaN and drops the last element, so that you end up with an object of equal length. The first comparison is then made against numpy.NaN, and any < or > comparison with NaN returns False. To keep the alignment explicit, you can simply define a second array at the beginning (which then makes pandas unnecessary), using the same slicing syntax you already used for window.
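The effect of shift(1) can be seen on a tiny Series (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0])   # made-up sample values

shifted = s.shift(1)    # NaN, 1.0, 3.0, 2.0 (same length, last value dropped)
greater = s > shifted   # the first entry compares 1.0 with NaN
less = s < shifted

print(shifted.tolist())
print(greater.tolist())   # [False, True, False, True]
print(less.tolist())      # [False, False, True, False]
```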

PS: a numpy array is not a Python list (all of the above are numpy arrays!). A numpy array is a more complex object that supports all these element-wise operations; with standard Python lists, you have to write your own for loops.
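For comparison, a direct loop port of the C function that works on a plain Python list could look like the sketch below. The name mmi_loop and the sample data are made up for illustration, and statistics.median from the standard library stands in for the C Median function, on the assumption that it returns the ordinary median.

```python
from statistics import median


def mmi_loop(data, length):
    """Loop port of the C function, for plain Python lists (illustrative name)."""
    m = median(data[:length])   # assumed equivalent of Median(Data, Length)
    nh = nl = 0
    for i in range(1, length):
        if data[i] > m and data[i] > data[i - 1]:
            nl += 1
        elif data[i] < m and data[i] < data[i - 1]:
            nh += 1
    # Length-1 comparisons are made, so that is the denominator
    return 100.0 * (nl + nh) / (length - 1)


print(mmi_loop([1, 3, 2, 5, 4, 6], 6))   # 60.0
```

A loop like this is also a handy reference to check a vectorized version against on a few small inputs.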