Tags: python, python-3.x, numpy, nonetype

Reasonable way to have different versions of None?


Working in Python 3.

Say you have a million beetles, and your task is to catalogue the size of their spots. So you will make a table where each row is a beetle and the numbers in the row represent the sizes of its spots:

 [[.3, 1.2, 0.5],
  [.6, .7],
  [1.4, .9, .5, .7],
  [.2, .3, .1, .7, .1]]

Also, you decide to store this in a numpy array, so you pad the lists with None (which numpy will convert to np.nan).

 [[.3, 1.2, 0.5, None, None],
  [.6, .7, None, None, None],
  [1.4, .9, .5, .7, None],
  [.2, .3, .1, .7, .1]]
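
For concreteness, here is a minimal sketch of that padding step, using the toy rows above; numpy converts the None padding to np.nan once the array dtype is float:

    import numpy as np

    rows = [[.3, 1.2, 0.5],
            [.6, .7],
            [1.4, .9, .5, .7],
            [.2, .3, .1, .7, .1]]

    # pad every row to the length of the longest one
    width = max(len(row) for row in rows)
    padded = [row + [None] * (width - len(row)) for row in rows]

    arr = np.array(padded, dtype=float)  # every None becomes np.nan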

But there is a problem: a value represented as None can be None for one of three reasons:

  1. The beetle doesn't have that many spots; that quantity does not exist.

  2. The beetle won't stay still and you can't measure the spot.

  3. You haven't got round to measuring that beetle yet, so the value is unassigned.

My problem doesn't actually involve beetles, but the principles are the same. I want three different None-like values so I can keep these causes of missingness distinct. My current solution is to use sentinel values so large that they are physically improbable, but this is not very safe.
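
For concreteness, this workaround might look something like the following (these particular magic numbers are placeholders of my own, not from the original post):

    # "improbably large" sentinel values: fragile, since nothing stops a
    # real measurement from colliding with one of them
    NOT_APPLICABLE = 1e300
    NOT_MEASURED = 2e300
    UNKNOWN = 3e300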

Assume the sentinels cannot be negative numbers: in reality, the quantity I am measuring could itself be negative.

The data is big and read speed is important.

Edit: comments rightly point out that saying speed is important without saying which operations are involved is a bit meaningless. Principal component analysis will probably be used for variable decorrelation, plus squared Euclidean distance calculations for a clustering algorithm (though the data is sparse in that variable), and possibly some interpolation. Eventually a recursive neural network, but that will come from a library, so I will just have to put the data into an input form. So probably nothing worse than linear algebra, and I think it should all fit in RAM if I am careful.
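
As a sketch of how a NaN encoding interacts with one of those operations, a squared Euclidean distance between two padded rows can skip missing entries with a mask (the row values here are made up):

    import numpy as np

    a = np.array([.3, 1.2, 0.5, np.nan, np.nan])
    b = np.array([.6, .7, np.nan, np.nan, np.nan])

    # squared distance over the coordinates present in both rows
    mask = ~np.isnan(a) & ~np.isnan(b)
    d2 = np.sum((a[mask] - b[mask]) ** 2)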

What is a good strategy?


Solution

  • It was suggested to create three different object instances, one for each of your cases.

    Since you want those objects to have the properties of NaN, you can try creating three distinct NaN instances:

    NOT_APPLICABLE = float("nan")  # the quantity does not exist
    NOT_MEASURED = float("nan")    # the quantity exists but could not be measured
    UNKNOWN = float("nan")         # not measured yet
    

    This is at the limit of being a hack, so use it at your own risk, but I don't believe any Python implementation optimizes float("nan") to always reuse the same object. You can nonetheless add a check that the three instances really are distinct before running:

    if NOT_APPLICABLE is NOT_MEASURED or NOT_MEASURED is UNKNOWN or UNKNOWN is NOT_APPLICABLE:
        raise ValueError("sentinel NaN instances are not distinct")  # or handle otherwise
    

    If this works, it has the advantage of letting you compare a NaN's identity to recover its meaning:

    row = [1.0, 2.4, UNKNOWN]
    
    ...
    
    if value is UNKNOWN:
        ...
    
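    Building on this, a hypothetical helper (the name missing_reason is mine, not part of the original suggestion) could map any cell back to a label:

    def missing_reason(value):
        # identity checks are required: every NaN compares unequal to
        # everything, including itself, so == would never match
        if value is NOT_APPLICABLE:
            return "not applicable"
        if value is NOT_MEASURED:
            return "not measured"
        if value is UNKNOWN:
            return "unknown"
        return None  # an actual measurement
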

    Meanwhile, the values are still genuine NaN floats, so numpy's NaN handling keeps working. Be aware, though, that the identity trick only survives in an object-dtype array: converting to a plain float64 array copies each value into a raw C double, and the is check no longer distinguishes them.
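
    A quick sketch of that caveat, reusing one of the sentinels defined above:

    import numpy as np

    UNKNOWN = float("nan")

    obj_arr = np.array([1.0, 2.4, UNKNOWN], dtype=object)
    print(obj_arr[2] is UNKNOWN)   # True: object dtype keeps the Python float

    f_arr = np.array([1.0, 2.4, UNKNOWN], dtype=float)
    print(f_arr[2] is UNKNOWN)     # False: copied into a raw C double
    print(np.isnan(f_arr[2]))      # True: still an ordinary NaN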

    Disclosure: this is a hacky suggestion; I am eager to hear from others about this.