Tags: keras, deep-learning, neural-network

How to handle discontinuous input distributions in neural network


I am using Keras to set up neural networks. As input data, I use vectors in which each coordinate is either 0 (feature not present or not measured) or a value that ranges, for instance, between 5000 and 10000.

So my input value distribution is roughly a Gaussian centered, say, around 7500, plus a very sharp peak at 0.

I cannot simply remove the vectors that contain 0s, because almost all of them have 0s in at least some coordinates.

So my question is: how do I best normalize the input vectors? I see two possibilities:

  1. Just subtract the mean and divide by the standard deviation. The problem is that the mean is biased by the large number of meaningless 0s, and the std is overestimated, which erases the fine changes in the meaningful measurements.
  2. Compute the mean and standard deviation over the non-zero coordinates only, which is more meaningful. But then all the 0 values that correspond to unmeasured data come out as large negative values, which gives weight to meaningless data (see the sketch after this list).
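To make the trade-off concrete, here is a minimal NumPy sketch of the two options (the toy matrix X is made up; 0 marks a missing value):

```python
import numpy as np

# Toy data: rows are samples, 0 marks a missing measurement,
# real measurements fall roughly in the 5000-10000 range.
X = np.array([[7200.0,    0.0, 8100.0],
              [   0.0, 6900.0, 7600.0],
              [7500.0, 7100.0,    0.0]])

# Option 1: global statistics. The 0s drag the mean down and inflate
# the std, so the real variation around 7500 gets compressed.
x1 = (X - X.mean()) / X.std()

# Option 2: statistics over non-zero entries only. The measurements
# are normalised sensibly, but the 0s map to large negative values.
mask = X != 0
x2 = (X - X[mask].mean()) / X[mask].std()
```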

Does anyone have advice on how to proceed?

Thanks!


Solution

  • Instead, represent your features as 2 dimensions:

    • The first is the normalised value of the feature if it is non-zero (with normalisation statistics computed over the non-zero elements only); otherwise it is 0.
    • The second is 1 if the feature was 0, otherwise it is 0. This makes it possible to distinguish a 0 in the first dimension that comes from a missing value from a 0 that comes from a normalised measurement.

    You can think of this as adding an extra feature that says "the other feature is missing". This way the scale of each feature is normalised, and all information is preserved.
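A minimal sketch of this encoding in NumPy (the helper name `encode_with_missing_flag` and the per-feature statistics are my choices; the answer only prescribes the two-dimensional representation):

```python
import numpy as np

def encode_with_missing_flag(X):
    """Per feature: value normalised over non-zero entries, plus a
    'missing' indicator that is 1 where the raw feature was 0."""
    X = np.asarray(X, dtype=float)
    mask = X != 0                          # True where a real measurement exists
    mean = np.zeros(X.shape[1])
    std = np.ones(X.shape[1])
    for j in range(X.shape[1]):
        vals = X[mask[:, j], j]            # non-zero entries of feature j
        if vals.size:
            mean[j] = vals.mean()
            s = vals.std()
            std[j] = s if s > 0 else 1.0   # guard against constant features
    # Normalised value where measured, 0 where missing ...
    normalised = np.where(mask, (X - mean) / std, 0.0)
    # ... plus the indicator: 1 iff the feature was absent.
    missing = (~mask).astype(float)
    return np.concatenate([normalised, missing], axis=1)
```

The encoded matrix has twice as many columns as the original, so a Keras model consuming it just needs its input dimension doubled, e.g. `input_shape=(2 * n_features,)`.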