Tags: python, precision, outliers, quantization, data-transform

Method to quantize a range of values to keep precision when significant outliers are present in the data


Could you please tell me if there is a suitable quantization method for the following case (preferably implemented in Python)?

There is an input range where the majority of values lie within ±2 standard deviations of the mean, while some huge outliers are present, e.g. [1, 2, 3, 4, 5, 1000]. Quantizing it to an output range of, say, 0-255 would result in a loss of precision because of the huge outlier 1000 (1, 2, 3, 4, 5 would all become 0).
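
To illustrate, here is a minimal sketch of the naive min-max quantization described above (NumPy is just an assumption for the example):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 1000], dtype=float)

# Plain min-max quantization to 0-255: the outlier dominates the scale.
quantized = np.round(
    (values - values.min()) / (values.max() - values.min()) * 255
).astype(np.uint8)

print(quantized)  # [  0   0   1   1   1 255] -- the small values collapse to 0 or 1
```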

However, it is important to keep precision for the values that lie within a few standard deviations of the mean.

Throwing away the outliers or replacing them with NaN is not acceptable; they should be kept in some form. Roughly, using the example above, the output of quantization should be something like [1, 2, 3, 4, 5, 255].

Thank you very much for any input.


Solution

  • I can think of 2 answers to your question.

    1. You write "huge outlier". The term outlier suggests that this number does not really fit the data. If you really have evidence that this observation is not representative (say, because the measurement device was temporarily broken), then I would omit it.
    2. Alternatively, such high values might occur because this variable can truly span a large range of outcomes (e.g. an income variable with Elon Musk in the sample). In this situation I would consider transforming the input, say by taking the logarithm of the numbers first. This would transform your list [1, 2, 3, 4, 5, 1000] to [0, 0.69, 1.10, 1.39, 1.61, 6.91]. These values are already much closer together (see the sketch after this list).

    However, regardless of whether you choose option 1 or 2, it is probably best to compare the outcomes with and without this outlier anyway. You really want to avoid your conclusions being driven by this single observation.
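
As a minimal sketch of option 2 in Python/NumPy (an assumption on my part, since no library was specified): log-transform first, then rescale linearly to 0-255. This assumes all values are strictly positive; use np.log1p if zeros can occur.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 1000], dtype=float)

# Compress the dynamic range with a log transform before quantizing.
logged = np.log(values)

# Linearly rescale the transformed values to the 0-255 output range.
quantized = np.round(
    (logged - logged.min()) / (logged.max() - logged.min()) * 255
).astype(np.uint8)

print(quantized)  # [  0  26  41  51  59 255] -- small values stay distinguishable
```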