python, opencv, scikit-learn, cluster-analysis, dbscan

For DBSCAN in Python, is it mandatory to do both standardization and normalization?


For a DBSCAN implementation, is it necessary to have all the feature columns standardized AND normalized?

e.g.

[[ 664.      ,  703.      , 2901.069079],  
[ 632.      ,  717.      , 2901.069079],  
[ 606.      ,  740.      , 4386.449399],    
[ 635.      ,  751.      , 4386.449399],   
[ 672.      ,  525.      , 4760.874001]]

If I have to run DBSCAN on this, is it mandatory to standardize it first and then normalize it? Or just normalize it?

Additionally, how do these values dictate the choice of eps?
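
For context, this is roughly how I am calling it right now (a minimal sketch; the eps and min_samples values are placeholder guesses, which is exactly what I am unsure about):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [664.0, 703.0, 2901.069079],
    [632.0, 717.0, 2901.069079],
    [606.0, 740.0, 4386.449399],
    [635.0, 751.0, 4386.449399],
    [672.0, 525.0, 4760.874001],
])

# eps is in the same units as the (Euclidean) distances between rows.
labels = DBSCAN(eps=200.0, min_samples=2).fit_predict(X)
print(labels)  # label -1 marks noise points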


Solution

  • Normalizing or standardizing your data can ruin important properties of your data set.

    Some examples:

    • your data are geo coordinates. Latitude and longitude must never be normalized or standardized
    • your data are histograms. The only meaningful normalization is to make the sum of the histogram 1. Never transform single variables! (See the sketch after this list.)
    • your data has a meaningful zero. For example, it is a monetary value. Transforming with sgn(x)*sqrt(abs(x)) may be helpful in some domains, though.
    • your data is sparse. Never standardize. (Normalization may be 'okay' if you do not have negative values.)
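
    To make the histogram and meaningful-zero bullets concrete, here is a minimal illustration with toy data (my own example, not from the question):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Two toy histograms (rows) with the same shape but different total counts.
    H = np.array([
        [ 2.0,  6.0,  2.0],
        [20.0, 60.0, 20.0],
    ])

    # Meaningful: scale each histogram (row) to sum to 1. Both rows become
    # identical, as they should -- they describe the same distribution.
    print(H / H.sum(axis=1, keepdims=True))
    # [[0.2 0.6 0.2]
    #  [0.2 0.6 0.2]]

    # Harmful here: standardizing each bin (column) independently erases the
    # shape and pushes the two equivalent histograms maximally apart.
    print(StandardScaler().fit_transform(H))
    # [[-1. -1. -1.]
    #  [ 1.  1.  1.]]

    # For data with a meaningful zero (e.g. monetary values), a sign-preserving
    # root transform keeps zero fixed while compressing large magnitudes:
    money = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
    print(np.sign(money) * np.sqrt(np.abs(money)))
    # [-10.  -1.   0.   1.  10.]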

    Choose a scaling because of the actual data you have, not "because it is always done". Choose it because it is the right thing for your data, not because it is the default or appears in some tutorial.

    Most likely, if you resort to normalization or standardization, you have not yet understood your data or how to measure distance and similarity on it. People then use normalization as a last resort to get "some" result, but you never know whether that result is meaningful at all.
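
    As for the eps question: eps is a threshold in the units of your distance function, so inspect the actual distances rather than a scaled version of them. A sketch on the five sample rows, using the k-distance heuristic from the original DBSCAN paper (sort each point's distance to its k-th nearest neighbor, with k = min_samples - 1, and look for a knee):

    import numpy as np
    from sklearn.metrics import pairwise_distances
    from sklearn.neighbors import NearestNeighbors

    X = np.array([
        [664.0, 703.0, 2901.069079],
        [632.0, 717.0, 2901.069079],
        [606.0, 740.0, 4386.449399],
        [635.0, 751.0, 4386.449399],
        [672.0, 525.0, 4760.874001],
    ])

    # The third column spans roughly 2900-4760 and dominates Euclidean
    # distance; the first two columns barely influence it.
    print(np.round(pairwise_distances(X)))

    # k-distance values: candidate eps thresholds, in the raw data's units.
    k = 2  # min_samples - 1, assuming min_samples = 3
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    print(np.sort(dist[:, -1]))

    Here the distances are effectively distances in the third column's units. Whether that is right or wrong is exactly the question you must answer from the meaning of your columns, not from a default scaler.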