statistics, data-mining, binning

Smoothing values by bin boundaries: where do you put a value that sits exactly halfway between the lower and upper boundary?


In response to @j.jerrod.taylor's answer, let me rephrase my question to clear any misunderstanding.

I'm new to data mining and am learning how to handle noisy data by smoothing it with equal-width (distance) binning and "bin boundaries". Assume the dataset 1,2,2,3,5,6,6,7,7,8,9. I want to perform:

  1. distance binning with 3 bins, and
  2. Smooth values by Bin Boundaries based on values binned in #1.

Based on the definition in Han, Kamber, and Pei (2012), Data Mining: Concepts and Techniques, Section 3.2.2 "Noisy Data":

In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

  • Interval width = (max-min)/k = (9-1)/3 ≈ 2.7
  • Bin intervals = [1,3.7),[3.7,6.4),[6.4,9.1]

  • original Bin1: 1,2,2,3 | Bin boundaries: (1,3) | Smooth values by Bin Boundaries: 1,1,1,3

  • original Bin2: 5,6,6 | Bin boundaries: (5,6) | Smooth values by Bin Boundaries: 5,6,6
  • original Bin3: 7,7,8,9 | Bin boundaries: (7,9) | Smooth values by Bin Boundaries: 7,7,8,9
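
The binning step above can be reproduced with a minimal Python sketch. The function name `equal_width_bins` is a hypothetical helper, not something from the textbook, and it uses the unrounded width 8/3 rather than the rounded 2.7; for this dataset both give the same bins.

```python
def equal_width_bins(values, k):
    """Split values into k equal-width bins over [min, max].

    Hypothetical helper for illustration; uses the exact width
    (max - min) / k instead of a rounded one.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


data = [1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 9]
bins = equal_width_bins(data, 3)
for i, b in enumerate(bins, start=1):
    # The bin boundaries are simply the min and max inside each bin.
    print(f"Bin{i}: {b} | boundaries: ({min(b)}, {max(b)})")
```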

Question: which boundary should 8 in Bin3 be replaced with when smoothing by bin boundaries, since it is exactly 1 away from both 7 and 9?


Solution

  • UPDATE WITH CORRECT ANSWER:

    My class finally covered this topic, and the answer to my own question is that 8 can be assigned to either 7 or 9. This scenario is described as "tie-breaking", where the value is equidistant from both boundaries. It is acceptable to assign all such values consistently to the same boundary.

    Here is a real example of an NIH analysis paper that explains its use of "tie breaking" when it encounters equidistant values: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3807594/
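
As a rough illustration of a consistent tie-breaking rule, here is a small Python sketch of smoothing by bin boundaries. The function `smooth_by_boundaries` and its `tie` parameter are assumptions made for this example, not part of the textbook or the linked paper. Note that the 2s in Bin1 are also equidistant from 1 and 3, so the same rule decides them too.

```python
def smooth_by_boundaries(bin_values, tie="lower"):
    """Replace each value with the closest bin boundary (min or max).

    `tie` is a hypothetical option: "lower" sends equidistant values
    to the minimum, anything else sends them to the maximum.
    """
    lo, hi = min(bin_values), max(bin_values)
    smoothed = []
    for v in bin_values:
        d_lo, d_hi = v - lo, hi - v
        if d_lo < d_hi:
            smoothed.append(lo)
        elif d_hi < d_lo:
            smoothed.append(hi)
        else:
            # Equidistant: apply the same rule to every tie.
            smoothed.append(lo if tie == "lower" else hi)
    return smoothed


print(smooth_by_boundaries([7, 7, 8, 9], tie="lower"))  # 8 goes to 7
print(smooth_by_boundaries([7, 7, 8, 9], tie="upper"))  # 8 goes to 9
print(smooth_by_boundaries([1, 2, 2, 3], tie="lower"))  # both 2s go to 1
```

Whichever direction is chosen, the point is to apply it to every bin, so the result is reproducible.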