Tags: python, pandas, bins

I have been trying to use qcut to split an array of values into 4 bins, but I get the error below. How can I solve this? I am a beginner in Python.


Below is my data, taken from wkx_old['Sales point'].values:

array([ 2, 2, 2, 4, 4, 3, 1, 4, 2, 1, 3, 4, 1, 1, 4, 7, 4, 1, 1, 2, 4, 3, 4, 3, 3, 2, 5, 2, 3, 2, 3, 4, 2, 10, 4, 4, 6, 3, 3, 1, 1, 2, 1, 3, 2, 4, 5, 2, 4, 3, 2, 3, 4, 3, 1, 1, 6, 3, 6, 5, 7, 2, 1, 1, 6, 5, 1, 1, 1, 2, 2, 1, 2, 2, 4, 4, 1, 5, 7, 2, 1, 2, 1, 5, 3, 1, 1, 2, 3, 3, 5, 4, 4, 6, 1, 4, 4, 1, 3, 4, 4, 5, 4, 4, 1, 1, 3, 1, 2, 1, 3, 7, 2, 1, 1, 3, 3, 6, 1, 6, 2, 3, 7, 1])

I am trying to run the code below:

names=['D','C','B','A']

wkx_old['Rankings'] = pd.qcut(wkx_old['Sales point'],q=4,labels=names)

The error I get is: ValueError: Bin edges must be unique: array([ 1., 1., 3., 4., 10.]). You can drop duplicate edges by setting the 'duplicates' kwarg


Solution

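  • qcut takes its bin edges from the quantiles of the data, so it raises an error when heavily duplicated values put the same number at two split points. In your data the minimum and the 25th percentile are both 1, which is why the edges start with 1., 1. As an extreme case, imagine a qcut on [1]*100: every percentile is 1, so there is no way to split it.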

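    You can use rank(pct=True) to convert each value to its percentile rank, then cut those ranks. Note that cut with bins=4 makes four equal-width intervals over the ranks, so the groups will not necessarily contain the same number of rows: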

    wkx_old['Rankings'] = pd.cut(wkx_old['Sales point'].rank(pct=True), 
                                 bins=4, labels=names)
    

    Output:

    0      C
    1      C
    2      C
    3      B
    4      B
          ..
    119    A
    120    C
    121    C
    122    A
    123    D
    Length: 124, dtype: category
    Categories (4, object): ['D' < 'C' < 'B' < 'A']
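
To see where the duplicate edge comes from and how the workarounds behave, here is a minimal sketch on a small made-up series (the `sales` values are an assumption for illustration, not the data from the question). It prints the quantile edges, shows that plain qcut fails, tries the `duplicates="drop"` option that the error message mentions, and then applies the rank-then-cut approach from the answer.

```python
import pandas as pd

# Made-up sample with many tied low values (not the asker's data).
sales = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 7, 10], name="Sales point")
names = ["D", "C", "B", "A"]

# Quantile edges for q=4: minimum, 25th, 50th, 75th percentile, maximum.
edges = sales.quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
print(edges)  # the first two edges are both 1.0

# Plain qcut fails because of the repeated edge.
try:
    pd.qcut(sales, q=4, labels=names)
    qcut_failed = False
except ValueError as exc:
    qcut_failed = True
    print("qcut raised:", exc)

# Option named in the error message: drop the repeated edge.
# That leaves fewer than 4 bins, so the four labels no longer fit
# and are left out here.
dropped = pd.qcut(sales, q=4, duplicates="drop")
print(dropped.cat.categories)

# Approach from the answer: cut the percentile ranks into 4 equal-width bins.
ranked = pd.cut(sales.rank(pct=True), bins=4, labels=names)
print(ranked.value_counts().sort_index())
```

With `duplicates="drop"` you get fewer bins than requested, so passing all four labels would raise another error. The rank-then-cut version keeps all four labels and gives tied values the same label, at the cost of uneven group sizes.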