Search code examples
pythonpandaspercentiledata-preprocessingquartile

How can I efficiently get [many] quartiles?


I need to encode numerical values by ranges: low: 0, medium: 1, high: 2 , very high: 3. I'm doing it for quartiles. I have the following code:

import pandas as pd
import numpy as np

def fun(df):
    table = df.copy() # pandas dataframe
    N = int(table.shape[0])
    for header in list(table.columns):
        q1 = np.percentile(table[header], 25)
        q2 = np.percentile(table[header], 50)
        q3 = np.percentile(table[header], 75)
        for k in range(0, N):
            if( table[header][k] < q1 ):
                table[header][k] = int(0)
            elif( (table[header][k] >= q1) & (table[header][k] < q2)):
                table[header][k] = int(1)
            elif( (table[header][k] >= q2) & (table[header][k] < q3)):
                table[header][k] = int(2)
            else:
                table[header][k] = int(3)
        pass
    table = table.astype(int)
    return table

Proof

df = pd.DataFrame( {
        'A': [30, 28, 32, 25, 25, 25, 22, 24, 35, 40],
        'B': [25, 30, 27, 40, 42, 40, 50, 45, 30, 25],
        'C': [25.5, 30.1, 27.3, 40.77, 25.1, 25.34, 22.11, 23.81, 33.66, 38.56],
    }, columns = [ 'A', 'B', 'C' ] )

Result:

A  B  C
2  0  1
2  1  2
3  0  2
1  2  3
1  3  0
1  2  1
0  3  0
0  3  0
3  1  3
3  0  3

Any way to do the same, efficiently?


Solution

  • You could use a combination of np.digitize and pd.rank

    In [569]: np.digitize(df.rank(pct=True), bins=[.25, .5, .75], right=True)
    Out[569]:
    array([[2, 0, 1],
           [2, 1, 2],
           [3, 1, 2],
           [1, 2, 3],
           [1, 3, 1],
           [1, 2, 1],
           [0, 3, 0],
           [0, 3, 0],
           [3, 1, 3],
           [3, 0, 3]], dtype=int64)
    

    Details

    In [570]: df.rank(pct=True)
    Out[570]:
         A     B    C
    0  0.7  0.15  0.5
    1  0.6  0.45  0.7
    2  0.8  0.30  0.6
    3  0.4  0.65  1.0
    4  0.4  0.80  0.3
    5  0.4  0.65  0.4
    6  0.1  1.00  0.1
    7  0.2  0.90  0.2
    8  0.9  0.45  0.8
    9  1.0  0.15  0.9
    
    In [571]: pd.DataFrame(np.digitize(df.rank(pct=True), bins=[.25, .5, .75], right=True),
                           columns=df.columns)
    Out[571]:
       A  B  C
    0  2  0  1
    1  2  1  2
    2  3  1  2
    3  1  2  3
    4  1  3  1
    5  1  2  1
    6  0  3  0
    7  0  3  0
    8  3  1  3
    9  3  0  3