python pandas numpy dataframe analytics

Extract mean from categorical data type column


For example, I have this table:

import pandas as pd

# Each row: id, whether an order was placed, hours spent
list_1 = [['1', 'y', 474.0],
          ['2', 'n', 482.0],
          ['3', 'n', 564.0],
          ['4', 'y', 549.0],
          ['5', 'y', 551.0],
          ['6', 'y', 555.0],
          ['7', 'n', 600.0],
          ['8', 'y', 357.0],
          ['9', 'y', 542.0],
          ['10', 'n', 462.0],
          ['11', 'n', 513.0],
          ['12', 'y', 526.0]]
labels = ['id', 'order_?', 'hours_spend']
df = pd.DataFrame(list_1, columns=labels)
df

Result:

    id  order_? hours_spend
0   1   y        474.0
1   2   n        482.0
2   3   n        564.0
3   4   y        549.0
4   5   y        551.0
5   6   y        555.0
6   7   n        600.0
7   8   y        357.0
8   9   y        542.0
9   10  n        462.0
10  11  n        513.0
11  12  y        526.0

I'm going to divide the hours_spend column into 3 groups with the NTILE method (pd.qcut), without labels:

df['ntile']=pd.qcut(df['hours_spend'],3)
df

Result:

    id order_?  hours_spend               ntile
0    1       y        474.0  (356.999, 502.667]
1    2       n        482.0  (356.999, 502.667]
2    3       n        564.0    (549.667, 600.0]
3    4       y        549.0  (502.667, 549.667]
4    5       y        551.0    (549.667, 600.0]
5    6       y        555.0    (549.667, 600.0]
6    7       n        600.0    (549.667, 600.0]
7    8       y        357.0  (356.999, 502.667]
8    9       y        542.0  (502.667, 549.667]
9   10       n        462.0  (356.999, 502.667]
10  11       n        513.0  (502.667, 549.667]
11  12       y        526.0  (502.667, 549.667]

Now I have a column 'ntile' with data type 'category', as df.info() shows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
id             12 non-null object
order_?        12 non-null object
hours_spend    12 non-null float64
ntile          12 non-null category
dtypes: category(1), float64(1), object(2)
memory usage: 556.0+ bytes

How can I add a new column containing the mean (midpoint) of each range in the 'ntile' column?


Solution

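  • You can compute it from the interval endpoints. Note that astype(int) truncates each endpoint to a whole number before averaging, so the values are approximate midpoints: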

    df['mean_ntile'] = (df['ntile'].apply(lambda x: x.left).astype(int) + df['ntile'].apply(lambda x: x.right).astype(int))/2
    print(df)
    

    Output:

        id order_?  hours_spend               ntile  mean_ntile
    0    1       y        474.0  (356.999, 502.667]       429.0
    1    2       n        482.0  (356.999, 502.667]       429.0
    2    3       n        564.0    (549.667, 600.0]       574.5
    3    4       y        549.0  (502.667, 549.667]       525.5
    4    5       y        551.0    (549.667, 600.0]       574.5
    5    6       y        555.0    (549.667, 600.0]       574.5
    6    7       n        600.0    (549.667, 600.0]       574.5
    7    8       y        357.0  (356.999, 502.667]       429.0
    8    9       y        542.0  (502.667, 549.667]       525.5
    9   10       n        462.0  (356.999, 502.667]       429.0
    10  11       n        513.0  (502.667, 549.667]       525.5
    11  12       y        526.0  (502.667, 549.667]       525.5
    

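    As @ALlolz suggested, an easier way is to use each interval's mid attribute. Nothing is truncated, so this gives the exact midpoint (about 429.83 for the first bin rather than 429.0):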

    df['mean_ntile'] = df['ntile'].apply(lambda x: x.mid)
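
If a plain float column is wanted rather than whatever dtype apply returns on a categorical, one possible variant is to take the midpoints from the categories themselves and look them up by category code. This is only a sketch under a few assumptions: the column was produced by pd.qcut (so its categories form an IntervalIndex), there are no missing values (so no code is -1), and the names bin_mids, codes and mean_ntile_truncated are made up for the illustration. The last column repeats the astype(int) calculation to show how far truncation moves the result.

```python
import numpy as np
import pandas as pd

# Same hours_spend values as in the question.
df = pd.DataFrame({
    "hours_spend": [474.0, 482.0, 564.0, 549.0, 551.0, 555.0,
                    600.0, 357.0, 542.0, 462.0, 513.0, 526.0]
})
df["ntile"] = pd.qcut(df["hours_spend"], 3)

# The categories of a qcut result form an IntervalIndex, one interval per bin.
bins = df["ntile"].cat.categories
bin_mids = np.asarray(bins.mid, dtype=float)  # exact midpoint of each bin

# cat.codes holds the bin position of every row; use it to look up the midpoint.
# Assumption: no missing values, so no code equals -1.
codes = df["ntile"].cat.codes.to_numpy()
df["mean_ntile"] = bin_mids[codes]

# For comparison: the astype(int) variant truncates both endpoints first.
truncated = (np.asarray(bins.left).astype(int)
             + np.asarray(bins.right).astype(int)) / 2
df["mean_ntile_truncated"] = truncated[codes]

print(df)
```

Because the midpoints are computed once per bin instead of once per row, this should also scale better on large frames than a row-wise lambda, though for 12 rows the difference is irrelevant.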