Search code examples
pythonpandasfilegroup-bydataset

Pandas Groupby python


I have a dataset with the names of the countries and some other information such as salary in the file. The problem is that I need to find mean salaries of employees in the file grouped by country_id and city in ranges (0, 5000] (5000, 10000] and (10000, 15000]. I was using this method but the resultant table is not as what I want. Can you help me with that?

df = file.groupby(['country_id',"city"])['salary'].mean().reset_index(name="mean") 
bins = [0, 5000]
df['binned'] = pd.cut(df['mean'], bins)
print(df)

Solution

  • I think if your range of the salary bin is always 5000, you can create the bin number of each row by using / operator and math.ceil

    import math
    df = pd.DataFrame({
        'salary':[
            1231, 5000, 7304, 10000, 12302, 15000,
            1001, 4900, 6012, 9123, 11231, 14923
        ],
        'country': [
            'USA','USA','USA','USA','USA','USA',
            'UK','UK','UK','UK','UK','UK'
        ]
    })
    df['salary_bin_number'] = (df['salary'] / 5000).apply(lambda x: math.ceil(x))
    df.head()
    
    
    salary country salary_bin_number
    1231 USA 1
    5000 USA 1
    7304 USA 2
    10000 USA 2
    12302 USA 3

    With the salary_bin_number, you can create the column name of bin by using below code

    df['salary_range_str'] = df['salary_bin_number'].apply(
        lambda bin_number: f'({(bin_number-1) * 5000}-{(bin_number) * 5000}]'
    )
    

    Then group by salary_range_str and country to calculate the average salary in each country,salary_range_str.

    df = df.groupby(['country', 'salary_range_str'])['salary'].mean().reset_index()
    

    Finally, pivot the column salary_range_str to columns.

    df = pd.pivot_table(df, index='country', columns='salary_range_str', values='salary')
    

    Output

    country (0-5000] (10000-15000] (5000-10000]
    UK 2950.5 13077 7567.5
    USA 3115.5 13651 8652