Tags: python, pandas, count, pandas-groupby, categorical-data

Grouping several columns with the same category into one table in pandas


I have a dataset like this:

Feature Name | Description | Data Type
customerID | Contains customer ID | unique ID, categorical, nominal
OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) | categorical, nominal
OnlineBackup | Whether the customer has online backup or not (Yes, No, No internet service) | categorical, nominal
DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) | categorical, nominal
TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) | categorical, nominal
streamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) | categorical, nominal
streamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) | categorical, nominal
Contract | The contract term of the customer (Month-to-month, One year, Two year) | categorical, nominal
PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) | categorical, nominal
PaymentMethod | The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card) | categorical, nominal
MonthlyCharges | The amount charged to the customer monthly | numeric, float
TotalCharges | The total amount charged to the customer | numeric, float
Churn | Whether the customer churned or not (Yes or No) | categorical, nominal

The dataset is from Kaggle (Telco Customer Churn).

As you can see, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, streamingTV, and StreamingMovies share the same categories ["Yes", "No", "No internet service"]. I want to group all of those columns together, with the expected result looking like this:

                    Yes   No    No internet service
OnlineSecurity      3497  1520  2015
OnlineBackup        3497  1520  2015
DeviceProtection    3497  1520  2015
TechSupport         3497  1520  2015
streamingTV         3497  1520  2015
StreamingMovies     3497  1520  2015

The numbers in the table above are just random placeholders; what I actually want is the count of every value in each category, for every one of those columns.
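
For anyone who wants to reproduce this without downloading the full file, a small frame with the same categorical structure can be built like this (the values below are made up just to show the shape; the column names in the Kaggle file are StreamingTV and StreamingMovies):

    import pandas as pd

    # made-up sample with the same categories as the Kaggle Telco churn data
    df = pd.DataFrame({
        'OnlineSecurity':   ['Yes', 'No', 'No internet service', 'Yes'],
        'OnlineBackup':     ['No', 'No', 'No internet service', 'Yes'],
        'DeviceProtection': ['Yes', 'Yes', 'No internet service', 'No'],
        'TechSupport':      ['No', 'Yes', 'No internet service', 'No'],
        'StreamingTV':      ['Yes', 'No', 'No internet service', 'Yes'],
        'StreamingMovies':  ['No', 'No', 'No internet service', 'Yes'],
    })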


Solution

  • I couldn't find the table you posted above in the link but I suppose you have it. I copied it into a metadata file.

    # load the data into df and metadata
    import pandas as pd

    df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
    metadata = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_meta.csv')
    

    Then you will have to get the columns that share the same categories.

    cols = metadata.loc[metadata['Description'].str.contains('Yes, No, No internet service')]['Feature Name'].tolist()
    

    Here we keep the rows whose Description contains "Yes, No, No internet service", which gives us: ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
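
    If you don't have a metadata file, the same list could also be derived directly from the data itself (an alternative sketch, not part of the original answer):

    # find the columns whose set of values is exactly these three categories
    target = {'Yes', 'No', 'No internet service'}
    cols = [c for c in df.columns if set(df[c].dropna().unique()) == target]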

    Then I chose to melt the df with the selected columns, group them with groupby, and count the values.

    results = (df
        .melt(value_vars=cols)            # long format: one row per (column, value) pair
        .groupby(['variable', 'value'])   # group by column name and category
        .agg({'value': 'count'})          # count how often each (column, category) pair occurs
        .unstack()                        # pivot the categories into columns
        .reset_index()
        .droplevel(level=0, axis=1)       # drop the top column level left over from unstacking
    )
    

    This gives you the output:

                          No  No internet service   Yes
    DeviceProtection    3095                 1526  2422
    OnlineBackup        3088                 1526  2429
    OnlineSecurity      3498                 1526  2019
    StreamingMovies     2785                 1526  2732
    StreamingTV         2810                 1526  2707
    TechSupport         3473                 1526  2044
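
    For comparison, the same table can also be built without melt by counting each selected column and transposing (this is an alternative sketch, not part of the original answer):

    # count the categories in each selected column, then flip so the columns become rows
    alt = df[cols].apply(pd.Series.value_counts).T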

    (A comment below asked for a total column.)

    results = (df
        .melt(value_vars=cols)
        .groupby(['variable', 'value'])
        .agg({'value': 'count'})
        .unstack()
        .reset_index()
        .droplevel(level=0, axis=1)
        # numeric_only=True so the string column holding the feature names is excluded from the sum
        .assign(total=lambda x: x.sum(axis=1, numeric_only=True))
    )
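
    As a quick sanity check (my addition, not from the original answer), every row total should equal the number of rows in df, because each customer contributes exactly one value per column; that holds here since those columns have no missing values, and each row of the output above sums to 7043:

    # each customer appears exactly once in every selected column
    assert (results['total'] == len(df)).all()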