python pandas count pandas-groupby categorical-data

Grouping several columns with the same category into one table in pandas

i have dataset like this

Feature Name	Description	Data Type
customerID	Contains customer ID	unique ID, categorical, nominal
OnlineSecurity	Whether the customer has online security or not (Yes, No, No internet service)	categorical, nominal
OnlineBackup	Whether the customer has online backup or not (Yes, No, No internet service)	categorical, nominal
DeviceProtection	Whether the customer has device protection or not (Yes, No, No internet service)	categorical, nominal
TechSupport	Whether the customer has tech support or not (Yes, No, No internet service)	categorical, nominal
streamingTV	Whether the customer has streaming TV or not (Yes, No, No internet service)	categorical, nominal
streamingMovies	Whether the customer has streaming movies or not (Yes, No, No internet service)	categorical, nominal
Contract	The contract term of the customer (Month-to-month, One year, Two year)	categorical, nominal
PaperlessBilling	Whether the customer has paperless billing or not (Yes, No)	categorical, nominal
PaymentMethod	The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card)	categorical, nominal
MonthlyCharges	The amount charged to the customer monthly	numeric , float
TotalCharges	The total amount charged to the customer	numeric, float
Churn	Whether the customer churned or not (Yes or No)	categorical, nominal

the dataset are from kaggle

as you can see, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport,streamingTV, StreamingMovies have same category ["yes", "No", "No internet service]. I want to groupby all those columns with the expected result like this :

	Yes	No	No internet service
OnlineSecurity	3497	1520	2015
DeviceProtection	3497	1520	2015
TechSupport	3497	1520	2015
streamingTV	3497	1520	2015
streamingTV	3497	1520	2015
StreamingMovies	3497	1520	2015

the number in above table is just random value, i want it to be counting every value in each category for every columns

Solution

I couldn't find the table you posted above in the link but I suppose you have it. I copied it into a metadata file.

# load the data into df and metadata
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
metadata = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_meta.csv')

Then you will have to get the columns which have the same category.

cols = metadata.loc[metadata['Description'].str.contains('Yes, No, No internet service')]['Feature Name'].tolist()

Here we check for rows that has Yes, No, No internet service in the columns which gives us: ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

Then I choose to melt the df with the selected columns and group them with groupby and count the values.

results = (df
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})
    .unstack()
    .reset_index()
    .droplevel(level=0, axis=1)
)

This gives you the output:

	No	No internet service	Yes
DeviceProtection	3095	1526	2422
OnlineBackup	3088	1526	2429
OnlineSecurity	3498	1526	2019
StreamingMovies	2785	1526	2732
StreamingTV	2810	1526	2707
TechSupport	3473	1526	2044

(comment below asked for a total column)

results = (df
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})
    .unstack()
    .reset_index()
    .droplevel(level=0, axis=1)
    .assign(total = lambda x: x.sum(axis=1))
)