i have dataset like this
Feature Name | Description | Data Type |
---|---|---|
customerID | Contains customer ID | unique ID, categorical, nominal |
OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) | categorical, nominal |
OnlineBackup | Whether the customer has online backup or not (Yes, No, No internet service) | categorical, nominal |
DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) | categorical, nominal |
TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) | categorical, nominal |
streamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) | categorical, nominal |
streamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) | categorical, nominal |
Contract | The contract term of the customer (Month-to-month, One year, Two year) | categorical, nominal |
PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) | categorical, nominal |
PaymentMethod | The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card) | categorical, nominal |
MonthlyCharges | The amount charged to the customer monthly | numeric , float |
TotalCharges | The total amount charged to the customer | numeric, float |
Churn | Whether the customer churned or not (Yes or No) | categorical, nominal |
the dataset are from kaggle
as you can see, OnlineSecurity
, OnlineBackup
, DeviceProtection
, TechSupport
,streamingTV
, StreamingMovies
have same category ["yes", "No", "No internet service]. I want to groupby all those columns with the expected result like this :
Yes | No | No internet service | |
---|---|---|---|
OnlineSecurity | 3497 | 1520 | 2015 |
DeviceProtection | 3497 | 1520 | 2015 |
TechSupport | 3497 | 1520 | 2015 |
streamingTV | 3497 | 1520 | 2015 |
streamingTV | 3497 | 1520 | 2015 |
StreamingMovies | 3497 | 1520 | 2015 |
the number in above table is just random value, i want it to be counting every value in each category for every columns
I couldn't find the table you posted above in the link but I suppose you have it. I copied it into a metadata
file.
# load the data into df and metadata
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
metadata = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_meta.csv')
Then you will have to get the columns which have the same category.
cols = metadata.loc[metadata['Description'].str.contains('Yes, No, No internet service')]['Feature Name'].tolist()
Here we check for rows that has Yes, No, No internet service
in the columns which gives us: ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
Then I choose to melt the df
with the selected columns and group them with groupby
and count the values.
results = (df
.melt(value_vars=cols)
.groupby(['variable', 'value'])
.agg({'value': 'count'})
.unstack()
.reset_index()
.droplevel(level=0, axis=1)
)
This gives you the output:
No | No internet service | Yes | |
---|---|---|---|
DeviceProtection | 3095 | 1526 | 2422 |
OnlineBackup | 3088 | 1526 | 2429 |
OnlineSecurity | 3498 | 1526 | 2019 |
StreamingMovies | 2785 | 1526 | 2732 |
StreamingTV | 2810 | 1526 | 2707 |
TechSupport | 3473 | 1526 | 2044 |
(comment below asked for a total column)
results = (df
.melt(value_vars=cols)
.groupby(['variable', 'value'])
.agg({'value': 'count'})
.unstack()
.reset_index()
.droplevel(level=0, axis=1)
.assign(total = lambda x: x.sum(axis=1))
)