python variables cluster-analysis churn bigdata

Python: Cluster analysis on a monthly data with a lot of variables

I hope you guys can help me sort this out as I feel this is above me. It might be silly for some of you, but I am lost and I come to you for advice.

I am new to statistics, data analysis and big data. I just started studying and I need to make a project on churn prediction. Yes, this is sort of a homework task, but I hope you can answer some of my questions.

I would be most grateful for a beginner-level answers step-by-step.

Basically, I have a very big data set (obviously) on customer activity data from cellular company for 3 months, the 4th month ending in churned or not churned. Each month has these columns:

['year',
 'month',
 'user_account_id',
 'user_lifetime',
 'user_intake',
 'user_no_outgoing_activity_in_days',
 'user_account_balance_last',
 'user_spendings',
 'user_has_outgoing_calls',
 'user_has_outgoing_sms',
 'user_use_gprs',
 'user_does_reload',
 'reloads_inactive_days',
 'reloads_count',
 'reloads_sum',
 'calls_outgoing_count',
 'calls_outgoing_spendings',
 'calls_outgoing_duration',
 'calls_outgoing_spendings_max',
 'calls_outgoing_duration_max',
 'calls_outgoing_inactive_days',
 'calls_outgoing_to_onnet_count',
 'calls_outgoing_to_onnet_spendings',
 'calls_outgoing_to_onnet_duration',
 'calls_outgoing_to_onnet_inactive_days',
 'calls_outgoing_to_offnet_count',
 'calls_outgoing_to_offnet_spendings',
 'calls_outgoing_to_offnet_duration',
 'calls_outgoing_to_offnet_inactive_days',
 'calls_outgoing_to_abroad_count',
 'calls_outgoing_to_abroad_spendings',
 'calls_outgoing_to_abroad_duration',
 'calls_outgoing_to_abroad_inactive_days',
 'sms_outgoing_count',
 'sms_outgoing_spendings',
 'sms_outgoing_spendings_max',
 'sms_outgoing_inactive_days',
 'sms_outgoing_to_onnet_count',
 'sms_outgoing_to_onnet_spendings',
 'sms_outgoing_to_onnet_inactive_days',
 'sms_outgoing_to_offnet_count',
 'sms_outgoing_to_offnet_spendings',
 'sms_outgoing_to_offnet_inactive_days',
 'sms_outgoing_to_abroad_count',
 'sms_outgoing_to_abroad_spendings',
 'sms_outgoing_to_abroad_inactive_days',
 'sms_incoming_count',
 'sms_incoming_spendings',
 'sms_incoming_from_abroad_count',
 'sms_incoming_from_abroad_spendings',
 'gprs_session_count',
 'gprs_usage',
 'gprs_spendings',
 'gprs_inactive_days',
 'last_100_reloads_count',
 'last_100_reloads_sum',
 'last_100_calls_outgoing_duration',
 'last_100_calls_outgoing_to_onnet_duration',
 'last_100_calls_outgoing_to_offnet_duration',
 'last_100_calls_outgoing_to_abroad_duration',
 'last_100_sms_outgoing_count',
 'last_100_sms_outgoing_to_onnet_count',
 'last_100_sms_outgoing_to_offnet_count',
 'last_100_sms_outgoing_to_abroad_count',
 'last_100_gprs_usage']

The end result for this homework would be k-means cluster analysis and churn prediction model.

My biggest headache regarding this dataset is:

How to make a cluster analysis for monthly data including most of these variables? I tried to look for an example, but I either found an example on analyzing one variable per month or many variables per one month.

I am using Python and Spark.

I think I can make it work as long as I know what to do with months and a huge list of variables.

Thanks, your help will be greatly appreciated!

P.S. Would a code example be too much to ask?

Solution

Why would you use k-means here?

k-means will not do anything meaningful on such data. It's too sensitive to scaling and attribute types (e.g. year, month)
Churn prediction is a supervised problem. Never use an unsupervised algorithm for a supervised problem. That means you are ignoring the single most valueable information you have to guide the search.