
Predicting customer intent


I got this Prospects dataset:

ID     Company_Sector         Company_size  DMU_Final  Joining_Date  Country
65656  Finance and Insurance       10        End User   2010-04-13   France
54535  Public Administration       1         End User   2004-09-22   France

and Sales dataset:

ID    linkedin_shared_connections   online_activity  did_buy   Sale_Date
65656            11                        65           1      2016-05-23
54535            13                        100          1      2016-01-12

I want to build a model that assigns to each prospect in the Prospects table the probability of becoming a customer; the model should predict whether a prospect is going to buy, and return that probability. The Sales table gives info about 2015 sales.

My approach: the 'did_buy' column should be the label in the model, since 1 means the prospect bought in 2016 and 0 means no sale. Another interesting column is 'online_activity', which ranges from 5 to 685; the higher it is, the more active the prospect is about the product. So I'm thinking of training a Random Forest model and then somehow putting the probability for each prospect in a new 'intent' column. Is a Random Forest an efficient model in this case, or should I use another one? And how can I apply the model's results to the new 'intent' column for each prospect in the first table?


Solution

  • TL;DR: Random forests are nice, but they seem inappropriate here due to the unbalanced data. You should read about recommender systems, and about more recent, well-performing models such as Google's Wide & Deep.

    The answer depends on several things: How much data do you have? What data is available at inference time? Can you see the current "online_activity" attribute of a potential sale before the customer buys? Questions like these may change the whole approach that fits your task.

    Suggestion:

    Generally speaking, this is the kind of business where you usually deal with very unbalanced data: a low number of "did_buy"=1 rows against a huge number of potential customers.

    On the data science side, you should define a valuable metric for success, one that maps to money as directly as possible. Here, it seems that taking action (advertising, or approaching the more probable customers) to raise the "did_buy" / "was_approached" ratio is a great success metric. Over time, you succeed if you raise that number.

    Another thing to take into account is that your data may be sparse. I do not know how many buys you usually get, but it may be that you have only one per country, etc. That should also be taken into consideration, since a simple random forest can easily target such a column in most of its random models, and overfitting will become a big issue; decision trees suffer from unbalanced datasets. However, taking the probability of each label in the leaf, instead of a hard decision, can sometimes be helpful for simple interpretable models, and it reflects the unbalanced data. To be honest, I do not truly believe this is the right approach.
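    As a minimal sketch of that leaf-probability idea: scikit-learn's `RandomForestClassifier` exposes it via `predict_proba` (the fraction of tree votes per class), and `class_weight="balanced"` re-weights the rare positives. The data below is synthetic, with column names taken from the question's Sales table:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n = 1000
    sales = pd.DataFrame({
        "linkedin_shared_connections": rng.integers(0, 50, n),
        "online_activity": rng.integers(5, 686, n),
    })
    # Unbalanced labels: roughly 5% buyers, as in the scenario above.
    sales["did_buy"] = (rng.random(n) < 0.05).astype(int)

    X = sales[["linkedin_shared_connections", "online_activity"]]
    y = sales["did_buy"]

    # class_weight="balanced" re-weights the minority class so the trees
    # do not simply learn to always predict 0.
    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    )
    clf.fit(X, y)

    # predict_proba returns per-class leaf/vote fractions rather than a
    # hard 0/1 decision, which reflects the imbalance.
    proba = clf.predict_proba(X)[:, 1]
    print(proba[:5])
    ```

    Even so, as noted above, this only softens the symptom; it does not make the forest the right model for the task.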

    If I were you:

    I would first embed the Prospects columns into a vector by:

    • Converting categories to random vectors (one per category) or one-hot encodings.
    • Normalizing or bucketizing company sizes into numbers that fit the prediction model (next).
    • Applying the same ideas to dates. Here, the year may be problematic, but months/days should be useful.
    • Country is definitely categorical; maybe add an extra "unknown" country class.

    Then,

    • I would use a model that can actually be optimized according to different costs. Logistic regression is a "wide" (linear) option, a deep neural network is another, or see Google's Wide & Deep for a combination.
    • Set the cost to my golden number (the money metric, expressed in terms of labels), or something as close as possible.
    • Run the experiment.
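    A minimal sketch of those steps, which also answers the "how do I fill the 'intent' column" part of the question: join the two tables on ID, fit a cost-weighted logistic regression, and write `predict_proba` back into a new column. The data and feature choice here are illustrative assumptions; with a real money metric you would replace `class_weight="balanced"` with explicit money-derived weights.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 500
    prospects = pd.DataFrame({
        "ID": np.arange(n),
        "Company_size": rng.integers(1, 500, n),
    })
    sales = pd.DataFrame({
        "ID": np.arange(n),
        "online_activity": rng.integers(5, 686, n),
    })
    # Unbalanced toy labels; force a few positives so both classes exist.
    did_buy = (rng.random(n) < 0.05).astype(int)
    did_buy[:5] = 1
    sales["did_buy"] = did_buy

    # Join the two tables on ID so features and labels line up.
    data = prospects.merge(sales, on="ID")
    X = data[["Company_size", "online_activity"]]
    y = data["did_buy"]

    # class_weight encodes the asymmetric cost of missing a buyer.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X, y)

    # Apply the model results as the new 'intent' column.
    prospects["intent"] = model.predict_proba(X)[:, 1]
    print(prospects[["ID", "intent"]].head())
    ```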

    Finally,

    • Inspect my results and understand why the model failed.
    • Suggest another model/feature.
    • Repeat.
    • Go eat lunch.
    • Ask a bunch of data questions.
    • Try to answer at least some.
    • Discover new interesting relations in the data.
    • Suggest something interesting.
    • Repeat (tomorrow).