machine-learning regression linear-regression

Click revenue prediction model

I'm trying to build a model for eCommerce that predicts the revenue of a single click arriving via online-marketing channels (e.g. Google Shopping). Clicks land on product detail pages, so my training data consists of product details such as price, delivery time, category, and manufacturer. Every historical click also has its revenue attached. The problem is that revenue equals zero for more than 95% of clicks.

Historical data would look like this:

click_id | manufacturer | category | delivery_time | price | revenue
1 | man1 | cat1 | 24 | 100 | 0
2 | man1 | cat1 | 24 | 100 | 0
3 | man1 | cat1 | 24 | 100 | 0
4 | man1 | cat1 | 24 | 100 | 120
5 | man2 | cat2 | 48 | 200 | 0

As you can see, it's possible (and common) for two data points to have exactly the same features and very different values of the target variable (revenue). For example, the first 4 data points have the same features and only the 4th has revenue. Ideally, given a test example with the same features, my model would predict the average revenue of those 4 clicks (which is 30).
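To make the "average revenue" target concrete, here is a minimal sketch in plain Python using the five sample rows above. The helper names (`mean_revenue_by_features`, `squared_error`) are made up for illustration. It also checks that, for the first four clicks, the group mean gives a lower squared error than a few other constant predictions, which is why a least-squares fit on raw click data is pulled toward that average.

```python
from collections import defaultdict

# Rows from the sample table: (manufacturer, category, delivery_time, price, revenue)
clicks = [
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 120),
    ("man2", "cat2", 48, 200, 0),
]

def mean_revenue_by_features(rows):
    """Average revenue per distinct feature combination."""
    totals = defaultdict(lambda: [0.0, 0])  # features -> [revenue sum, click count]
    for *features, revenue in rows:
        entry = totals[tuple(features)]
        entry[0] += revenue
        entry[1] += 1
    return {f: s / n for f, (s, n) in totals.items()}

def squared_error(rows, features, prediction):
    """Sum of squared errors of one constant prediction on one feature group."""
    return sum((r[-1] - prediction) ** 2 for r in rows if tuple(r[:-1]) == features)

means = mean_revenue_by_features(clicks)
man1 = ("man1", "cat1", 24, 100)
print(means[man1])                    # 30.0
print(squared_error(clicks, man1, 30))   # 10800
print(squared_error(clicks, man1, 0))    # 14400
```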

My question is about data representation before I try to apply model. I believe I have two choices:

1) Apply regression directly to click data (as in the case above) and hope that regression does the right thing. In this case the regression error would be pretty big in the end, so it would be hard to tell how good the model actually is.
2) Group multiple data points (clicks) into one single point to avoid some zeros: group all data points that have the same features and calculate the target variable (revenue) as SUM(revenue)/COUNT(clicks). With this approach I still have a lot of zeros in revenue (products that got only a few clicks), and sometimes thousands of clicks will give only one data point, which doesn't seem right.
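One way to look at the grouping concern, sketched in plain Python on the sample rows: if each aggregated point keeps its click count, the count can be used as a weight, so a group built from thousands of clicks is not treated the same as a group built from one. The `aggregate` helper is hypothetical. Many regression libraries accept per-row weights (for example the `sample_weight` argument of `fit` in scikit-learn's `LinearRegression`), though whether weighting suits your data is an assumption to verify.

```python
from collections import defaultdict

# Rows from the sample table: (manufacturer, category, delivery_time, price, revenue)
clicks = [
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 120),
    ("man2", "cat2", 48, 200, 0),
]

def aggregate(rows):
    """Collapse identical feature rows into (features, mean_revenue, click_count)."""
    groups = defaultdict(list)
    for *features, revenue in rows:
        groups[tuple(features)].append(revenue)
    return [(f, sum(v) / len(v), len(v)) for f, v in groups.items()]

grouped = aggregate(clicks)
total_clicks = sum(n for _, _, n in grouped)

# Weighting each group mean by its click count recovers the per-click average...
weighted_mean = sum(m * n for _, m, n in grouped) / total_clicks
# ...while treating every group as a single equal point does not.
unweighted_mean = sum(m for _, m, _ in grouped) / len(grouped)
per_click_mean = sum(r[-1] for r in clicks) / len(clicks)

print(weighted_mean, per_click_mean, unweighted_mean)  # 24.0 24.0 15.0
```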

How to proceed with this problem?

Solution

With 95% of your data having zero revenue, you may need to do something about the records, such as sampling. As currently constructed, your model could predict "no" 100% of the time and still be 95% accurate. You need to make a design choice about what type of error you are willing to accept in your model. Would you like it to be "as accurate as possible", in that it misclassifies the fewest possible records, to miss as few revenue records as possible, or to avoid incorrectly classifying records as revenue when they actually aren't? (Read more on Type 1 and Type 2 errors if you're curious.)
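The "always no" baseline is easy to check by hand. The sketch below uses a made-up sample of 100 clicks, 5 of which generate revenue, and counts the two error types separately; the `confusion_counts` helper is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (true_pos, false_pos, true_neg, false_neg) for boolean labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return tp, fp, tn, fn

# Hypothetical sample: 5 revenue clicks out of 100
actual = [True] * 5 + [False] * 95
always_no = [False] * 100  # a "model" that never predicts revenue

tp, fp, tn, fn = confusion_counts(actual, always_no)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)  # share of revenue clicks the model catches

print(accuracy)  # 0.95
print(recall)    # 0.0
print(fp, fn)    # 0 5  (no Type 1 errors, but every revenue click is a Type 2 error)
```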

There are a few high-level choices you could make:

1) You could over-sample your data. If you have a lot of records and want to make certain that you capture the revenue-generating features, you can either duplicate those records or do some record engineering to create "fake" records that are very similar to those that generate revenue. This will increase the likelihood that your model catches on to what is driving revenue, but it will also make the model overly likely to value those features when you apply it to real data.

2) You could use a model to predict probabilities, and then apply your own cutoff to those probabilities. For example, you may look at your model and say that anything with greater than 25% likelihood of being revenue generating is actually a "positive" case.

3) You can try and cluster the data first, as you mentioned above, and try and run a classification algorithm on the "summed" values, rather than the individual records.

4) Are there some segments that hit with >5% likelihood? Maybe build a model on those subsets.

These are all model design choices and there is no right/wrong answer - it just depends on what you are trying to accomplish.
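Options 1 and 2 can be sketched in a few lines of plain Python. Everything here is illustrative: the row layout, the duplication factor, the probabilities, and the helper names are assumptions, and the 25% cutoff is just the example figure from option 2.

```python
import random

def oversample_positives(rows, factor, seed=0):
    """Option 1: repeat each revenue-generating row `factor` times, then shuffle.

    Assumes revenue is the last field of each row.
    """
    positives = [r for r in rows if r[-1] > 0]
    negatives = [r for r in rows if r[-1] == 0]
    resampled = negatives + positives * factor
    random.Random(seed).shuffle(resampled)
    return resampled

def classify(probabilities, threshold=0.25):
    """Option 2: call a click "positive" when its predicted probability exceeds the cutoff."""
    return [p > threshold for p in probabilities]

# Hypothetical data: 95 clicks without revenue, 5 with revenue
rows = [("p%d" % i, 0) for i in range(95)] + [("p%d" % i, 120) for i in range(95, 100)]
resampled = oversample_positives(rows, factor=10)
positive_share = sum(1 for r in resampled if r[-1] > 0) / len(resampled)
print(len(resampled), round(positive_share, 3))  # 145 0.345

# Hypothetical predicted purchase probabilities from some classifier
print(classify([0.02, 0.10, 0.30, 0.60]))  # [False, False, True, True]
```

Note that a model trained on the resampled rows sees roughly a third of clicks generating revenue instead of 5%, so its raw predictions will overstate revenue on real traffic unless they are corrected for the resampling.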

Edited per your response:

Regression can be significantly impacted by outliers, so I would be a bit careful about just using a regression to predict the dollar amounts. It's very likely that the majority of your variables will have small coefficients, and the intercept will reflect the average spend. The other thing you should keep in mind is interaction terms. For example, you may be more likely to buy if you're male, and more likely if you're age 25-30, but being BOTH male and 25-30 has an outsized effect.

The reason I brought up classification is that you could first run a classification to see who is likely to buy, and then afterwards apply dollar amounts. That approach would prevent you from predicting essentially the same very small amount for every transaction.
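A minimal sketch of that two-step idea on the sample rows from the question: estimate the probability of a purchase, estimate the average revenue given a purchase, and multiply the two. Simple per-group frequencies stand in here for a real classifier and a real regression, and `two_stage_estimates` is a made-up helper name.

```python
from collections import defaultdict

# Rows from the question: (manufacturer, category, delivery_time, price, revenue)
clicks = [
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 0),
    ("man1", "cat1", 24, 100, 120),
    ("man2", "cat2", 48, 200, 0),
]

def two_stage_estimates(rows):
    """Per feature group: (P(purchase), mean revenue given purchase, expected revenue)."""
    groups = defaultdict(list)
    for *features, revenue in rows:
        groups[tuple(features)].append(revenue)
    estimates = {}
    for features, revenues in groups.items():
        purchases = [r for r in revenues if r > 0]
        p_buy = len(purchases) / len(revenues)                         # step 1: will they buy?
        avg_if_buy = sum(purchases) / len(purchases) if purchases else 0.0  # step 2: how much?
        estimates[features] = (p_buy, avg_if_buy, p_buy * avg_if_buy)
    return estimates

estimates = two_stage_estimates(clicks)
print(estimates[("man1", "cat1", 24, 100)])  # (0.25, 120.0, 30.0)
print(estimates[("man2", "cat2", 48, 200)])  # (0.0, 0.0, 0.0)
```

For the first group the product is 0.25 * 120 = 30, the same average the question asks for, but the two parts can now be modeled and evaluated separately.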