Tags: time-series, data-science, forecasting, lag, feature-engineering

How do I properly combine the social media post dataset for a product theme with the sales dataset for the same theme?


I have a social media post dataset, df1, with columns ['Date', 'total_post', 'Theme_ID', 'Theme Name', 'year', 'month'], and a sales dataset, df2, with columns ['Date', 'product_id', 'sales_dollars_value', 'sales_units_value', 'sales_lbs_value', 'Theme_ID', 'Theme Name', 'Vendor', 'year', 'month']. Sales of a product/theme should depend on the social media posts about it, since the posts act as advertising, so I want to merge the two datasets. I can merge them directly on Date and Theme_ID/Theme Name, but won't the effect of a social media post show up in the sales values only after some time? How do I include this delay as a lag feature?


Solution

  • You could calculate the cross-correlation between the post series and the sales series at several lags and use the lag with the strongest correlation.

    This is the idea in general: https://en.wikipedia.org/wiki/Cross-correlation

    For a possible implementation in Python, see the question "Cross-correlation (time-lag-correlation) with pandas?"
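The approach above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the data is synthetic, the helper `lagged_correlations`, the choice of `max_lag=6`, and the monthly frequency are assumptions made for the example, and only the column names `Date`, `Theme_ID`, `total_post`, and `sales_units_value` come from the question. It scans a range of lags, picks the one with the strongest correlation, builds the lagged post column per theme, and only then merges with sales aggregated to theme level.

```python
import numpy as np
import pandas as pd


def lagged_correlations(posts: pd.Series, sales: pd.Series, max_lag: int) -> pd.Series:
    """Pearson correlation between sales[t] and posts[t - lag], lag = 0..max_lag.

    Hypothetical helper: both series must share a regularly spaced date index.
    """
    return pd.Series(
        {lag: sales.corr(posts.shift(lag)) for lag in range(max_lag + 1)},
        name="correlation",
    )


# Synthetic monthly data for a single theme (illustration only):
# sales are constructed to follow posts with a 2-month delay.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=48, freq="MS")
posts = pd.Series(rng.poisson(100, size=48).astype(float), index=dates)
sales = (5.0 * posts.shift(2) + rng.normal(0, 1, size=48)).dropna()

# 1. Scan candidate lags and keep the one with the strongest correlation.
corrs = lagged_correlations(posts, sales, max_lag=6)
best_lag = int(corrs.abs().idxmax())

# Frames shaped like the ones in the question (reduced to the relevant columns).
df1 = pd.DataFrame({"Date": dates, "Theme_ID": 1, "total_post": posts.values})
df2 = pd.DataFrame(
    {"Date": sales.index, "Theme_ID": 1, "sales_units_value": sales.values}
)

# 2. Build the lag feature on the post data, per theme, before merging.
#    shift() moves by rows, so this assumes one row per theme per period, no gaps.
lag_col = f"total_post_lag{best_lag}"
df1 = df1.sort_values(["Theme_ID", "Date"])
df1[lag_col] = df1.groupby("Theme_ID")["total_post"].shift(best_lag)

# 3. Aggregate product-level sales to theme level, then merge on Date and Theme_ID.
sales_by_theme = df2.groupby(["Date", "Theme_ID"], as_index=False)[
    "sales_units_value"
].sum()
merged = sales_by_theme.merge(df1, on=["Date", "Theme_ID"], how="left")

print(corrs.round(3))
print(merged[["Date", "sales_units_value", "total_post", lag_col]].head())
```

Creating the lagged column before the merge keeps post values from dates that precede the first sales record. With real data, the best lag may differ between themes, so it may be worth running the scan per `Theme_ID` rather than once for the whole dataset.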