Tags: r, machine-learning, tidyverse, modeling

Feature Engineering: extracting the effect of email newsletter


I'm quite new to machine learning and applied modeling. I'm currently working on a forecasting project and collecting data for my features. I often read that using standalone features is not enough and that you should derive new features from existing ones.

Imagine a company sends out a New Year's newsletter, and I have the following tibble with the columns date and mail, where 1 in the mail column means a newsletter was sent and 0 means no newsletter was sent:

library(tidyverse)
library(lubridate)  # as_date(); only attached by library(tidyverse) from tidyverse 2.0.0 on

tibble <- tibble(date=as_date(1:31, origin="2019-12-31"),
                 mail=factor(c(1, rep(0,30))))
# A tibble: 31 x 2
   date       mail 
   <date>     <fct>
 1 2020-01-01 1    
 2 2020-01-02 0    
 3 2020-01-03 0    
 4 2020-01-04 0    
 5 2020-01-05 0    
 6 2020-01-06 0    
 7 2020-01-07 0    
 8 2020-01-08 0    
 9 2020-01-09 0    
10 2020-01-10 0    
# ... with 21 more rows

Based on the mail feature, I want to build a new feature that represents a kind of lagged mail effect, since customers neither check their mail instantly nor visit the shop and buy something right away. So the effect may last for 4 or 5 days.

I could simply set the value to 1 for the following 4 dates, but I can't imagine that this is best practice. So my question is: what is the best practice for modeling such effects?

Any suggestions are appreciated


Solution

  • If I understand you correctly, you want a column (feature) that indicates whether the newsletter was sent on the current day or at any time in the previous 4 days? The lag function can help:

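    tibble %>%
      mutate(mail_lagged =
               # 1 if a mail was sent on this row or on any of the 4 previous rows.
               # lag() pads the first rows with NA; TRUE | NA is TRUE, but
               # FALSE | NA is NA, so leading rows with no send yet would be NA.
               as.numeric(
                 mail == 1 |
                 lag(mail, 1) == 1 |
                 lag(mail, 2) == 1 |
                 lag(mail, 3) == 1 |
                 lag(mail, 4) == 1))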
    
    # A tibble: 31 x 3
       date       mail  mail_lagged
       <date>     <fct>       <dbl>
     1 2020-01-01 1               1
     2 2020-01-02 0               1
     3 2020-01-03 0               1
     4 2020-01-04 0               1
     5 2020-01-05 0               1
     6 2020-01-06 0               0
     7 2020-01-07 0               0
     8 2020-01-08 0               0
     9 2020-01-09 0               0
    10 2020-01-10 0               0
    # … with 21 more rows
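
Since the question mentions that the effect might last 4 or 5 days, it may help to make the window length a parameter instead of writing out one lag() call per day. The following is only a minimal sketch in base R, not part of the original answer: mail_window() and the mails data frame are hypothetical names introduced here, and the sketch assumes one row per day, sorted by date. Rows without enough history are treated as "not sent" rather than NA.

```r
# Hypothetical helper: returns 1 if a newsletter went out on the current row
# or on any of the previous `window` rows, otherwise 0.
# Assumes one row per day, sorted by date.
mail_window <- function(sent, window = 4) {
  sent <- as.character(sent) == "1"   # works for factor or numeric input
  n <- length(sent)
  active <- sent
  for (k in seq_len(window)) {
    # Shift the flag k rows down; rows without history count as "not sent"
    shifted <- c(rep(FALSE, min(k, n)), head(sent, max(n - k, 0)))
    active <- active | shifted
  }
  as.integer(active)
}

# Same example data as in the question, rebuilt with base R only
mails <- data.frame(
  date = as.Date("2020-01-01") + 0:30,
  mail = factor(c(1, rep(0, 30)))
)

mails$mail_lagged <- mail_window(mails$mail, window = 4)
head(mails, 7)
```

The same function could also be called inside mutate(), for example mutate(mail_lagged = mail_window(mail, 4)). Trying window = 4 and window = 5 as separate features and comparing model performance is one way to choose between them.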