I'm quite new to machine learning and applied modeling. I'm currently working on a forecasting project and collecting various data for my features. I often read that choosing standalone features is not enough and that you should derive new features from existing ones.
Imagine a company sends out a New Year's newsletter, and I have the following tibble with the columns date and mail, where 1 in the mail column means a newsletter was sent and 0 means no newsletter was sent:
library(tidyverse)
library(lubridate)  # provides as_date(); older tidyverse versions do not attach it
tibble <- tibble(date = as_date(1:31, origin = "2019-12-31"),
                 mail = factor(c(1, rep(0, 30))))
tibble
# A tibble: 31 x 2
   date       mail
   <date>     <fct>
 1 2020-01-01 1
 2 2020-01-02 0
 3 2020-01-03 0
 4 2020-01-04 0
 5 2020-01-05 0
 6 2020-01-06 0
 7 2020-01-07 0
 8 2020-01-08 0
 9 2020-01-09 0
10 2020-01-10 0
# ... with 21 more rows
Based on the mail feature, I want to build a new feature that represents a kind of lagged mail effect, since customers neither check their mail instantly nor visit the shop and buy something right away. So the effect may last for 4 or 5 days.
I could simply add 1 to the following 4 dates, but I cannot imagine that this would be best practice. So my question is: what is the best practice for modeling such effects?
Any suggestions are appreciated.
If I understand you correctly, you want a column (feature) that indicates whether the newsletter was sent at any time in the last 4 days? The lag function from dplyr can help:
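tibble %>%
  mutate(mail_lagged =
           as.numeric(
             # 1 if a newsletter went out today or on any of the 4 previous days;
             # default = FALSE treats days before the data starts as "no mail",
             # so the leading rows never turn into NA
             mail == 1 |
               lag(mail == 1, 1, default = FALSE) |
               lag(mail == 1, 2, default = FALSE) |
               lag(mail == 1, 3, default = FALSE) |
               lag(mail == 1, 4, default = FALSE)))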
# A tibble: 31 x 3
   date       mail  mail_lagged
   <date>     <fct>       <dbl>
 1 2020-01-01 1               1
 2 2020-01-02 0               1
 3 2020-01-03 0               1
 4 2020-01-04 0               1
 5 2020-01-05 0               1
 6 2020-01-06 0               0
 7 2020-01-07 0               0
 8 2020-01-08 0               0
 9 2020-01-09 0               0
10 2020-01-10 0               0
# … with 21 more rows
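If you are not sure yet whether the effect lasts 4 or 5 days, it may be handier to make the window length a parameter instead of writing out each lag by hand. The following is only a sketch of that idea: the helper name lagged_flag, its window argument, and the data frame name newsletter are my own choices for illustration and not part of any package.

```r
library(tidyverse)

# Hypothetical helper (illustration only): returns 1 if `flag` is TRUE on the
# current row or on any of the previous `window` rows, otherwise 0.
lagged_flag <- function(flag, window) {
  # One shifted copy of `flag` per lag 1..window; rows before the start of
  # the data are filled with FALSE instead of NA.
  shifted <- lapply(seq_len(window), function(k) dplyr::lag(flag, n = k, default = FALSE))
  # Combine the unshifted flag and all shifted copies with a logical OR.
  as.numeric(Reduce(`|`, shifted, flag))
}

# Same data as in the question, built with base R dates.
newsletter <- tibble(
  date = as.Date("2020-01-01") + 0:30,
  mail = factor(c(1, rep(0, 30)))
)

result <- newsletter %>%
  mutate(mail_lagged = lagged_flag(mail == 1, window = 4))
```

With window = 4 this should give the same mail_lagged column as the hand-written version; switching to window = 5 lets you compare both candidate features in your model.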