I am struggling with this issue that I am unable to fix! I have to data-frames ("Master" and " "Hours"). The 'Master' df has many columns but the ones in particular are as follows below:
Master
StoreNumber ... MON TUE WED THU FRI SAT SUN
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
...
NB: The Master df has many columns in between StoreNumber as the days of the week and holds a lot of data (about 3000 stores)
Hours
BranchNumber Day TimeDiff
1 MON 7.50
1 TUE 6.00
1 WED 8.50
1 THU 2.00
1 FRI 1.00
1 SAT 2.50
3 MON 7.50
3 TUE 6.00
3 WED 8.50
3 THU 2.00
3 FRI 1.00
3 SAT 2.50
3 SUN 5.00
...
So the idea is that I am trying to match the 'Hours' BrandNumber with the 'Master' StoreNumber. Once there is a match then it matched the Day column from the 'Hours' table with the Days of the week in the 'Masters' Table...It will do this for each row and then populate days of the week with the corresponding value in 'TimeDiff' column...if the store and branch number has no match (Like StoreNumber 2) then it should skip that row and move onto the next. Another condition, like BranchNumber '1' there is no data for SUNDAY so in the 'Master' table the SUNDAY cell should be left as 0...this should work for any day of the week.
The output should be the 'Master' Table but complete with all the days of the week data from the 'Hours' Table. In this example, it should look like:
StoreNumber ... MON TUE WED THU FRI SAT SUN
1 7.50 6.00 8.50 2.00 1.00 2.50 0
2 0 0 0 0 0 0 0
3 7.50 6.00 8.50 2.00 1.00 2.50 5.00
...
The code I have tried is semi-working but I am not sure if its the correct approach. The biggest problem I am getting is that its duplicating the rows expect from the first row. For instance, the output looks more like this.
StoreNumber
1
2
2
3
3
4
4
5
5
5
all are duplicating and some tripling and every 87 columns are identical...however the days of the week of the duplicated row are all 0's.
merged <- Master %>% select(-c("MON","TUE","WED","THU","FRI","SAT","SUN")) %>%
left_join(
Hours %>% pivot_wider(names_from = Day, values_from = TimeDiff),
by = c('StoreNumber' = 'BranchNumber'))
merged <- merged %>% replace(is.na(.),0)
Sorry for the long question, this issue has been bugging me a while so any help/advice will be grateful
If I understand correctly, the Master
table has many columns and only the columns MON
to SUN
need to be updated.
Here are two approaches which uses data.table
's ability to update in a join. Only the relevant columns are modified by reference, i.e., without copying the whole data object. It avoids to reshape (or pivot) the Master
table forth and back.
library(data.table)
days <- names(Master)[which(names(Master) == "MON") + (0:6)]
setDT(Master)[, (days) := lapply(.SD, as.double), .SDcols = days]
for (d in days) {
Master[Hours, on =.(StoreNumber = BranchNumber), (d) := TimeDiff[d == Day], by = .EACHI]
}
Master[]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN 1: 1 a 7.5 6 8.5 2 1 2.5 0 2: 2 b 0.0 0 0.0 0 0 0.0 0 3: 3 c 7.5 6 8.5 2 1 2.5 5
days
contains the names of the columns.days <- names(Master)[which(names(Master) == "MON") + (0:6)]
is equivalent todays <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
data.table
requires consistent data types when parts of a column are updated. The day columns in Master
are initialized to integer zero but TimeDiff
in Hours
is numeric. Therefore, the day columns in Master
are coerced to double before updating.for
loop iterates over each day column and performs the update join for this column. For each match (by = .EACHI
), the Timediff
for the relevant day is picked.In order to verify that Master
has not been copied we can call
data.table::address(Master)
before and after the operation: The address of Master
has not changed.
This approach is a bit leaner. It also uses an update join but it is different to variant 1 as it reshapes (or pivots) Hours
from long to wide format and removes the days columns from Master
instead of coercing a bunch of integer zeroes to type numeric:
library(data.table)
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
Hours_wide <- dcast(setDT(Hours)[, Day := ordered(Day, levels = days)], BranchNumber ~ Day)
setDT(Master)[, (days) := NULL][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := mget(paste0("i.", days))]
Master[]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN 1: 1 a 7.5 6 8.5 2 1 2.5 NA 2: 2 b NA NA NA NA NA NA NA 3: 3 c 7.5 6 8.5 2 1 2.5 5
Note that missing elements are now initialized to / indicated by NA
which is much easier to detect, IMHO. If required, the NA
s can be turned into another numeric value by
Master[, (days) := lapply(.SD, nafill, fill = 0), .SDcols = days][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN 1: 1 a 7.5 6 8.5 2 1 2.5 0 2: 2 b 0.0 0 0.0 0 0 0.0 0 3: 3 c 7.5 6 8.5 2 1 2.5 5
This approach uses mget(paste0("i.", days))
to pick the days columns from Hours
. If there are columns with the same name in both data.tables in a join, we can distinguish the columns by prepending the column names by a x.
and i.
, resp. Thus, x.MON
refers to the MON
column from the first data.table which is Master
in this case and i.MON
refers to the MON
column from the second data.table which is Hours_wide
. mget()
takes the column names as character strings and returns a list of the values of the respective columns.
Above code can be simplified by
setDT(Master)[, (days) := NULL][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := mget(days)][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN 1: 1 a 7.5 6 8.5 2 1 2.5 NA 2: 2 b NA NA NA NA NA NA NA 3: 3 c 7.5 6 8.5 2 1 2.5 5
Because setDT(Master)[, (days) := NULL]
already has removed the columns MON
to SUN
from Master
there is no ambiguity on column names. Thus, the column names MON
to SUN
can be used without prepending them by i.
as the only columns named MON
to SUN
are in Hours_wide
.
With development version 1.14.1 as of 2021-05-10, a new interface for programming on data.table has been added (see item 10 in NEWS and the new vignette programming on data.table). Instead of get()
/ mget()
the new env
argument is recommended:
library(data.table) # development version 1.14.1 used
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
Hours_wide <- dcast(setDT(Hours)[, Day := ordered(Day, levels = days)], BranchNumber ~ Day)
setDT(Master)[, (days) := NULL][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := s,
env = list(s = as.list(days))][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN 1: 1 a 7.5 6 8.5 2 1 2.5 NA 2: 2 b NA NA NA NA NA NA NA 3: 3 c 7.5 6 8.5 2 1 2.5 5
env
parameter and fcoalesce()
OP's expected result shows 0
instead of NA
. With the Variants 2 above, this was was achieved by a separate update step using nafill()
.
This separate update step can be avoided by using the fcoalesce()
function in the update join:
library(data.table) # development version 1.14.1 used
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
Hours_wide <- dcast(setDT(Hours)[, Day := ordered(Day, levels = days)], BranchNumber ~ Day)
setDT(Master)[, (days) := lapply(.SD, as.double), .SDcols = days][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := lapply(s, fcoalesce, 0),
env = list(s = as.list(paste0("i.", days)))][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN 1: 1 a 7.5 6 8.5 2 1 2.5 0 2: 2 b 0.0 0 0.0 0 0 0.0 0 3: 3 c 7.5 6 8.5 2 1 2.5 5
library(data.table)
Master <- fread("
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1 a 0 0 0 0 0 0 0
2 b 0 0 0 0 0 0 0
3 c 0 0 0 0 0 0 0
", data.table = FALSE)
Hours <- fread("
BranchNumber Day TimeDiff
1 MON 7.50
1 TUE 6.00
1 WED 8.50
1 THU 2.00
1 FRI 1.00
1 SAT 2.50
3 MON 7.50
3 TUE 6.00
3 WED 8.50
3 THU 2.00
3 FRI 1.00
3 SAT 2.50
3 SUN 5.00
", data.table = FALSE)