I am using the data.table package to work with a very large data set, and value it's speed and clarity. But I am new to it and am having difficulties chaining functions together especially when working with a mixed set of data.table and base R functions. My question is, how do I chain together the below example functions, into one seamless string of code for defining the target data
object?
Below is the correct output, generated by running each line of code separately (unchained) with the generating code shown immediately beneath the output:
> data
ID Period State Values
1: 1 1 X0 5
2: 1 2 X1 0
3: 1 3 X2 0
4: 1 4 X1 0
5: 2 1 X0 1
6: 2 2 XX 0
7: 2 3 XX 0
8: 2 4 XX 0
9: 3 1 X2 0
10: 3 2 X1 0
11: 3 3 X9 0
12: 3 4 X3 0
13: 4 1 X2 1
14: 4 2 X1 2
15: 4 3 X9 3
16: 4 4 XX 0
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
)
# changes State to "XX" if remaining Values_1 + Values_2 cumulative sums = 0 for each ID:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID]
# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1]
# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?
# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?
For what it's worth, below is my attempt to chain together using '%>%' pipe operators, which fails (error message Error in data$State : object of type 'closure' is not subsettable), and though I'd rather chain together using data.table operators:
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
setDT(data)[,Values := Values_1] %>%
subset(data, select = -c(Values_1,Values_2)) %>%
data$State <- gsub('XX','HI', data$State)
If I understand correctly, the OP wants to
Value_1
to Value
(or in OP's words: create new column "Values", which equals "Values_1")Value_2
XX
by HI
in column State
Here is what I would do in data.table syntax:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
ID Period Values State 1: 1 1 5 X0 2: 1 2 0 X1 3: 1 3 0 X2 4: 1 4 0 X1 5: 2 1 1 X0 6: 2 2 0 HI 7: 2 3 0 HI 8: 2 4 0 HI 9: 3 1 0 X2 10: 3 2 0 X1 11: 3 3 0 X9 12: 3 4 0 X3 13: 4 1 1 X2 14: 4 2 2 X1 15: 4 3 3 X9 16: 4 4 0 HI
setnames()
updates by reference, e.g., without copying. There is no need to create a copy of Values_1
and delete Values_1
later on.
Also, [State == "XX", State := "HI"]
replaces XX
by HI
only in affected rows by reference while
[, State := gsub('XX','HI', State)]
replaces the whole column.
data.table chaining is used where appropriate.
BTW: I wonder why the replacement of XX
by HI
cannot be done rightaway in the first statement:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")