Long-time answer-seeker, first-time question-asker. I have an R data frame with a single column of 267,000 rows, stored as a factor with 17 levels, like so:
regions
VE
PU
PR
DE
NU
AD
DE
NO
AD
I'm trying to extract these as sequences spread across columns, with lengths of 2 and 3: take a window, move down by 1 row, and repeat until the end. Repeats and order must be preserved. I want to take the above and make it look like this:
s1 s2
VE PU
PU PR
PR DE
DE NU
NU AD
AD DE
DE NO
I've tried packages like TraMineR and arulesSequences, but I can't figure them out. I think it's because my sequences are purely states; there's no temporal information attached, not even in the source dataset. I also tried writing my own iterator scripts, but couldn't get them to work. I've googled endlessly, and I'm at my wits' end. The eventual goal is to match the outputs against a data frame of 2- or 3-permutations, mark matches with a 1 and non-matches with a 0, and repeat that 49 times into a new data frame.
I'm no expert in programming or R, just a novice user. Does anyone know a script or package that can do this?
What you basically want to do is assign regions without the last observation to s1, and regions without the first observation to s2. You don't necessarily need extra packages for that. There are several approaches:
1) Using the head and tail functions
With these you can get a vector without its last observation (head(column, -1)) or without its first observation (tail(column, -1)).
Using:
new.df <- data.frame(s1 = head(df$regions,-1), s2 = tail(df$regions,-1))
will thus get you:
> new.df
  s1 s2
1 VE PU
2 PU PR
3 PR DE
4 DE NU
5 NU AD
6 AD DE
7 DE NO
8 NO AD
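To try this out without the full dataset, here is a minimal reproducible sketch. It assumes the column is a factor (as the question suggests) and rebuilds df from the nine sample values shown in the question:

```r
# Sample data copied from the question; stored as a factor on purpose
df <- data.frame(
  regions = c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD"),
  stringsAsFactors = TRUE
)

# Pair each observation with the one that follows it
new.df <- data.frame(s1 = head(df$regions, -1),
                     s2 = tail(df$regions, -1))

nrow(new.df)  # 8: one row fewer than df
```

Both head and tail keep the factor levels, so s1 and s2 are still factors with the same levels as regions.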
If you want three columns, you could do:
new.df <- data.frame(s1 = head(df$regions, -2),
                     s2 = head(tail(df$regions, -1), -1),
                     s3 = tail(df$regions, -2))
which results in:
> new.df
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
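Writing out one head/tail call per column gets tedious for longer windows. The same idea can be wrapped in a small function; sliding_windows below is a hypothetical helper name of my own (not from any package), and this is only a sketch that assumes x is a plain vector or factor:

```r
# Hypothetical helper: build every length-n window of x, one window per row
sliding_windows <- function(x, n) {
  stopifnot(n >= 1, n <= length(x))
  m <- length(x) - n + 1                       # number of complete windows
  # Column i holds x shifted down by (i - 1) positions
  cols <- lapply(seq_len(n), function(i) x[i:(i + m - 1)])
  names(cols) <- paste0("s", seq_len(n))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

x <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
w2 <- sliding_windows(x, 2)   # 8 rows, columns s1 and s2
w3 <- sliding_windows(x, 3)   # 7 rows, columns s1, s2 and s3
```

With the real data you would call it as sliding_windows(df$regions, 3).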
2) Using basic subsetting
As an alternative to the head and tail functions, you could also use basic subsetting:
new.df <- data.frame(s1 = df$regions[-nrow(df)],
                     s2 = df$regions[-1])
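The subsetting approach extends to three columns by dropping two positions per column. A minimal sketch, again rebuilding df from the question's sample values:

```r
# Sample data copied from the question
df <- data.frame(
  regions = c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD"),
  stringsAsFactors = TRUE
)

n <- nrow(df)
new.df <- data.frame(s1 = df$regions[-c(n - 1, n)],  # drop the last two
                     s2 = df$regions[-c(1, n)],      # drop the first and the last
                     s3 = df$regions[-c(1, 2)])      # drop the first two
```

Negative indices remove elements, so every column ends up with n - 2 values and the rows line up as consecutive triples.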
3) Using the embed function
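n <- 3
# embed() only accepts vectors and matrices, so convert a factor column to character first
new.df <- data.frame(embed(as.character(df$regions), n)[, n:1])
names(new.df) <- paste0('s', 1:n)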
which gives:
> new.df
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
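The [, n:1] part is there because embed puts the most recent value in the first column, so the columns come out in reverse order. A short illustration on the first four sample values:

```r
x <- c("VE", "PU", "PR", "DE")

m <- embed(x, 2)
m
#      [,1] [,2]
# [1,] "PU" "VE"
# [2,] "PR" "PU"
# [3,] "DE" "PR"

# Reversing the column order restores the original reading direction
m_fixed <- m[, 2:1]
```

After the reversal each row reads left to right in the same order as the original column.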
4) Using the shift function from the data.table package
The shift function from the data.table package might also be an option:
library(data.table)
dt <- as.data.table(df)
new.dt <- na.omit(dt[, .(s1 = regions,
                         s2 = shift(regions, 1, NA, 'lead'),
                         s3 = shift(regions, 2, NA, 'lead'))])
And instead of na.omit, you could also use rowSums on is.na:
new.dt <- dt[, .(s1 = regions,
                 s2 = shift(regions, 1, NA, 'lead'),
                 s3 = shift(regions, 2, NA, 'lead'))]
new.dt[rowSums(is.na(new.dt))==0]
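As for the eventual goal mentioned in the question (flagging which pairs also occur in a data frame of permutations), one possible base R sketch is below. The lookup table and its contents are made up purely for illustration, and I'm assuming that a "match" means both columns agree in the same order:

```r
x <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
pairs <- data.frame(s1 = head(x, -1), s2 = tail(x, -1),
                    stringsAsFactors = FALSE)

# Hypothetical permutation table (illustrative values only)
lookup <- data.frame(s1 = c("VE", "DE", "AD"),
                     s2 = c("PU", "NO", "VE"),
                     stringsAsFactors = FALSE)

# Collapse each row into a single key so rows can be compared with %in%
key <- function(d) paste(d$s1, d$s2, sep = "_")

# 1 where the pair occurs in lookup, 0 otherwise
pairs$match <- as.integer(key(pairs) %in% key(lookup))
```

Here only the VE/PU and DE/NO rows get a 1. The separator in paste just needs to be a string that never occurs inside a region code.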