Mining sequences from data frame rows

Long-time answer-seeker, first-time question-asker. I have an R data frame with a single column of 267,000 rows, stored as a factor with 17 levels, like so:

regions
VE
PU
PR
DE
NU
AD
DE
NO
AD

I'm attempting to extract these as column sequences of length 2 and 3: take a sequence, move down by 1 row, and repeat until the end, keeping repeats and preserving order. I want to take the above and make it look like this:

s1   s2
VE   PU
PU   PR
PR   DE
DE   NU
NU   AD
AD   DE
DE   NO
NO   AD

I've tried using packages like TraMineR and arulesSequences, but I can't figure them out. I think it's because my sequences are purely states: there's no temporal information attached, not even in the source dataset. I also tried writing my own iterator scripts, but I couldn't get them to work. I've googled endlessly, and I'm at my wits' end. The eventual goal is to match the outputs against a data frame of 2- or 3-element permutations, binarize the result (1 for a match, 0 for no match), and process that x49 into a new data frame.

I'm no expert in programming or R, just a novice user. Does anyone know a script or package that can do this?

Solution

What you basically want to do is assign regions without the last observation to s1 and regions without the first observation to s2. You don't necessarily need extra packages for that. There are several approaches:

1) Using the head and tail functions

With these you can get vectors without the last observation (head(column, -1)) or without the first observation (tail(column, -1)).
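To see what the negative index does, here is a small sketch on a toy vector (the vector x is made up for illustration, it is not the asker's data):

```r
# Toy vector, not the asker's data
x <- c("a", "b", "c", "d")

without_last  <- head(x, -1)   # drops the last element:  "a" "b" "c"
without_first <- tail(x, -1)   # drops the first element: "b" "c" "d"

# Lining the two up element-wise puts each value next to its successor
pairs <- paste(without_last, without_first)   # "a b" "b c" "c d"
```

Both results have one element fewer than x, which is why they fit side by side in a data frame.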

Using:

new.df <- data.frame(s1 = head(df$regions,-1), s2 = tail(df$regions,-1))

will thus get you:

> new.df
  s1 s2
1 VE PU
2 PU PR
3 PR DE
4 DE NU
5 NU AD
6 AD DE
7 DE NO
8 NO AD

If you want three columns, you could do:

new.df <- data.frame(s1 = head(df$regions,-2), 
                     s2 = head(tail(df$regions,-1),-1),
                     s3 = tail(df$regions,-2))

which results in:

> new.df
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD

2) Basic subsetting

As an alternative to the head and tail functions, you could also use basic subsetting:

new.df <- data.frame(s1 = df$regions[-nrow(df)], 
                     s2 = df$regions[-1])
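The same indexing idea extends to any sequence length. A minimal sketch, assuming a hypothetical helper called make_windows (the name and the function are illustrative, not part of base R):

```r
# Hypothetical helper: all sequences of length n over a vector x
make_windows <- function(x, n) {
  x <- as.character(x)             # accepts factor or character input
  n_rows <- length(x) - n + 1      # number of complete sequences
  stopifnot(n >= 1, n_rows >= 1)
  # column i holds x shifted down by i - 1 positions
  cols <- lapply(seq_len(n), function(i) x[i:(i + n_rows - 1)])
  names(cols) <- paste0("s", seq_len(n))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Sample values from the question
regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
pairs   <- make_windows(regions, 2)   # 8 rows, columns s1 and s2
triples <- make_windows(regions, 3)   # 7 rows, columns s1, s2 and s3
```

Calling make_windows(df$regions, 2) on the real column should work the same way, since the helper converts factors to character first.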

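3) Using the embed function

The embed function builds all shifted columns in one call. It returns them in reverse order, hence the n:1 column index, and it does not accept factors, so a factor column has to be converted to character first:

n <- 3
new.df <- data.frame(embed(as.character(df$regions), n)[, n:1])
names(new.df) <- paste0('s', 1:n)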

which gives:

> new.df
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
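The column reversal is easier to follow on a short vector. A small sketch using the first five values from the question:

```r
regions <- c("VE", "PU", "PR", "DE", "NU")

raw <- embed(regions, 3)
# Each row of raw is one sequence, with the latest value first:
#   "PR" "PU" "VE"
#   "DE" "PR" "PU"
#   "NU" "DE" "PR"

# Reversing the columns restores the original reading order
ordered <- raw[, 3:1]
#   "VE" "PU" "PR"
#   "PU" "PR" "DE"
#   "PR" "DE" "NU"
```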

4) Using the shift function from the data.table package

The shift function from the data.table package might also be an option:

library(data.table)
dt <- as.data.table(df)
new.dt <- na.omit(dt[, .(s1 = regions,
                         s2 = shift(regions, n = 1, fill = NA, type = 'lead'),
                         s3 = shift(regions, n = 2, fill = NA, type = 'lead'))])

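Instead of na.omit, you could also build the full table first and then keep only the rows in which rowSums of is.na is zero:

new.dt <- dt[, .(s1 = regions,
                 s2 = shift(regions, n = 1, fill = NA, type = 'lead'),
                 s3 = shift(regions, n = 2, fill = NA, type = 'lead'))]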

new.dt[rowSums(is.na(new.dt))==0]
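As for the matching step mentioned in the question, one possible reading is: flag each candidate pair with 1 if it occurs anywhere in the extracted sequences, and 0 otherwise. A rough base R sketch under that assumption (the candidates data frame and the key helper are made up for illustration):

```r
# Sample values from the question and the pairs built from them
regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
pairs <- data.frame(s1 = head(regions, -1),
                    s2 = tail(regions, -1),
                    stringsAsFactors = FALSE)

# Hypothetical candidate pairs to look for
candidates <- data.frame(s1 = c("VE", "DE", "AD", "NO"),
                         s2 = c("PU", "NU", "VE", "AD"),
                         stringsAsFactors = FALSE)

# Collapse each row into a single string so rows can be compared with %in%
key <- function(d) paste(d$s1, d$s2, sep = "_")

# 1 if the candidate pair occurs among the extracted pairs, 0 otherwise
candidates$found <- as.integer(key(candidates) %in% key(pairs))
```

Here only the made-up pair AD, VE never occurs in the sample, so it is the only one flagged 0. For triples the same idea applies with a third column added to the key.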