My data file is the merge of many text files with a similar structure. The number of such 'data lines' per file may vary from 0 to hundreds of thousands. The goal is to add a new column that, for every row, indicates the filename it relates to.
Here is the data sample:
DF <- structure(list(domain = c("START OF NEW FILE", "94_res.txt",
"google.ru", "START OF NEW FILE", "95_res.txt", "search-results.com",
"hpc.ru", "theadgateway.com", "google.by"), count = c(NA, NA,
2L, NA, NA, 2L, 1L, 1L, 6L)), row.names = c(NA, -9L), class = "data.frame")
All my attempts to use lag or similar functions have failed. The example below is my best attempt, but it fills the new column based on the current row value instead of the previous row value.
transform(DF,
          fnameRaw = ifelse(lag(DF$domain, 1) == "START OF NEW FILE",
                            DF$domain,
                            ""))
              domain count          fnameRaw
1  START OF NEW FILE    NA START OF NEW FILE
2         94_res.txt    NA
3          google.ru     2
4  START OF NEW FILE    NA START OF NEW FILE
5         95_res.txt    NA
6 search-results.com     2
7             hpc.ru     1
8   theadgateway.com     1
9          google.by     6
Is it because my data is not a real time series? Or because some trick is needed to handle the first row, which has no 'previous' row? Or something else?
P.S. The desired output is something like this (fnameRaw is just an intermediate field I keep for the sake of comparison with the actual output):
              domain count   fnameRaw   filename
1  START OF NEW FILE    NA                   N/A
2         94_res.txt    NA 94_res.txt 94_res.txt
3          google.ru     2            94_res.txt
4  START OF NEW FILE    NA            94_res.txt
5         95_res.txt    NA 95_res.txt 95_res.txt
6 search-results.com     2            95_res.txt
7             hpc.ru     1            95_res.txt
8   theadgateway.com     1            95_res.txt
9          google.by     6            95_res.txt
The base lag (stats::lag) is intended for time series, but if you use dplyr then there is a lag that works with ordinary data frame columns.
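Just to illustrate the difference on a plain vector (this sketch is not part of the solution):

x <- c("a", "b", "c")
stats::lag(x)   # values unchanged -- only a time-series attribute is shifted
dplyr::lag(x)   # NA "a" "b" -- values shifted down by one, which is what we need here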
1) dplyr/lag  We can use lag like this:
library(dplyr)
library(zoo)

DF %>%
  mutate(filename = ifelse(lag(domain) == "START OF NEW FILE", domain, NA),  # mark the filename rows
         filename = na.locf0(filename),                                       # carry the last filename forward
         filename = ifelse(domain == "START OF NEW FILE", NA, filename))      # blank out the marker rows
giving:
              domain count   filename
1  START OF NEW FILE    NA       <NA>
2         94_res.txt    NA 94_res.txt
3          google.ru     2 94_res.txt
4  START OF NEW FILE    NA       <NA>
5         95_res.txt    NA 95_res.txt
6 search-results.com     2 95_res.txt
7             hpc.ru     1 95_res.txt
8   theadgateway.com     1 95_res.txt
9          google.by     6 95_res.txt
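The na.locf0 step from zoo is what carries each filename down to the rows that follow it: it fills each NA with the last non-NA value and leaves leading NAs alone. A quick sketch on a made-up vector:

na.locf0(c(NA, "94_res.txt", NA, NA, "95_res.txt", NA))
# NA "94_res.txt" "94_res.txt" "94_res.txt" "95_res.txt" "95_res.txt"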
2) dplyr/no lag  Without lag we can group the rows using a grouping variable g, which is 1 for the rows starting at the first START OF NEW FILE, 2 for the second set of rows, and so on.
library(dplyr)

DF %>%
  group_by(g = cumsum(domain == "START OF NEW FILE")) %>%   # one group per file block
  mutate(filename = c(NA, rep(domain[2], n() - 1))) %>%     # domain[2] is the filename row of the block
  ungroup() %>%
  select(-g)
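Just as a check of what the grouping variable looks like on the sample data (each marker row starts a new block):

cumsum(DF$domain == "START OF NEW FILE")
# 1 1 1 2 2 2 2 2 2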
2a) Base R/no lag  This base solution is similar to (2). Create a vector g with one element per row of DF whose elements are 1 starting with the first START OF NEW FILE row, 2 starting with the second, and so on. Then define a function make_filename which creates the filename column for one group defined by g. Finally, apply make_filename to each group. No packages are used.
g <- cumsum(DF$domain == "START OF NEW FILE")                 # 1 for the first file block, 2 for the second, ...
make_filename <- function(x) c(NA, rep(x[2], length(x) - 1))  # NA for the marker row, then the block's filename
transform(DF, filename = ave(DF$domain, g, FUN = make_filename))
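ave is doing a split-apply-recombine here; a rough long-hand equivalent, shown only to illustrate what ave does with g (not a better way to write it), would be:

unsplit(lapply(split(DF$domain, g), make_filename), g)
# NA "94_res.txt" "94_res.txt" NA "95_res.txt" "95_res.txt" "95_res.txt" "95_res.txt" "95_res.txt"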
The input DF in reproducible form is:
DF <- structure(list(domain = c("START OF NEW FILE", "94_res.txt",
"google.ru", "START OF NEW FILE", "95_res.txt", "search-results.com",
"hpc.ru", "theadgateway.com", "google.by"), count = c(NA, NA,
2L, NA, NA, 2L, 1L, 1L, 6L)), row.names = c(NA, -9L), class = "data.frame")