Search code examples
rlag

What to use instead of 'lag' if my data is not time series but just large set of text rows


My datafile is the merge of many text files with a similar structure:

  1. line "START OF NEW FILE ",
  2. line with the filename,
  3. 'data lines' with two columns: domain and number of events

The number of such 'data lines' per file may vary from 0 to hundred of thousands The goal is to add new column that for every row will indicate filename it relates to

Here is the data sample:

 DF <- structure(list(domain = c("START OF NEW FILE", "94_res.txt", 
"google.ru", "START OF NEW FILE", "95_res.txt", "search-results.com", 
"hpc.ru", "theadgateway.com", "google.by"), count = c(NA, NA, 
2L, NA, NA, 2L, 1L, 1L, 6L)), row.names = c(NA, -9L), class = "data.frame")

All my attempts to use lag or similar functions failed. The below example is my best attempt, but fills the new column based on current row value instead of previous row value.

transform(test.df,
          fnameRaw = ifelse(lag(test.df$domain, 1) == "START OF NEW FILE ",
                            test.df$domain,
                            ""))
                                             domain count           fnameRaw
1                                START OF NEW FILE        START OF NEW FILE 
2                                        94_res.txt                         
3                                         google.ru     2                   
4                                START OF NEW FILE        START OF NEW FILE 
5                                        95_res.txt                         
6                                search-results.com     2                   
7                                            hpc.ru     1                   
8                                  theadgateway.com     1                   
9                                         google.by     6  

Is it because my data is not real time series? Or because some tricks should be added to solve the problem with 1st row missing 'previous' row? Or something else?

P.S. The desired output is smth like this (fnameRaw is just intermediate field I keep for sake of comparison with actual output)

                domain  count     fnameRaw    filename
1     START OF NEW FILE                             N/A
2            94_res.txt           94_res.txt  94_res.txt
3             google.ru       2               94_res.txt
4     START OF NEW FILE                       94_res.txt
5            95_res.txt           95_res.txt  95_res.txt
6    search-results.com       2               95_res.txt
7                hpc.ru       1               95_res.txt
8      theadgateway.com       1               95_res.txt
9             google.by       6               95_res.txt

Solution

  • lag is intended for time series but if you use dplyr then there is a lag that works with data frames.

    1) dplyr/lag We can use lag like this:

    library(dplyr)
    library(zoo)
    
    DF %>% 
       mutate(filename = ifelse(lag(domain) == "START OF NEW FILE", domain, NA),
         filename = na.locf0(filename),
         filename = ifelse(domain == "START OF NEW FILE", NA, filename))
    

    giving:

                  domain count   filename
    1  START OF NEW FILE    NA       <NA>
    2         94_res.txt    NA 94_res.txt
    3          google.ru     2 94_res.txt
    4  START OF NEW FILE    NA       <NA>
    5         95_res.txt    NA 95_res.txt
    6 search-results.com     2 95_res.txt
    7             hpc.ru     1 95_res.txt
    8   theadgateway.com     1 95_res.txt
    9          google.by     6 95_res.txt
    

    2) dplyr/no lag Without lag we can group the rows using grouping variable g which is 1 for the rows starting at the first START OF NEW FILE, 2 for the second set of rows and so on.

    library(dplyr)
    
    DF %>%
       group_by(g = cumsum(domain == "START OF NEW FILE")) %>%
       mutate(filename = c(NA, rep(domain[2], n()-1))) %>%
       ungroup %>%
       select(-g)
    

    2a) Base R/no lag This base solution is similar to (2). Create a vector g with one element per row of DF whose elements are 1 starting with the first START OF NEW FILE row, 2 starting with the second and so on. Then define a function make_filename which creates the filename column for one group defined by g. Finally apply make_function to each group. No packages are used.

    g <- cumsum(DF$domain == "START OF NEW FILE")
    make_Filename <- function(x) c(NA, rep(x[2], length(x) - 1))
    transform(DF, filename = ave(DF$domain, g, FUN = make_filename))
    

    Note

    The input DF in reproducible form is:

    DF <- structure(list(domain = c("START OF NEW FILE", "94_res.txt", 
    "google.ru", "START OF NEW FILE", "95_res.txt", "search-results.com", 
    "hpc.ru", "theadgateway.com", "google.by"), count = c(NA, NA, 
    2L, NA, NA, 2L, 1L, 1L, 6L)), row.names = c(NA, -9L), class = "data.frame")