Search code examples
rdataframevectorizationsubstringvariable-length

Creating a dataframe by concatenating substrings of non-uniform lengths


The raw data (.txt) file has 65926 elements, each containing a 142 character string.

Here's the dropbox link to the raw data file.

The task is to divide that 142 character string into 37 smaller strings (each substring being a separate character variable), each one beginning sequentially at following positions: 1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,75,76,77,78,79,80,127,130,133

The final output must be a 65926 x 37 dataframe. Here's a snapshot of the output dataframe:

Here's a snapshot of the output dataframe.

Here's the code I'm using:

x <- readLines("R71252L01.TXT")
a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,75,76,77,78,79,80,127,130,133)
z <- data.frame(matrix(nrow = length(x), ncol = length(a)), stringsAsFactors = FALSE)
for (i in 1:length(x) ) {
z[i,] <- (list(
            (c(substr(x[i], 1, 3),substr(x[i], 4, 8),substr(x[i], 9, 10),
               substr(x[i], 11, 13),substr(x[i], 14, 14),substr(x[i], 15, 15),
               substr(x[i], 16, 18),substr(x[i], 19, 20),substr(x[i], 21, 23),
               substr(x[i], 24, 25),substr(x[i], 26, 26),substr(x[i], 27, 27),
               substr(x[i], 28, 31),substr(x[i], 32, 32),substr(x[i], 33, 33),
               substr(x[i], 34, 35),substr(x[i], 36, 37),substr(x[i], 38, 42),
               substr(x[i], 43, 44),substr(x[i], 45, 45),substr(x[i], 46, 46),
               substr(x[i], 47, 47),substr(x[i], 48, 51),substr(x[i], 52, 55),
               substr(x[i], 56, 56),substr(x[i], 60, 65),substr(x[i], 66, 71),
               substr(x[i], 72, 74),substr(x[i], 75, 75),substr(x[i], 76, 76),
               substr(x[i], 77, 77),substr(x[i], 78, 78),substr(x[i], 79, 79),
               substr(x[i], 80, 126),substr(x[i], 127, 129),substr(x[i], 130, 132),
               substr(x[i], 133, 142)
            ) )
        )    )
i <- i+1
}

The code works but there are two issues:

  1. The start and stop indices for substr() have to be manually typed out. Any way to utilize the vector a instead of all that manual labour?
  2. The code takes over 30 minutes to execute. Checking time:

    > system.time(source('Hitesh_Script.R'))
        user   system  elapsed 
    4452.464    9.440 4476.018 
    

    Can this be done faster?

I have to perform this task for several raw data files, each with a different vector a. Hence any other suggestions for efficiency will also be appreciated. Thanks a ton!


Solution

  • I just simulated a scratch data and shared a simple code

    a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,
           56,60,66,72,75,76,77,78,79,80,127,130,133)
    df = data.frame(
      x = c("uiagdsjgcjkh bijacydgasxdhsfkajdh,cnfwkeyrg,urnyhvguirwljbhgkjgjgdkgkdgkgdkgdkgdkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,ynh lgdiyl", 
            "kjhfkjsdlfkojjhgckjasnhjhckjsybsanhdsabtgchbtsjahasijhcndkuysefiuwyhsnidxjnkausetfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwyftacbm"))
    

    # > df
    # x
    # 1 uiagdsjgcjkh bijacydgasxdhsfkajdh,cnfwkeyrg,urnyhvguirwljbhgkjgjgdkgkdgkgdkgdkgdkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,ynh lgdiyl
    # 2 kjhfkjsdlfkojjhgckjasnhjhckjsybsanhdsabtgchbtsjahasijhcndkuysefiuwyhsnidxjnkausetfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwyftacbm
    

    df1 <- data.frame(lapply(df, function(x) lapply(seq_along(a), function(i){
                                                  if (i==length(a))
                                                    substr(x,a[i],nchar(as.character(x)))
                                                  else
                                                    substr(x,a[i],a[i+1]-1)}
                                            )))
    colnames(df1)=paste0("x",1:dim(df1)[2])
    
    df1
    #    x1    x2 x3  x4 x5 x6  x7 x8  x9 x10 x11 x12  x13 x14 x15 x16 x17   x18 x19 x20 x21
    # 1 uia gdsjg cj kh   b  i jac yd gas  xd   h   s fkaj   d   h  ,c  nf wkeyr  g,   u   r
    # 2 kjh fkjsd lf koj  j  h gck ja snh  jh   c   k jsyb   s   a  nh  ds abtgc  hb   t   s
    
    #   x22  x23  x24  x25    x26    x27 x28 x29 x30 x31 x32 x33
    # 1   n yhvg uirw ljbh gkjgjg dkgkdg kgd   k   g   d   k   g
    # 2   j ahas ijhc ndku ysefiu wyhsni dxj   n   k   a   u   s
    
    #                                               x34 x35 x36 x37
    # 1 dkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,y nh  lgd iyl
    # 2 etfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwy fta cbm