Search code examples
rsubstringstrsplit

How to split a single column into multiple column using range (Start-End) values in R?


I have a text file with multiple values, however, there is no delimiter to differentiate this when loaded into R. There is a secondary file that defines each of the columns based on a start and end position.

I tried to go through the solutions that already exist but could not get information on the range based delineation for multiple columns

The data looks like:

    Column1                                                 
    --------------------------------------------------------
    00000000000102019000000000000000000049491000000000004CAD   
    00000000000102019000000000000000000049491000000000005CAP    
    00000000000102019000000000000000000049491000000000023GSP  
    00000000000102019000000000000000000049491000000000030MUD

The field range is defined as:

    Field Name   | Start | End 
    --------------------------
     COL1         | 1     | 2
     COL2         | 13    | 17
     COL3         | 18    | 12
     ....

I have about 200,000 rows with having 55 columns each based on the range described above.

I am not sure how input multiple ranges to create a new dataframe with all the 55 columns split based on the start and end values.

Could anyone please assist me with this?


Solution

  • try the following code. Note that I made the code according with the data that you mention.

        example <- c("00000000000102019000000000000000000049491000000000004CAD","00000000000102019000000000000000000049491000000000004CAD")
    
        name <- c("COL1","COL2","COL3")
        start <- c(1,13,18)
        end <- c(2,17,22)
    
        # save the second file for reference
        range_df <- data.frame(Field_name=name,Start=start,End=end)
    
        # Make a function that splits according the second file
        split_cols <- function(string){
          # Get the `n` rows of the second file for iteration
          n <- nrow(range_df)
          # Declare an empty `data.frame` for save the splitted data
          cols <- data.frame(matrix(NA,1,n))
          for(i in 1:n){
            # Get the range of characters with `substr` function
            # The start and end range is defined in the `range_df`
            # column 2 is the `start` character and columns 3 the `end`
            cols[,i] <- substr(string,range_df[i,2],range_df[i,3])
            # save it in each column of the new data frame named `cols`
          }
          # Return the values of the `splitted` string as data.frame 
          return(cols)
        }
    
        # In order to apply the function above for each row you can either 
        # use `for` loop or apply function. In this case I used `lapply`
        all_data <- lapply(example,split_cols)
    
    
        # `lapply` makes the job done, however is in a `list` form. Yo can 
        # structure the data with do.call function
        final_df <- do.call("rbind",all_data)
    
        # Finally add the columns names from the secondary df
        names(final_df) <- as.character(range_df[,1])
    

    Of course this code can be improved considerably but this can get the job done.

    Hope this can help