I have a text file with multiple values, however, there is no delimiter to differentiate this when loaded into R. There is a secondary file that defines each of the columns based on a start and end position.
I tried to go through the solutions that already exist but could not get information on the range based delineation for multiple columns
The data looks like:
Column1
--------------------------------------------------------
00000000000102019000000000000000000049491000000000004CAD
00000000000102019000000000000000000049491000000000005CAP
00000000000102019000000000000000000049491000000000023GSP
00000000000102019000000000000000000049491000000000030MUD
The field range is defined as:
Field Name | Start | End
--------------------------
COL1 | 1 | 2
COL2 | 13 | 17
COL3 | 18 | 12
....
I have about 200,000 rows with having 55 columns each based on the range described above.
I am not sure how input multiple ranges to create a new dataframe with all the 55 columns split based on the start and end values.
Could anyone please assist me with this?
try the following code. Note that I made the code according with the data that you mention.
example <- c("00000000000102019000000000000000000049491000000000004CAD","00000000000102019000000000000000000049491000000000004CAD")
name <- c("COL1","COL2","COL3")
start <- c(1,13,18)
end <- c(2,17,22)
# save the second file for reference
range_df <- data.frame(Field_name=name,Start=start,End=end)
# Make a function that splits according the second file
split_cols <- function(string){
# Get the `n` rows of the second file for iteration
n <- nrow(range_df)
# Declare an empty `data.frame` for save the splitted data
cols <- data.frame(matrix(NA,1,n))
for(i in 1:n){
# Get the range of characters with `substr` function
# The start and end range is defined in the `range_df`
# column 2 is the `start` character and columns 3 the `end`
cols[,i] <- substr(string,range_df[i,2],range_df[i,3])
# save it in each column of the new data frame named `cols`
}
# Return the values of the `splitted` string as data.frame
return(cols)
}
# In order to apply the function above for each row you can either
# use `for` loop or apply function. In this case I used `lapply`
all_data <- lapply(example,split_cols)
# `lapply` makes the job done, however is in a `list` form. Yo can
# structure the data with do.call function
final_df <- do.call("rbind",all_data)
# Finally add the columns names from the secondary df
names(final_df) <- as.character(range_df[,1])
Of course this code can be improved considerably but this can get the job done.
Hope this can help