Search code examples
rfunctionsubstr

In R, How to split a datset into columns based on variable start/stop positions?


I would like to split rows of a dataset based on many pairs of start/stop positions within the row

I can do this with the command substr for each one in longhand fashion but this seems like a poor choice.

I actually have 7 datasets for which I need to do this, I am hoping there is a way of defining an array/vector with the start/stop pairs and then feed this to the substr command.

Any help guidance would be awesome

# I have a datset which contains records like this

string1 <- "08103128827DP        11        513452                  131 Markett Hills Rd Coolingford XYZ 9876                                       14602012476          000000000000000000010784Y00000000000053815"
string2 <- "08203143982DP        12        611218                  12 Magicra Waters Rd Huntley XXX 9081                                           14602012476          000000000000000000010784Y00000000000038443"

# Make a dummy datset
V1 <- c(string1, string2)
myData <- data.frame(V1)
head(myData)

# I would like to split into distinct fields for each row of the typically large dataset that I have
fld_1 <- substr(myData, 0, 3)
fld_2 <- substr(myData, 4, 11)
fld_3 <- substr(myData, 12, 16)
fld_4 <- as.numeric(substr(myData, 187, 198))/100

# The field widths vary, as do the data types

Solution

  • Assuming you have exact starting and ending positions:

    # (0) Make a dummy dataset
    string1 <- "08103128827DP        11        513452                  131 Markett Hills Rd Coolingford XYZ 9876                                       14602012476          000000000000000000010784Y00000000000053815"
    string2 <- "08203143982DP        12        611218                  12 Magicra Waters Rd Huntley XXX 9081                                           14602012476          000000000000000000010784Y00000000000038443"
    V1 <- c(string1, string2)
    
    # (1) Define positions and variable names
    pos <- list("Var 1" = c(0, 13),
                "Var 2" = c(22, 23),
                "Var 3" = c(32, 37))
    
    # (2) Extract variables as text
    vars <- lapply(pos, function(x) {
      substr(V1, x[1], x[2])
    })
    
    # (3) Assign classes
    class(vars[["Var 2"]]) <- "numeric"
    class(vars[["Var 3"]]) <- "numeric"