
Truncation and merging values in two character vectors


I have a character vector V1:

V1 <- c("377 Peninsula St. Ogden,UT",
        "8532 West Lyme St. Chesterfield, VA",
        "43 E. Hilltop Street Hilliard,OH",
        "95 Newcastle St. Hendersonville,NC",
        "7276 Rose St. Greenville,NC")

and another vector, V2:

V2 <- c(84404,23832,43026,28792,27834)

Now I have these conditions:

1) Break each item in V1 at the 24th character:


a) If the 24th character is a comma, break the string there and add the remainder to the corresponding string in V2. E.g. V1 has "377 Peninsula St. Ogden,UT", which has a comma at index 24, so it is split into "377 Peninsula St. Ogden" and "UT" (note that the comma itself is dropped). V1 keeps the "377 Peninsula St. Ogden" part and the remainder is prepended to the corresponding PIN in V2, so "84404" in V2 becomes "UT 84404".

b) If the 24th character is neither a comma nor whitespace, find the last whitespace before the comma in V1. V1 keeps everything up to that index and the remainder goes to V2. E.g. V1 has "8532 West Lyme St. Chesterfield, VA", which has "t" at index 24, so it is broken at the whitespace after "St.": V1 keeps "8532 West Lyme St." and V2 gets "Chesterfield, VA 23832".
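
To make rules a) and b) concrete, the positions involved can be inspected on the two example strings. This is only an illustrative check, not a full solution; the names `a`, `b`, `spaces`, and `cut` are made up for the example.

```r
a <- "377 Peninsula St. Ogden,UT"
b <- "8532 West Lyme St. Chesterfield, VA"

# Rule a): the 24th character of `a` is a comma, so split around index 24
substr(a, 24, 24)         # ","
substr(a, 1, 23)          # "377 Peninsula St. Ogden"
substr(a, 25, nchar(a))   # "UT"

# Rule b): the 24th character of `b` is a letter, so look for the
# last space within the first 24 characters and split there instead
substr(b, 24, 24)         # "t"
spaces <- gregexpr(" ", substr(b, 1, 24), fixed = TRUE)[[1]]
cut <- max(spaces)        # 19, the space after "St."
substr(b, 1, cut - 1)         # "8532 West Lyme St."
substr(b, cut + 1, nchar(b))  # "Chesterfield, VA"
```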


By the end of the operations we should have:

V1 <- c("377 Peninsula St. Ogden","8532 West Lyme St.",...)
V2 <- c("UT 84404","Chesterfield, VA 23832")

EDIT:

I tried the following function on V1 to check whether the 24th character is a comma:

unlist(lapply(lapply(V1, function(z){substr(z,24,24)}),function(y){y==","}))

which returns:

TRUE FALSE FALSE FALSE FALSE

Now that I have solved one part of the problem, I need a way to apply the formatting logic based on the result above.

i.e. I want to do:

unlist(lapply(lapply(V1, function(z){substr(z,24,24)}),function(y){if(y==","){something1} else if(y==" "){something2}else {something3}}))

Here something1/2/3 come from 1a and 1b above. I need to know how to write this logic.


Solution

  • Consider the following, which uses the vectorized functions ifelse, substr, and regexpr (i.e., no apply loops):

    newV1 <- ifelse(substr(V1, 24, 24) == ",",         # CONDITIONALLY CHECK 24TH CHARACTER
                    substr(V1, 1, regexpr(",", V1)-1), # EXTRACT UNTIL 24TH CHARACTER
                    substr(V1, 1, 
                           regexpr(" (?=[^ ]+$)", 
                                   substr(V1, 1, 24), 
                                   perl=TRUE)-1)     # EXTRACT UNTIL LAST SPACE BEFORE 24TH CHAR
                    )
    newV1
    # [1] "377 Peninsula St. Ogden" "8532 West Lyme St."     
    # [3] "43 E. Hilltop Street"    "95 Newcastle St."       
    # [5] "7276 Rose St."        
    
    newV2 <- paste(ifelse(substr(V1, 24, 24) == ",",   # CONDITIONALLY CHECK 24TH CHARACTER
                   substr(V1, regexpr(",", V1)+1, 
                          nchar(V1)),                  # EXTRACT AFTER 24TH CHARACTER
                   substr(V1, 
                          regexpr(" (?=[^ ]+$)", 
                                  substr(V1, 1, 24), 
                                  perl=TRUE)+1, 
                          nchar(V1))),               # EXTRACT AFTER LAST SPACE BEFORE 24TH CHAR
                   V2)                               # PASTE V2 VECTOR ELEMENTWISE
    newV2
    # [1] "UT 84404"                "Chesterfield, VA 23832" 
    # [3] "Hilliard,OH 43026"       "Hendersonville,NC 28792"
    # [5] "Greenville,NC 27834"   
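
The same logic can also be sketched as a small helper that computes the cut position once and reuses it for both vectors. This is a variation on the code above, not part of the original answer: `split_address()` and its `pos` argument are hypothetical names, and the handling of a space at position 24 (cut at that space) is an assumption, since the question does not define that case. The sketch also assumes each string contains at least one space within its first 24 characters.

```r
V1 <- c("377 Peninsula St. Ogden,UT",
        "8532 West Lyme St. Chesterfield, VA",
        "43 E. Hilltop Street Hilliard,OH",
        "95 Newcastle St. Hendersonville,NC",
        "7276 Rose St. Greenville,NC")
V2 <- c(84404, 23832, 43026, 28792, 27834)

# Hypothetical helper: splits each address at `pos` (if it holds a comma)
# or at the last space within the first `pos` characters (otherwise).
split_address <- function(addr, pin, pos = 24) {
  ch   <- substr(addr, pos, pos)
  head <- substr(addr, 1, pos)
  # Last space in `head`; `[^ ]*` (not `[^ ]+`) also matches a space that
  # sits exactly at `pos` (assumed behaviour for the whitespace case)
  last_sp <- as.integer(regexpr(" (?=[^ ]*$)", head, perl = TRUE))
  cut <- ifelse(ch == ",", pos, last_sp)
  keep <- substr(addr, 1, cut - 1)             # part that stays in V1
  rest <- substr(addr, cut + 1, nchar(addr))   # part that moves to V2
  list(V1 = keep, V2 = paste(rest, pin))
}

res <- split_address(V1, V2)
res$V1
res$V2
```

On the sample data this should reproduce newV1 and newV2 from the code above, while evaluating the condition and the regular expression only once per string.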
    
