How would I go about splitting a string of values that contains both x and y coordinates into two subsets using a pattern?


I am trying to figure out the best way to split a string of values. Each string is a series of xy pixel coordinates that together form a polygon, but I can't find a way to split the string into two subsets: one with all the x coordinates and one with all the y coordinates for each polygon.

This is the current format:

polygonID Points
1 [x1,y1,x2,y2,x3,y3...]
2 [x1,y1,x2,y2,x3,y3...]

Example of values:

[[1057.97, 338.98, 1069.53, 322.73, ...]]
[[x1, y1, x2, y2, ...]]

As you can see, the first two values form an xy pair. I would therefore need to take the first value and then every other value after it to get all the x coordinates, and do the same starting from the second value to get all the y coordinates, creating two columns of points.

(side note: the length of coordinate points per polygon varies)

Ultimately, what I want is two lists, which would look like this:

polygonID X_coords Y_coords
1 [x1,x2,x3,...] [y1,y2,y3,...]
2 [x1,x2,x3,...] [y1,y2,y3,...]

I have looked at options with stringr and dplyr, but I have not found a good solution (I also don't have any code worked out just yet as I am trying to gain any insight first). Any and all help is appreciated. Thanks :)


Solution

  • OK, so I'm by no means an expert and I might be overcomplicating things, but I think I have a working answer. If I understand correctly, the column "Points" of your data frame (which I will call df) is a character column.

    Then:

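    library(dplyr)  # mutate(), select(), %>%
    library(purrr)  # map()

    df %>%
      # strip the brackets, then split on commas (",\\s*" also drops the space after each comma)
      mutate(Points = strsplit(gsub("\\[|\\]", "", Points), ",\\s*"),
             xcoord = paste0("[", sapply(map(Points, ~ .x[c(TRUE, FALSE)]), paste, collapse = ","), "]"),
             ycoord = paste0("[", sapply(map(Points, ~ .x[c(FALSE, TRUE)]), paste, collapse = ","), "]")) %>%
      select(-Points)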
    

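    I start by removing the brackets in your column "Points" with gsub(), then split the strings with strsplit(), which turns Points into a list column with one character vector per polygon. If you want to keep the original Points column, assign that result to a different column name instead.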
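
    To make the first step concrete, here is a minimal sketch of the bracket removal and splitting on a single string. The value of pts is a made-up sample in the format shown in the question, and trimws() is an extra step I am adding here to deal with the spaces after the commas:

    ```r
    # Hypothetical sample string in the format shown in the question
    pts <- "[[1057.97, 338.98, 1069.53, 322.73]]"

    # Remove every "[" and "]" character
    no_brackets <- gsub("\\[|\\]", "", pts)
    no_brackets
    # "1057.97, 338.98, 1069.53, 322.73"

    # strsplit() returns a list with one character vector per input string
    parts <- strsplit(no_brackets, ",")
    parts[[1]]
    # "1057.97" " 338.98" " 1069.53" " 322.73"   (note the leading spaces)

    # trimws() drops the whitespace left over from splitting ", " on ","
    clean <- trimws(parts[[1]])
    clean
    # "1057.97" "338.98" "1069.53" "322.73"
    ```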

    Then, in order:

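    1. I select every other element of each vector in Points by mapping over the column and indexing with a logical vector, c(TRUE, FALSE) for x and c(FALSE, TRUE) for y, which R recycles along the full length of the vector (inspired by this post: Select every other element from a vector)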
    2. I use sapply to paste together the list from point 1 with "," as a separator.
    3. I'm not sure if you need the brackets but I use paste0 to paste brackets before and after the result from point 2.
    4. I use select() to remove the Points column (if not needed anymore).
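
    If it helps, the steps above can also be tried end to end on a small example. This is only a sketch under my own assumptions: the data frame below is invented sample data, it uses base R only, and instead of pasting the values back into bracketed strings it keeps the coordinates as numeric list columns (named X_coords and Y_coords after the question), which may or may not be what you need:

    ```r
    # Hypothetical sample data in the format described in the question
    df <- data.frame(
      polygonID = c(1, 2),
      Points = c("[[1057.97, 338.98, 1069.53, 322.73]]",
                 "[[10, 20, 30, 40, 50, 60]]"),
      stringsAsFactors = FALSE
    )

    # Strip brackets, split on commas, and convert each piece to numeric
    # (as.numeric() ignores the leading spaces left by splitting on ",")
    coords <- lapply(strsplit(gsub("\\[|\\]", "", df$Points), ","), as.numeric)

    # c(TRUE, FALSE) is recycled along each vector and keeps positions 1, 3, 5, ...
    # c(FALSE, TRUE) keeps positions 2, 4, 6, ...
    df$X_coords <- lapply(coords, function(p) p[c(TRUE, FALSE)])
    df$Y_coords <- lapply(coords, function(p) p[c(FALSE, TRUE)])

    df$X_coords[[2]]  # 10 30 50
    df$Y_coords[[2]]  # 20 40 60
    ```

    Because the polygons can have different numbers of points, list columns are a natural fit here: each row simply holds a vector of whatever length that polygon needs.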