r for-loop dplyr automation traveling-salesman

Use R to Efficiently Order Randomly Generated Transects

Problem

I looking for a way to efficiently order randomly selected sampling transects occur around a fixed object. These transects, once generated, need to be ordered in a way that makes sense spatially such that the distance traveled is minimized. This would be done by ensuring that the end point of the current transect is as close as possible to the start point of the next transect. Also, none of the transects can be repeated.

Because there are thousands of transects to order, and this is a very tedious task to do manually, and I am trying to use R to automate this process. I have already generated the transects, each having a start and end point whose location is indicated using a 360-degree system (e.g., 0 is North, 90 is East, 180 is South, and 270 is West). I have also generated some code that seems to indicate the start point and the ID for the next transect, but there are a few problems with this code: (1) it can generate errors depending on the start and end points being considered, (2) it doesn't achieve what I ultimately need it to achieve, and (3) as is, the code itself seems overly complicated and I can't help but wonder if there is a more straightforward way to do this.

Ideally, the code would result in the transects being reordered such they match the order that they should be flown rather than the order that they were initially input.

The Data

For simplicity, let's pretend there are just 10 transects to order.

# Transect ID for the start point
StID <- c(seq(1, 10, 1))

# Location of transect start point, based on a 360-degree circle
StPt <- c(342.1, 189.3, 116.5, 67.9, 72, 208.4, 173.2, 97.8, 168.7, 138.2)

# Transect ID for the end point
EndID <- c(seq(1, 10, 1))

# Location of transect start point, based on a 360-degree circle
EndPt <- c(122.3, 313.9, 198.7, 160.4, 166, 26.7, 312.7, 273.7, 288.8, 287.5)

# Dataframe
df <- cbind.data.frame(StPt, StID, EndPt, EndID)

What I Have Tried

Please feel free to ignore this code, there has to be a better way and it does not really achieve the intended outcome. Right now I am using a nested for-loop that is very difficult to intuitively follow but represents my best attempt thus far.

# Create two new columns that will be populated using a loop
df$StPt_Next <- NA
df$ID_Next <- NA

# Also create a list to be populated as end and start points are matched
used <- c(df$StPt[1]) #puts the start point of transect #1 into the used vector since we will start with 1 and do not want to have it used again

# Then, for every row in the dataframe...
for (i in seq(1,length(df$EndPt)-1, 1)){ # Selects all rows except the last one as the last transect should have no "next" transect
  # generate some print statements to indicate that the script is indeed running while you wait....
  print(paste("######## ENDPOINT", i, ":", df$EndPt[i], " ########"))
  print(paste("searching for a start point that fits criteria to follow this endpoint",sep=""))
  # sequentially select each end point
  valueEndPt <- df[i,1]
  # and order the index by taking the absolute difference of end and start points and, if this value is greater than 180, also subtract from 360 so all differences are less than 180, then order differences from smallest to largest  
  orderx <- order(ifelse(360-abs(df$StPt-valueEndPt) > 180, 
                         abs(df$StPt-valueEndPt),
                         360-abs(df$StPt-valueEndPt)))
  tmp <- as.data.frame(orderx)
  # specify index value
  index=1
  # for as long as there is an "NA" present in the StPt_Next created before for loop...  
  while (is.na(df$StPt_Next[i])) {
    #select the value of the ordered index in sequential order     
    j=orderx[index]
    # if the start point associated with a given index is present in the list of used values...
    if (df$StPt[j] %in% used){
      # then have R print a statement indicate this is the case...
      print(paste("passing ",df$StPt[j], " as it has already been used",sep=""))
      # and move onto the next index
      index=index+1
      # break statement intended to skip the remainder of the code for values that have already been used      
      next
      # if the start point associated with a given index is not present in the list of used values...      
    } else {
      # then identify the start point value associated with that index ID... 
      valueStPt <- df$StPt[j]
      # and have R print a statement indicating an attempt is being made to use the next value      
      print(paste("trying ",df$StPt[j],sep=""))
      # if the end transect number is different from the start end transect number...
      if (df$EndID[i] != df$StID[j]) { 
        # then put the start point in the new column...
        df$StPt_Next[i] <- df$StPt[j]
        # note which record this start point came from for ease of reference/troubleshooting...
        df$ID_Next[i] <- j
        # have R print a statement that indicates a value for the new column has beed selected...        
        print(paste("using ",df$StPt[j],sep=""))
        # and add that start point to the list of used ones
        used <- c(used,df$StPt[j])
        # otherwise, if the end transect number matches the start end transect number...
      } else {
        # keep NA in this column and try again
        df$StPt_Next[i] <- NA
        # and indicate that this particular matched pair can not be used
        print(paste("cant use ",valueStPt," as the column EndID (related to index in EndPt) and StID (related to index in StPt) values are matching",sep=""))
      }# end if else statement to ensure that start and end points come from different transects
      # and move onto the next index
      index=index+1
    }# end if else statement to determine if a given start point still needs to be used
  }# end while loop to identify if there are still NA's in the new column
}# end for loop

The Output

When the code does not produce an explicit error, such as for the example data provided, the output is as follows:

    StPt StID EndPt EndID StPt_Next ID_Next
1  342.1    1 122.3     1      67.9       4
2  189.3    2 313.9     2     173.2       7
3  116.5    3 198.7     3      97.8       8
4   67.9    4 160.4     4      72.0       5
5   72.0    5 166.0     5     116.5       3
6  208.4    6  26.7     6     189.3       2
7  173.2    7 312.7     7     168.7       9
8   97.8    8 273.7     8     138.2      10
9  168.7    9 288.8     9     208.4       6
10 138.2   10 287.5    10        NA      NA

The last two columns were generated by the code and added to the original dataframe. StPt_Next has the location of the next closest start point and ID_Next indicates the transectID associated with that next start point location. The ID_Next column indicates that the order transects should be flown is as follows 1,4,5,3,8,10,NA (aka. the end), and 2,7,9,6 form their own loop that goes back to 2.

There are two specific problems that I can't solve:

(1) There is a problem of forming one continuous chain of sequence. I think this is related to having 1 be the starting transect and 10 being the last transect no matter what, but not knowing how to indicate in the code that the second to last transect must match up with 10 so that the sequence includes all 10 transects before terminating at an "NA" representing the final end point.

(2) To really automate this process, after fixing the early termination of the sequence due to the premature introduction of the "NA" as the ID_next, a new column would be made that would allow the transects to be reordered based on the most efficient progression rather than the original order of their EndID/StartID.

Intended Outcome

If we pretend that we only had 6 transects to order and ignore the 4 that were not able to be ordered due to the premature introduction of the "NA", this would be the intended outcome:

    StPt StID EndPt EndID StPt_Next ID_Next TransNum
1  342.1    1 122.3     1      67.9       4        1
4   67.9    4 160.4     4      72.0       5        2
5   72.0    5 166.0     5     116.5       3        3
3  116.5    3 198.7     3      97.8       8        4
8   97.8    8 273.7     8     138.2      10        5
10 138.2   10 287.5    10        NA      NA        6

EDIT: A Note About the Error Message Explicitly Produced by the Code

As indicated earlier, the code has a few flaws. Another flaw is that it will often produce an error when trying to order a larger number of transects. I am not entirely sure at what step in the process the error is generated, but I am guessing that it is related to the inability to match up the last few transects, possibly due to not meeting the criteria set forth by "orderx". The print statements say "trying NA" instead of a start point in the database, which results in this error: "Error in if (df$EndID[i] != df$StID[j]) { : missing value where TRUE/FALSE needed". I am guessing that there would need to be another if-else statement that somehow indicates "if the remaining points do not meet the orderx criteria, then just force them to match up with whatever transect remains so that everything is assigned a StPt_Next and ID_Next".

Here is a larger dataset that will generate the error:

EndPt <- c(158.7,245.1,187.1,298.2,346.8,317.2,74.5,274.2,153.4,246.7,193.6,302.3,6.8,359.1,235.4,134.5,111.2,240.5,359.2,121.3,224.5,212.6,155.1,353.1,181.7,334,249.3,43.9,38.5,75.7,344.3,45.1,285.7,155.5,183.8,60.6,301,132.1,75.9,112,342.1,302.1,288.1,47.4,331.3,3.4,185.3,62,323.7,188,313.1,171.6,187.6,291.4,19.2,210.3,93.3,24.8,83.1,193.8,112.7,204.3,223.3,210.7,201.2,41.3,79.7,175.4,260.7,279.5,82.4,200.2,254.2,228.9,1.4,299.9,102.7,123.7,172.9,23.2,207.3,320.1,344.6,39.9,223.8,106.6,156.6,45.7,236.3,98.1,337.2,296.1,194,307.1,86.6,65.5,86.6,296.4,94.7,279.9)

StPt <- c(56.3,158.1,82.4,185.5,243.9,195.6,335,167,39.4,151.7,99.8,177.2,246.8,266.1,118.2,358.6,357.9,99.6,209.9,342.8,106.5,86.4,35.7,200.6,65.6,212.5,159.1,297,285.9,300.9,177,245.2,153.1,8.1,76.5,322.4,190.8,35.2,342.6,8.8,244.6,202,176.2,308.3,184.2,267.2,26.6,293.8,167.3,30.5,176,74.3,96.9,186.7,288.2,62.6,331.4,254.7,324.1,73.4,16.4,64,110.9,74.4,69.8,298.8,336.6,58.8,170.1,173.2,330.8,92.6,129.2,124.7,262.3,140.4,321.2,34,79.5,263,66.4,172.8,205.5,288,98.5,335.2,38.7,289.7,112.7,350.7,243.2,185.4,63.9,170.3,326.3,322.9,320.6,199.2,287.1,158.1)

EndID <- c(seq(1, 100, 1))

StID <- c(seq(1, 100, 1))

df <- cbind.data.frame(StPt, StID, EndPt, EndID)

Any advice would be greatly appreciated!

Solution

Thanks everyone for the suggestions, @Simon's solution was most tailored to my OP. @Geoffrey's actual approach of using x,y coordinates was great as it allows for the plotting of the transects and the travel order. Thus, I am posting a hybrid solution that was generated using code by both of them and well as additional comments and code to detail the process and get to the actual end result I was aiming for. I am not sure if this will help anyone in the future but, since there was no answer that provided a solution that solved my problem 100% of the way, I thought I'd share what I came up with.

As others have noted, this is a type of traveling salesperson problem. It is asymmetric because the distance from the end of transect "t" to the beginning of transect "t+1" is not the same as the distance from the end transect "t+1" to the start of transect "t". Also, it is a "path" solution rather than a "cycle" solution.

#=========================================
# Packages
#=========================================
library(TSP)
library(useful)
library(dplyr)

#=========================================
# Full dataset for testing
#=========================================
EndPt <- c(158.7,245.1,187.1,298.2,346.8,317.2,74.5,274.2,153.4,246.7,193.6,302.3,6.8,359.1,235.4,134.5,111.2,240.5,359.2,121.3,224.5,212.6,155.1,353.1,181.7,334,249.3,43.9,38.5,75.7,344.3,45.1,285.7,155.5,183.8,60.6,301,132.1,75.9,112,342.1,302.1,288.1,47.4,331.3,3.4,185.3,62,323.7,188,313.1,171.6,187.6,291.4,19.2,210.3,93.3,24.8,83.1,193.8,112.7,204.3,223.3,210.7,201.2,41.3,79.7,175.4,260.7,279.5,82.4,200.2,254.2,228.9,1.4,299.9,102.7,123.7,172.9,23.2,207.3,320.1,344.6,39.9,223.8,106.6,156.6,45.7,236.3,98.1,337.2,296.1,194,307.1,86.6,65.5,86.6,296.4,94.7,279.9)

StPt <- c(56.3,158.1,82.4,185.5,243.9,195.6,335,167,39.4,151.7,99.8,177.2,246.8,266.1,118.2,358.6,357.9,99.6,209.9,342.8,106.5,86.4,35.7,200.6,65.6,212.5,159.1,297,285.9,300.9,177,245.2,153.1,8.1,76.5,322.4,190.8,35.2,342.6,8.8,244.6,202,176.2,308.3,184.2,267.2,26.6,293.8,167.3,30.5,176,74.3,96.9,186.7,288.2,62.6,331.4,254.7,324.1,73.4,16.4,64,110.9,74.4,69.8,298.8,336.6,58.8,170.1,173.2,330.8,92.6,129.2,124.7,262.3,140.4,321.2,34,79.5,263,66.4,172.8,205.5,288,98.5,335.2,38.7,289.7,112.7,350.7,243.2,185.4,63.9,170.3,326.3,322.9,320.6,199.2,287.1,158.1)

EndID <- c(seq(1, 100, 1))

StID <- c(seq(1, 100, 1))

df <- cbind.data.frame(StPt, StID, EndPt, EndID)

#=========================================
# Convert polar coordinates to cartesian x,y data 
#=========================================
# Area that the transect occupy in space only used for graphing
planeDim <- 1
# Number of transects
nTransects <- 100

# Convert 360-degree polar coordinates to x,y cartesian coordinates to facilitate calculating a distance matrix based on the Pythagorean theorem
EndX <- as.matrix(pol2cart(planeDim, EndPt, degrees = TRUE)["x"])
EndY <- as.matrix(pol2cart(planeDim, EndPt, degrees = TRUE)["y"])
StX <- as.matrix(pol2cart(planeDim, StPt, degrees = TRUE)["x"])
StY <- as.matrix(pol2cart(planeDim, StPt, degrees = TRUE)["y"])

# Matrix of x,y pairs for the beginning ("b") and end ("e") points of each transect
b <- cbind(c(StX), c(StY))
e <- cbind(c(EndX), c(EndY))

#=========================================
# Function to calculate the distance from all endpoints in the ePts matrix to a single beginning point in bPt
#=========================================
dist <- function(ePts, bPt) {
  # Use the Pythagorean theorem to calculate the hypotenuse (i.e., distance) between every end point in the matrix ePts to the point bPt
  apply(ePts, 1, function(p) sum((p - bPt)^2)^0.5)
}

#=========================================
# Distance matrix
#=========================================
# Apply the "dist" function to all beginning points to create a matrix that has the distance between every start and endpoint
## Note: because this is an asymmetric traveling salesperson problem, the distance matrix is directional, thus, the distances at any position in the matrix must be the distance from the transect shown in the row label and to the transect shown in the column label
distMatrix <- apply(b, 1, FUN = dist, ePts = e)

## Set the distance between the beginning and end of each transect to zero so that there is no "cost" to walking the transect
diag(distMatrix) <- 0

#=========================================
# Solve asymmetric TSP
#=========================================
# This creates an instance of the asymmetric traveling salesperson (ASTP) 
atsp <- as.ATSP(distMatrix)
# This creates an object of Class Tour that travels to all of the points 
## In this case, the repetitive_nn produces the smallest overall and transect-to-transect
tour <- solve_TSP(atsp, method = "repetitive_nn")


#=========================================
# Create a path by cutting the tour at the most "expensive" transition 
#=========================================
# Sort the original data frame to match the order of the solution
dfTour = df[as.numeric(tour),]

# Add the following columns to the original dataframe: 
dfTour = dfTour %>%
  # Assign visit order (1 to 100, ascending) 
  mutate(visit_order = 1:nrow(dfTour)) %>%
  # The ID of the next transect to move to
  mutate(next_StID = lead(StID, order_by = visit_order),
  # The angle of the start point for the next transect
         next_StPt = lead(StPt, order_by = visit_order))

# lead() generates the NA's in the last record for next_StID, next_StPt, replace these by adding that information
dfTour[dfTour$visit_order == nrow(dfTour),'next_StID'] <-
  dfTour[dfTour$visit_order == 1,'StID']

dfTour[dfTour$visit_order == nrow(dfTour),'next_StPt'] <-
  dfTour[dfTour$visit_order == 1,'StPt']

# Function to calculate distance for 360 degrees rather than x,y coordinates
transect_distance <- function(end,start){
  abs_dist = abs(start-end)
  ifelse(360-abs_dist > 180, abs_dist, 360-abs_dist)
}

# Compute distance between end of each transect and start of next using polar coordinates
dfTour = dfTour %>% mutate(dist_between = transect_distance(EndPt, next_StPt))

# Identify the longest transition point for breaking the cycle
min_distance <- sum(dfTour$dist_between) - max(dfTour$dist_between)
path_start <- dfTour[dfTour$dist_between == max(dfTour$dist_between), 'next_StID']
path_end <- dfTour[dfTour$dist_between == max(dfTour$dist_between), 'EndID']

# Make a statement about the least cost path
print(sprintf("minimum cost path = %.2f, starting at node %d, ending at node %d",
              min_distance, path_start, path_end))

# The variable path shows the order in which you should visit the transects
path <- cut_tour(tour, path_start, exclude_cut = F) 

# Arrange df from smallest to largest travel distance
tmp1 <- dfTour %>%
  arrange(dist_between)

# Change dist_between and visit_order to NA for transect with the largest distance to break cycle 
# (e.g., we will not travel this distance, this represents the path endpoint) 
tmp1[length(dfTour$dist_between):length(dfTour$dist_between),8] <- NA
tmp1[length(dfTour$dist_between):length(dfTour$dist_between),5] <- NA

# Set df order back to ascending by visit order
tmp2 <- tmp1 %>%
  arrange(visit_order)

# Detect the break in a sequence of visit_order introduced by the NA (e.g., 1,2,3....5,6) and mark groups before the break with 0 and after the break with 1 in the "cont_per" column 
tmp2$cont_per <- cumsum(!c(TRUE, diff(tmp2$visit_order)==1))
# Sort "cont_per" such that the records following the break become the beginning of the path and the ones following the break represent the middle orders and the point with the NA being assigned the last visit order, and assign a new visit order
tmp3 <- tmp2%>%
  arrange(desc(cont_per))%>%
  mutate(visit_order_FINAL=seq(1, length(tmp2$visit_order), 1))

# Datframe ordered by progression of transects
trans_order <- cbind.data.frame(tmp3[2], tmp3[1], tmp3[4], tmp3[3], tmp3[6], tmp3[7], tmp3[8], tmp3[10])
# Insert NAs for "next" info for final transect
trans_order[nrow(trans_order),'next_StPt'] <- NA 
trans_order[nrow(trans_order), 'next_StID'] <- NA

#=========================================
# View data
#=========================================
head(trans_order)

#=========================================
# Plot
#=========================================
#For fun, we can visualize the transects:
# make an empty graph space
plot(1,1, xlim = c(-planeDim-0.1, planeDim+0.1), ylim = c(-planeDim-0.1, planeDim+0.1), ty = "n")

# plot the beginning of each transect as a green point, the end as a red point,
and a grey line representing the transect
for(i in 1:nrow(e)) {
  xs = c(b[i,1], e[i,1])
  ys = c(b[i,2], e[i,2])
  lines(xs, ys, col = "light grey", lwd = 1, lty = 1)
  points(xs, ys, col = c("green", "red"), pch = 1, cex = 1)
  #text((xs), (ys), i)
}

# Add the path to the visualization
for(i in 1:(length(path)-1)) {
    # This makes a line between the x coordinates for the end point of path i and beginning point of path i+1 
    lines(c(e[path[i],1], b[path[i+1],1]), c(e[path[i],2], b[path[i+1], 2]), lty = 1, lwd=1)
}

This is what the end result looks like