Conditional updating coordinate column in dataframe

I am trying to populate two new, empty columns in a data frame with values from other columns in the same data frame, in different ways depending on whether those other columns are populated.

The columns are HIGH_PRCN_LAT and HIGH_PRCN_LON (previously called F_Lat and F_Lon), which hold the final latitude and longitude for each row. Their values are based on the other columns in the table.

Case 1: Lat/Lon2 are populated (as in IDs 1 & 2). The midpoint between the two points should be calculated with the great circle algorithm and placed into F_Lat & F_Lon.

Case 2: Lat/Lon2 are empty. The values of Lat/Lon1 should be put into F_Lat and F_Lon (as in IDs 3 & 4).

My code is as follows but doesn't work (see previous versions, removed in an edit).

The preparatory code I am using is as follows:

incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L, 
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13", 
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46", 
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12", 
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"), 
    Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
    ), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"), 
    Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"), 
    Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
    1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L, 
    2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
    ), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L, 
    4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South", 
    "West"), class = "factor"), Weather.Category = structure(c(1L, 
    2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food", 
    "Humans", "Weather"), class = "factor"), Minutes = c(13L, 
    55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L, 
    150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727, 
    53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766, 
    52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256, 
    -2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073, 
    -1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187, 
    53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907, 
    NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984, 
    -1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))

#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA

#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])

#populate rows with two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <- 
  with(incidents [ind,,drop=FALSE],
       do.call(rbind, geosphere::midPoint(cbind(Location.1.Lon, Location.1.Lat),
                                          cbind(Location.2.Lon, Location.2.Lat))))

#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]

I will use the geosphere::midPoint function based off a recommendation here:

Unfortunately, it doesn't appear that this way of populating the column will work when there are several cases.

The current error that is thrown is:

Error in `$<`(`*tmp*`, F_Lat, value = integer(0)) : 
  replacement has 0 rows, data has 178012

Edit: also posted to reddit:

Edit: Added clarity on the parts of the code I do not understand.

#replaces the F_Lat2/F_Lon2 columns in rows with both sets of input coordinates 
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
  with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
       do.call(rbind, geosphere::gcIntermediate(cbind(Lat1, Lon1),
                                                cbind(Lat2, Lon2), n = 1)))


  • Though your code doesn't work as written for me, and I cannot calculate the same precise values you expect, I suspect the error you're seeing can be fixed with these steps. (Data is down at the bottom here.)

    1. Pre-populate the empty columns.
    2. Pre-calculate the complete.cases step, it'll save time.
    3. Use for inside gcIntermediate.

    I'm inferring from

                   this is an error in R

    that you are binding those columns together, so I'll use (Using cbind itself produced some ignorable warnings from geosphere, so you can use it instead and perhaps suppressWarnings, but that function is a little strong in that it'll mask other warnings as well.)

    Also, since it appears you want one intermediate value for each pair of coordinates, I added the gcIntermediate(..., n=1) argument.

    The use of do.call(rbind, ...) is because gcIntermediate returns a list, so we need to bring them together.
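
    As a small base-R illustration of that last point: the list below is made up to mimic the shape I am assuming for the result (one small matrix per pair of points). It is not real gcIntermediate output, and the name pieces is just for this sketch.

```r
# Hypothetical stand-in for a list of one-row matrices, one per coordinate pair
pieces <- list(
  matrix(c(10, 20), nrow = 1, dimnames = list(NULL, c("lon", "lat"))),
  matrix(c(30, 40), nrow = 1, dimnames = list(NULL, c("lon", "lat")))
)

# do.call(rbind, pieces) is equivalent to rbind(pieces[[1]], pieces[[2]]),
# but it works for a list of any length
stacked <- do.call(rbind, pieces)
stacked
```

    The result is a single matrix with one row per list element, which is a shape that can be assigned into two data frame columns at once.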

    dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
    ind <- complete.cases(dataframe[,4])
    dataframe[ind, c("F_Lat2","F_Lon2")] <- 
      with(dataframe[ind,,drop=FALSE],
           do.call(rbind, geosphere::gcIntermediate(cbind(Lat1, Lon1),
                                                    cbind(Lat2, Lon2), n = 1)))
    dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
    #   ID     Lat1      Lon1     Lat2      Lon2    F_Lat     F_Lon   F_Lat2    F_Lon2
    # 1  1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
    # 2  2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
    # 3  3 33.02335 -5.098834       NA        NA 33.02335 -5.098834 33.02335 -5.098834
    # 4  4 54.87210 -4.400412       NA        NA 54.87210 -4.400412 54.87210 -4.400412
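
    Since the question asks what with() and drop=FALSE are doing: with(data, expr) evaluates expr using the columns of data as if they were variables, and drop=FALSE stops [ from simplifying a single-column selection to a plain vector (in the row-subsetting call above it is mostly a defensive habit). A small sketch with a made-up data frame; toy and its columns are hypothetical:

```r
toy <- data.frame(a = c(1, 2, 3), b = c(10, NA, 30))  # made-up data
keep <- complete.cases(toy[, "b"])                     # rows where b is not NA

# with(): refer to columns by bare name inside the expression;
# this is the same as toy$a[keep] + toy$b[keep]
sums <- with(toy[keep, , drop = FALSE], a + b)

# drop = FALSE: a single-column selection stays a data frame
one_col_vec <- toy[, "a"]                 # simplified to a numeric vector
one_col_df  <- toy[, "a", drop = FALSE]   # still a data frame with one column
```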

    Update, using your new incidents data and switching to geosphere::midPoint.

    Try this:

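    incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
    ind <- complete.cases(incidents[, c("Location.2.Lat","Location.2.Lon")])
    # midPoint() takes (lon, lat) pairs and returns a matrix with columns lon, lat
    incidents[ind, c("F_Lon2","F_Lat2")] <- 
      with(incidents[ind,,drop=FALSE],
           geosphere::midPoint(cbind(Location.1.Lon, Location.1.Lat),
                               cbind(Location.2.Lon, Location.2.Lat)))
    incidents[!ind, c("F_Lat2","F_Lon2")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]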

    One (big) difference is that geosphere::gcIntermediate(..., n=1) returns a list of results, whereas geosphere::midPoint(...) (no n=) returns just a matrix, so no rbinding required.
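
    If it helps to see the two cases from the question without the package, here is a rough base-R sketch. sphere_midpoint is a hypothetical helper written for this illustration and is not part of geosphere. It uses the standard spherical midpoint formula, assumes a perfect sphere, and does no wrap-around handling at the 180 degree meridian, so I would not rely on it matching geosphere::midPoint to the last digit. The toy coordinates are made up.

```r
# Hypothetical helper: great-circle midpoint on a sphere, inputs in degrees
sphere_midpoint <- function(lon1, lat1, lon2, lat2) {
  rad <- pi / 180
  l1 <- lon1 * rad; p1 <- lat1 * rad
  l2 <- lon2 * rad; p2 <- lat2 * rad
  bx <- cos(p2) * cos(l2 - l1)
  by <- cos(p2) * sin(l2 - l1)
  lat <- atan2(sin(p1) + sin(p2), sqrt((cos(p1) + bx)^2 + by^2))
  lon <- l1 + atan2(by, cos(p1) + bx)
  cbind(lon = lon / rad, lat = lat / rad)
}

# Made-up data: row 1 has both points, row 2 has only the first point
toy <- data.frame(Lon1 = c(0, 5), Lat1 = c(0, 50),
                  Lon2 = c(90, NA), Lat2 = c(0, NA))
ind <- complete.cases(toy[, c("Lon2", "Lat2")])

toy$F_Lon <- toy$F_Lat <- NA_real_
# Case 1: both points present, use the midpoint
toy[ind, c("F_Lon", "F_Lat")] <-
  with(toy[ind, , drop = FALSE], sphere_midpoint(Lon1, Lat1, Lon2, Lat2))
# Case 2: second point missing, copy the first point
toy[!ind, c("F_Lon", "F_Lat")] <- toy[!ind, c("Lon1", "Lat1")]
toy
```

    The helper returns a two-column matrix directly, so, as with midPoint, there is no list to stack before assigning.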


    dataframe <- read.table(header=T, stringsAsFactors=F, text="
    ID Lat1       Lon1       Lat2      Lon2      F_Lat       F_Lon
    1  19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
    2  58.8721035 -1.4004125 54.746791 -4.47984  56.80944725 -2.94012625
    3  33.0233529 -5.0988341 NA        NA        33.0233529  -5.0988341
    4  54.8721035 -4.4004125 NA        NA        54.8721035  -4.4004125")