Search code examples
rzipcode

which() function in R - after sorting in descending order, issues matching with duplicate values


I'm trying to find the next closest store from a matrix of store IDs, zip codes, and long/latitude coordinates for each of the zip codes. Trouble happens when there are more than 1 store per zipcode, and the script doesn't know how to order 2 values that are identical (store x is 10 miles away, store y is 10 miles, and has trouble with the order of x and y, and is returning (c(x,y)), instead of x,y or y,x). I need to find a way to have my code figure out how to list both of them (arbituary order since they are the same distance away from the store, based on zip code).

I'm thinking there can likely be modifications to the which() function, but I'm not having any luck.

Note that all the stores run, just the 100 or so stores that have the same zipcode as another store get tripped up - I'd love to not manually go through and edit the csv.

library(data.table)
library(zipcode)
library(geosphere)
source<-read.csv("C:\\Users\\mcan\Desktop\\Projects\\Closest Store\\Site and Zip.csv",header=TRUE, sep=",") #open
zip<-source[,2] #break apart the source zip codes 
ID<-source[,1] #break apart the IDs
zip<-clean.zipcodes(zip) #clean up the zipcodes 
CleanedData<-data.frame(ID,zip) #combine the IDs and cleaned Zip codes
CleanedData<-merge(x=CleanedData,y=zipcode,by="zip",all.x=TRUE) #dataset of store IDs, zipcodes, and their long/lat positions
setDT(CleanedData) #set data frame to data table 
storeDistances <- distm(CleanedData[,.(longitude,latitude)],CleanedData[,.(longitude,latitude)]) #matrix between long/lat points of all stores in list 
colnames(storeDistances) <- rownames(storeDistances) <- CleanedData[,ID] 
whatsClosest <- function(number=1){
    apply(storeDistances,1,function(x) (colnames(storeDistances)[which(x==sort(x)[number+1])])) #sorts in descending order and picks the 2nd closest distance, matches with storeID
}
CleanedData[,firstClosestSite:=whatsClosest(1)] #looks for 1st closest store
CleanedData[,secondClosestSite:=whatsClosest(2)] #looks for 2nd closest store
CleanedData[,thirdClosestSite:=whatsClosest(3)] #looks for 3rd closest store 

Data set format:

 Classes ‘data.table’ and 'data.frame': 1206 obs. of  9 variables:
     $ zip              : Factor w/ 1182 levels "01234","02345",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ ID               : int  11111 12222 13333 10528 ...
     $ city             : chr  "Boston" "Somerville" "Cambridge" "Weston" ...
     $ state            : chr  "MA" "MA" "MA" "MA" ...
     $ latitude         : num  40.0 41.0 42.0 43.0 ...
     $ longitude        : num  -70.0 -70.1 -70.2 -70.3 -70.4 ...
    $ firstClosestSite :List of 1206
      ..$ : chr "12345"
    $ secondClosestSite :List of 1206
      ..$ : chr "12344"
    $ thirdClosestSite :List of 1206
      ..$ : chr "12343"

Issue comes with the firstClosestSite and secondClosest site, they are sorted by distance, but if the distance is the same because two stores exist in the same zipcode, the which() function (I think) doesn't know how to account for this, so you get this awkward concatenation in the CSV:

StoreID      Zip       City       State    Longitude  Latitude FirstClosestSite
11222       11000     Boston      MA       40.0       -70.0    c("11111""12222")
    
SecondClosestSite     ThirdClosestSite
c("11111"    "12222")   13333

Example of how the distance matrix is formed (store IDs in first row and column, with the matrix values being the distance between store IDs):

    11111   22222     33333   44444   55555   66666
11111   0      6000    32000   36000  28000   28000
22222   6000    0      37500   40500  32000   32000
33333   32000   37500   0      11000   6900   6900
44444   36000   40500   11000   0     8900    8900
55555   28000   32000   6900    8900    0     0
66666   28000   32000   6900    8900    0     0

Issue is the duplicates in each row... the which() doesn't know which store is closest to 11111 (either 55555 or 66666).


Solution

  • Here is my attempt at a solution. Everything up until the line with colnames(storeDistances) <- ... stays the same. After that, you should replace the code with the following:

    whatsClosestList <- sapply(as.data.frame(storeDistances), function(x) list(data.frame(distance = x, store = rownames(storeDistances), stringsAsFactors = F)))
    
    # Get the names of the stores
    # this step is necessary because lapply doesn't allow us
    # to access the list names
    storeNames = names(whatsClosestList)
    
    # Iterate through each store's data frame using storeNames
    # and delete the distance to itself
    whatsClosestListRemoveSelf <- lapply(storeNames, function(name) {
      df <- whatsClosestList[[name]]
      df <- df[!df$store == name,]
    })
    
    # The previous step got rid of the store names in the list,
    # so we add them again here
    names(whatsClosestListRemoveSelf) <- storeNames
    
    whatsClosestOrderedList <- lapply(whatsClosestListRemoveSelf, function(df) { df[order(df$distance),] })
    
    whatsClosestTopThree <- lapply(whatsClosestOrderedList, function(df) { df$store[1:3] })
    
    firstClosestSite <- lapply(whatsClosestTopThree, function(x) { x[1]} )
    secondClosestSite <- lapply(whatsClosestTopThree, function(x) { x[2]} )
    thirdClosestSite <- lapply(whatsClosestTopThree, function(x) { x[3]} )
    
    CleanedData[,firstClosestSite:=firstClosestSite] #looks for 1st closest store in list
    CleanedData[,secondClosestSite:=secondClosestSite] #looks for 2nd closest store in list 
    CleanedData[,thirdClosestSite:=thirdClosestSite] #looks for 3rd closest store in list
    

    Basically, instead of searching only for the (first, second, third) closest site, I create a list of dataframes for each store with the distance to all other stores. I then order these dataframes, and extract the three closest stores, which sometimes include ties (if tied, they're ordered by the name of the store). Then you only need to extract a list with the firstClosestSite, secondClosestSite, etc., for each store, and that's why you use in the search in CleanedData. Hope it works!