I'm trying to find the next closest store from a matrix of store IDs, zip codes, and long/latitude coordinates for each of the zip codes. Trouble happens when there are more than 1 store per zipcode, and the script doesn't know how to order 2 values that are identical (store x is 10 miles away, store y is 10 miles, and has trouble with the order of x and y, and is returning (c(x,y)), instead of x,y or y,x). I need to find a way to have my code figure out how to list both of them (arbituary order since they are the same distance away from the store, based on zip code).
I'm thinking there can likely be modifications to the which() function, but I'm not having any luck.
Note that all the stores run, just the 100 or so stores that have the same zipcode as another store get tripped up - I'd love to not manually go through and edit the csv.
library(data.table)
library(zipcode)
library(geosphere)
source<-read.csv("C:\\Users\\mcan\Desktop\\Projects\\Closest Store\\Site and Zip.csv",header=TRUE, sep=",") #open
zip<-source[,2] #break apart the source zip codes
ID<-source[,1] #break apart the IDs
zip<-clean.zipcodes(zip) #clean up the zipcodes
CleanedData<-data.frame(ID,zip) #combine the IDs and cleaned Zip codes
CleanedData<-merge(x=CleanedData,y=zipcode,by="zip",all.x=TRUE) #dataset of store IDs, zipcodes, and their long/lat positions
setDT(CleanedData) #set data frame to data table
storeDistances <- distm(CleanedData[,.(longitude,latitude)],CleanedData[,.(longitude,latitude)]) #matrix between long/lat points of all stores in list
colnames(storeDistances) <- rownames(storeDistances) <- CleanedData[,ID]
whatsClosest <- function(number=1){
apply(storeDistances,1,function(x) (colnames(storeDistances)[which(x==sort(x)[number+1])])) #sorts in descending order and picks the 2nd closest distance, matches with storeID
}
CleanedData[,firstClosestSite:=whatsClosest(1)] #looks for 1st closest store
CleanedData[,secondClosestSite:=whatsClosest(2)] #looks for 2nd closest store
CleanedData[,thirdClosestSite:=whatsClosest(3)] #looks for 3rd closest store
Data set format:
Classes ‘data.table’ and 'data.frame': 1206 obs. of 9 variables:
$ zip : Factor w/ 1182 levels "01234","02345",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 11111 12222 13333 10528 ...
$ city : chr "Boston" "Somerville" "Cambridge" "Weston" ...
$ state : chr "MA" "MA" "MA" "MA" ...
$ latitude : num 40.0 41.0 42.0 43.0 ...
$ longitude : num -70.0 -70.1 -70.2 -70.3 -70.4 ...
$ firstClosestSite :List of 1206
..$ : chr "12345"
$ secondClosestSite :List of 1206
..$ : chr "12344"
$ thirdClosestSite :List of 1206
..$ : chr "12343"
Issue comes with the firstClosestSite and secondClosest site, they are sorted by distance, but if the distance is the same because two stores exist in the same zipcode, the which() function (I think) doesn't know how to account for this, so you get this awkward concatenation in the CSV:
StoreID Zip City State Longitude Latitude FirstClosestSite
11222 11000 Boston MA 40.0 -70.0 c("11111""12222")
SecondClosestSite ThirdClosestSite
c("11111" "12222") 13333
Example of how the distance matrix is formed (store IDs in first row and column, with the matrix values being the distance between store IDs):
11111 22222 33333 44444 55555 66666
11111 0 6000 32000 36000 28000 28000
22222 6000 0 37500 40500 32000 32000
33333 32000 37500 0 11000 6900 6900
44444 36000 40500 11000 0 8900 8900
55555 28000 32000 6900 8900 0 0
66666 28000 32000 6900 8900 0 0
Issue is the duplicates in each row... the which() doesn't know which store is closest to 11111 (either 55555 or 66666).
Here is my attempt at a solution. Everything up until the line with colnames(storeDistances) <- ...
stays the same. After that, you should replace the code with the following:
whatsClosestList <- sapply(as.data.frame(storeDistances), function(x) list(data.frame(distance = x, store = rownames(storeDistances), stringsAsFactors = F)))
# Get the names of the stores
# this step is necessary because lapply doesn't allow us
# to access the list names
storeNames = names(whatsClosestList)
# Iterate through each store's data frame using storeNames
# and delete the distance to itself
whatsClosestListRemoveSelf <- lapply(storeNames, function(name) {
df <- whatsClosestList[[name]]
df <- df[!df$store == name,]
})
# The previous step got rid of the store names in the list,
# so we add them again here
names(whatsClosestListRemoveSelf) <- storeNames
whatsClosestOrderedList <- lapply(whatsClosestListRemoveSelf, function(df) { df[order(df$distance),] })
whatsClosestTopThree <- lapply(whatsClosestOrderedList, function(df) { df$store[1:3] })
firstClosestSite <- lapply(whatsClosestTopThree, function(x) { x[1]} )
secondClosestSite <- lapply(whatsClosestTopThree, function(x) { x[2]} )
thirdClosestSite <- lapply(whatsClosestTopThree, function(x) { x[3]} )
CleanedData[,firstClosestSite:=firstClosestSite] #looks for 1st closest store in list
CleanedData[,secondClosestSite:=secondClosestSite] #looks for 2nd closest store in list
CleanedData[,thirdClosestSite:=thirdClosestSite] #looks for 3rd closest store in list
Basically, instead of searching only for the (first, second, third) closest site, I create a list of dataframes for each store with the distance to all other stores. I then order these dataframes, and extract the three closest stores, which sometimes include ties (if tied, they're ordered by the name of the store). Then you only need to extract a list with the firstClosestSite, secondClosestSite, etc., for each store, and that's why you use in the search in CleanedData
. Hope it works!