I want to remove every city which is within 20km from another city, keeping the first city. I already calculated the distance between each city and the nearest public airport.
geocitylist["OSA"][,c("airport_code","cityname","tmpkey","Population","latitude","longitude","distance")]
airport_code cityname tmpkey Population latitude longitude distance
1: OSA Kishiwada jp kishiwada 205563 34.467 135.367 12.103398
2: OSA Izumi jp izumi 189087 34.483 135.433 18.389912
3: OSA Tondabayashi jp tondabayashi 132875 34.500 135.600 33.600850
4: OSA Kashihara jp kashihara 126224 34.450 135.767 47.995914
5: OSA Habikino jp habikino 121052 34.534 135.583 33.238086
6: OSA Kaizuka jp kaizuka 92633 34.450 135.350 10.036157
7: OSA Izumiotsu jp izumiotsu 80773 34.500 135.400 16.417087
8: OSA Tenri jp tenri 71054 34.583 135.833 56.642112
9: OSA Tanabe jp tanabe 69564 33.733 135.367 77.976596
10: OSA Kashiba jp kashiba 69391 34.535 135.709 44.241938
expected output (for first 10):
airport_code cityname tmpkey Population latitude longitude distance
OSA Kishiwada jp kishiwada 205563 34.467 135.367 12.103398
OSA Tondabayashi jp tondabayashi 132875 34.500 135.600 33.600850
OSA Tenri jp tenri 71054 34.583 135.833 56.642112
OSA Tanabe jp tanabe 69564 33.733 135.367 77.976596
reasons for expected output (for first 10):
airport_code cityname tmpkey Population latitude longitude distance
1: OSA Kishiwada jp kishiwada 205563 34.467 135.367 12.103398
2: --DELETED-- (6km from first surviving row, because it was already filtered out it won't be checked with all the other rows)
3: OSA Tondabayashi jp tondabayashi 132875 34.500 135.600 33.600850
4: --DELETED-- (16km from first surviving row, because it was already filtered out it won't be checked with all the other rows)
5: --DELETED-- (survived first row check but not second row check; 4km from second surviving row)
6: --DELETED-- (2km from first surviving row, because it was already filtered out it won't be checked with all the other rows)
7: --DELETED-- (5km from first surviving row, because it was already filtered out it won't be checked with all the other rows)
8: OSA Tenri jp tenri 71054 34.583 135.833 56.642112
9: OSA Tanabe jp tanabe 69564 33.733 135.367 77.976596
10: --DELETED-- (survived first row check but not second row check; 11km from second surviving row)
(the third row and fourth row were >20km from each other so both were safe)
further explanation:
for all rows of the same airport_code the function would calculate the distance between each row. As far as I'm aware it only needs forward comparing.
Here's how I made the expected output: I took the latlongs from row 1 and I plugged the pair into a distance calculator as the first pair of latlongs. For the second pair of the distance calc I plugged in the second row to see if it was closer than 20km.
The distance was 6km so row 2 failed the check.
Then I went and compared row 1 and row 3: pass.
row 1 and row 4: fail.
r1 & r5: pass.
r1 & r6: fail.
r1 & r7: fail.
r1 & r8: pass.
r1 & r9: pass.
r1 & r10: pass.
checking r1 is finished so there should be no other cities near r1, now we proceed to r2.
r2 & r3:r10: skip (r2 already failed).
now we check r3.
r3 & r4: skip (r4 already failed).
r3 & r5: fail.
r3 & r6,r7: skip (r6,r7 already failed).
r3 & r8: pass.
r3 & r9: pass.
r3 & r10: fail.
r4:r7 & r5:r10: skip (r4:r7 already failed).
r8 & r9: pass.
r8 & r10: skip (r10 already failed).
r9 & r10: skip (r10 already failed).
DONE
My idea is to put everything in a list and then have some kind of function which will identify which rows to delete.
list <- split(df[, -1], df$airport_code)
require(gmt)
lapply(list, function(x)
geodist(geocitylist$city1[i],geocitylist$city1[i],geocitylist$city2[i],geocitylist$city2[i],units="km")), something...)
I'm not sure where to go from here...
dput:
structure(list(airport_code = c("OSA", "OSA", "OSA", "OSA", "OSA",
"OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA",
"OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA",
"OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA",
"OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA", "OSA",
"OSA", "OSA", "OSA", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO", "ILO",
"ILO"), cityname = c("Kishiwada", "Izumi", "Tondabayashi", "Kashihara",
"Habikino", "Kaizuka", "Izumiotsu", "Tenri", "Tanabe", "Kashiba",
"Sennan", "Sakurai", "Hannan", "Takaishi", "Osakasayama", "Hashimoto",
"Iwade", "Kainan", "Sumoto", "Gojo", "Gose", "Tawaramoto", "Gobo",
"Kawai", "Kumano", "Haibara", "Asuka", "Awaji", "Kamitonda",
"Kawachinagano", "Kimino", "Koya", "Kozagawa-Cho", "Minabe",
"Misaki", "Nachikatsuura", "Nosegawa", "Shirahama", "Susami",
"Taiji", "Tenkawa", "Uda", "Yoshino", "Yura", "Iloilo", "Barotac Nuevo",
"Trapiche", "Tuyom", "Inayawan", "Jordan", "Alimodian", "Guimbal",
"Dingle", "Cabatuan", "Igbaras", "Pavia", "Cabano", "Patnongon",
"Ungka", "Leon", "Bulata", "Tumcon", "Caliling", "Hamtic", "Belison",
"Buray", "Cagbang", "Masaling", "Duenas", "Linaon", "Bingawan",
"Maasin", "Igang", "Cartagena", "Tiling", "Maribong", "Napnapan",
"Zarraga", "Concordia", "New Lucena", "Dao", "Aglalana", "Bugasong",
"Alibunan", "Jamabalud", "Egana", "Calaya", "Constancia", "Pakiad",
"Nueva Valencia", "Jibao-an", "Mina", "Bolilao", "San Enrique",
"Cordova", "Lawigan", "Piape", "Aganan", "Ponong", "Gines", "Leganes",
"Jaguimitan", "East Valencia", "Morobuan", "Atabayan", "Avila",
"Catungan", "Ermita", "Igcocolo", "Tiwi", "Balibagan", "Sulangan",
"Jalaud", "Tiring", "Abangay", "Guisijan", "Abilay", "Monpon",
"Aureliana", "Tigum", "Quinagaringan", "Abaca", "Mapili", "Da-an",
"Cabilauan", "Getulio", "Pina", "Oracon", "Badlan", "Lucmayan",
"Cauayan", "San Jose De Buenavista"), tmpkey = c("jp kishiwada",
"jp izumi", "jp tondabayashi", "jp kashihara", "jp habikino",
"jp kaizuka", "jp izumiotsu", "jp tenri", "jp tanabe", "jp kashiba",
"jp sennan", "jp sakurai", "jp hannan", "jp takaishi", "jp osakasayama",
"jp hashimoto", "jp iwade", "jp kainan", "jp sumoto", "jp gojo",
"jp gose", "jp tawaramoto", "jp gobo", "jp kawai", "jp kumano",
"jp haibara", "jp asuka", "jp awaji", "jp kamitonda", "jp kawachinagano",
"jp kimino", "jp koya", "jp kozagawa-cho", "jp minabe", "jp misaki",
"jp nachikatsuura", "jp nosegawa", "jp shirahama", "jp susami",
"jp taiji", "jp tenkawa", "jp uda", "jp yoshino", "jp yura",
"ph iloilo", "ph barotac nuevo", "ph trapiche", "ph tuyom", "ph inayawan",
"ph jordan", "ph alimodian", "ph guimbal", "ph dingle", "ph cabatuan",
"ph igbaras", "ph pavia", "ph cabano", "ph patnongon", "ph ungka",
"ph leon", "ph bulata", "ph tumcon", "ph caliling", "ph hamtic",
"ph belison", "ph buray", "ph cagbang", "ph masaling", "ph duenas",
"ph linaon", "ph bingawan", "ph maasin", "ph igang", "ph cartagena",
"ph tiling", "ph maribong", "ph napnapan", "ph zarraga", "ph concordia",
"ph new lucena", "ph dao", "ph aglalana", "ph bugasong", "ph alibunan",
"ph jamabalud", "ph egana", "ph calaya", "ph constancia", "ph pakiad",
"ph nueva valencia", "ph jibao-an", "ph mina", "ph bolilao",
"ph san enrique", "ph cordova", "ph lawigan", "ph piape", "ph aganan",
"ph ponong", "ph gines", "ph leganes", "ph jaguimitan", "ph east valencia",
"ph morobuan", "ph atabayan", "ph avila", "ph catungan", "ph ermita",
"ph igcocolo", "ph tiwi", "ph balibagan", "ph sulangan", "ph jalaud",
"ph tiring", "ph abangay", "ph guisijan", "ph abilay", "ph monpon",
"ph aureliana", "ph tigum", "ph quinagaringan", "ph abaca", "ph mapili",
"ph da-an", "ph cabilauan", "ph getulio", "ph pina", "ph oracon",
"ph badlan", "ph lucmayan", "ph cauayan", "ph san jose de buenavista"
), Population = c(205563, 189087, 132875, 126224, 121052, 92633,
80773, 71054, 69564, 69391, 66460, 62966, 60796, 60512, 57170,
57115, 55634, 43369, 39546, 34343, 32871, 32660, 27169, 20106,
19517, 18472, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 387748, 11641, 11539, 11117, 9963, 8255, 7302, 7232, 6171,
6106, 5974, 5928, 5812, 5810, 5598, 5172, 5151, 5067, 4840, 4816,
4711, 3863, 3856, 3829, 3705, 3669, 3657, 3514, 3468, 3396, 3332,
3308, 3258, 3253, 3091, 3059, 2893, 2868, 2828, 2798, 2742, 2723,
2713, 2713, 2689, 2681, 2677, 2665, 2651, 2620, 2588, 2571, 2543,
2522, 2513, 2476, 2463, 2461, 2444, 2442, 2362, 2349, 2318, 2304,
2297, 2269, 2263, 2262, 2239, 2232, 2207, 2196, 2195, 2189, 2172,
2168, 2153, 2139, 2104, 2085, 2077, 2063, 2054, 2045, 2041, 2024,
0, 0), latitude = c(34.467, 34.483, 34.5, 34.45, 34.534, 34.45,
34.5, 34.583, 33.733, 34.535, 34.348, 34.5, 34.333, 34.517, 34.517,
34.317, 34.25, 34.15, 34.35, 34.35, 34.45, 34.55, 33.883, 34.233,
33.904, 34.533, 34.48, 34.485, 33.691, 34.45, 34.187, 34.212,
33.536, 33.752, 34.304, 33.578, 34.118, 33.685, 33.553, 33.602,
34.269, 34.473, 34.365, 33.977, 10.697, 10.894, 10.684, 9.977,
9.9, 10.658, 10.821, 10.663, 10.999, 10.879, 10.716, 10.776,
10.587, 10.913, 10.75, 10.781, 9.86, 10.917, 9.98, 10.702, 10.838,
10.715, 10.7, 9.982, 11.067, 9.95, 11.233, 10.892, 10.916, 9.82,
9.974, 11.1, 10.708, 10.82, 10.508, 10.879, 10.515, 11.18, 11.044,
11.147, 10.879, 10.747, 10.492, 10.596, 10.7, 10.511, 10.75,
10.931, 10.862, 11.071, 10.73, 10.549, 10.729, 10.783, 11.083,
10.933, 10.787, 11.142, 10.668, 10.626, 10.683, 10.692, 10.771,
10.9, 10.69, 10.93, 10.8, 10.811, 10.893, 10.853, 10.967, 11.093,
10.733, 10.912, 10.885, 10.783, 11.122, 11.134, 11.108, 11.232,
10.861, 10.747, 10.64, 10.478, 11.141, 10.473, 9.844, 10.775),
longitude = c(135.367, 135.433, 135.6, 135.767, 135.583,
135.35, 135.4, 135.833, 135.367, 135.709, 135.268, 135.85,
135.25, 135.433, 135.563, 135.617, 135.317, 135.2, 134.9,
135.7, 135.733, 135.8, 135.15, 135.85, 136.122, 135.95, 135.82,
134.853, 135.408, 135.574, 135.491, 135.591, 135.79, 135.325,
135.159, 135.931, 135.616, 135.343, 135.479, 135.945, 135.881,
135.92, 135.862, 135.07, 122.564, 122.704, 122.432, 122.558,
122.434, 122.596, 122.431, 122.323, 122.671, 122.486, 122.266,
122.546, 122.7, 121.994, 122.55, 122.389, 122.402, 122.667,
122.481, 121.982, 121.96, 122.459, 122.499, 122.537, 122.619,
122.448, 122.567, 122.436, 122.639, 122.4, 122.654, 122.533,
122.393, 122.608, 122.55, 122.597, 121.946, 122.657, 122.066,
122.459, 122.621, 122.01, 122.626, 122.642, 122.467, 122.532,
122.5, 122.575, 122.747, 122.656, 122.401, 121.986, 121.972,
122.533, 122.626, 122.483, 122.589, 122.69, 122.71, 122.555,
122.417, 122.709, 122.015, 122.717, 122.319, 122.734, 122.517,
122.664, 122.748, 122.511, 122.65, 122.046, 122.5, 122.638,
121.977, 122.567, 122.588, 122.716, 122.739, 122.421, 122.573,
122.666, 122.638, 122.584, 122.52, 122.519, 122.383, 121.931
), distance = c(12.1033983706715, 18.3899116757047, 33.6008502207034,
47.9959144975438, 33.2380857241381, 10.0361570266128, 16.4170866375552,
56.6421123629711, 77.9765955762285, 44.2419380564569, 9.08236338629334,
56.1040561615424, 10.4929819561705, 19.9777109028803, 30.8756212480724,
36.3394509305111, 20.8088759721094, 31.0772190785225, 32.6932917835482,
42.6875344478932, 44.8846250107249, 52.7169940204742, 61.0971131463108,
59.6523210207629, 99.4873860406245, 65.7262133303869, 53.1011249132666,
36.3948919070862, 83.1987994010293, 30.3473005687047, 35.0284306100351,
39.8307891352326, 111.08591952332, 75.4085566830835, 15.7641843048534,
113.628384359301, 48.4569715473876, 82.9862748769935, 99.5353589260391,
112.155589656905, 61.030002562045, 62.1503754896085, 57.0858456256784,
52.5304774496695, 16.9678516007278, 23.9650848510695, 17.8622353497786,
95.3826087011773, 103.879664442242, 22.4451633033778, 6.93563300043443,
26.5105095696934, 26.7555548147453, 5.17234173583494, 28.0185700131311,
8.55316337771447, 35.4454504812041, 55.2125344446071, 11.1051435584871,
12.7732336937756, 108.581863736287, 21.121988580694, 94.7968710054594,
57.6888894096445, 58.2126963678581, 13.6398306134487, 14.7936754865746,
94.6852187356986, 29.3921004979045, 98.2460045794196, 45.1659952539469,
9.06298808500626, 18.3744243779213, 113.028151284642, 97.0551648543418,
29.980690030871, 17.6919106840833, 12.5955991980882, 36.6417019118572,
12.4111967318849, 69.435007170183, 42.4879227652078, 52.187645377298,
35.0905087142823, 14.8373207506573, 53.6199130632313, 40.567800302352,
30.9360522555993, 15.0583064589121, 36.0304676987285, 9.25329531036705,
14.0682148292176, 27.8678028846033, 31.84581492357, 15.2538294715758,
63.7579385401013, 58.0725944947238, 7.04357919640105, 31.3211844599149,
11.1674571270427, 11.6242282860789, 40.4839601053959, 29.9264324776362,
23.9679606967483, 18.6378718796601, 28.279040989485, 52.6660699937347,
25.5152625993047, 24.7961086588726, 28.3847007995859, 4.4854253785277,
18.7845096002248, 28.5769530854182, 2.93900551228868, 22.6670596650714,
56.7127941524606, 11.1375128973757, 18.0600792892105, 56.6455596931624,
9.77233832993428, 33.7306342063393, 41.3331620404634, 40.6417161096453,
45.0319001319029, 9.23116854868285, 21.130213848193, 26.6342321513515,
40.6723663618788, 34.3462448184846, 40.1029725617726, 110.559706361562,
61.7191590665721)), class = "data.frame", row.names = c(NA,
-132L))
edit4: it's working now :D
this code probably looks pretty bad but it works decently for my use case and it's not too slow
options("scipen"=100)
library(geosphere)
# split up data into regions
splitdt<-split(geocitylist[, -1],geocitylist$airport_code)
# function to get the total number of regions in the list
NROW(splitdt)
# function to get the total number of rows in a given region
NROW(splitdt[[1]])
## reduce cities
dat=geocitylist[FALSE,][]
currentregion=1
currentorigin=1
while (currentregion <= NROW(splitdt)){
workingregion <- as.data.frame(splitdt[[currentregion]]) ## set region
workingregion$remove = FALSE
setDT(workingregion)
plot(workingregion$longitude,workingregion$latitude)
while (currentorigin <= NROW(workingregion)) {
# choose which row to use
# as the first part of the distance formula
workingorigin <- workingregion[,c("longitude","latitude")] %>% slice(currentorigin) ## set LeadingRow city
setDT(workingorigin)
# calculate the distance from the specific row chosen
# and only keep ones which are further than 20km
workingregion<-workingregion %>% rowwise() %>% mutate(remove =
ifelse(distHaversine(c(longitude, latitude), workingorigin) != 0 & # keep workingorigin city
distHaversine(c(longitude, latitude), workingorigin) < 20000,TRUE,workingregion$remove))
# remove matched cities
workingregion <- workingregion[workingregion$remove!=TRUE,]
currentorigin = currentorigin+1
}
currentregion = currentregion+1
# save results
#dat <- workingregion
dat <- rbind(dat, workingregion, fill=TRUE)
}
plot(dat$longitude,dat$latitude)
but it doesn't work and I'm not sure why. It says subscript out of bounds
for (currentregion in seq_along(1:NROW(splitdt))){
workingregion <- splitdt[[currentregion]] ## set region
workingregion$remove = FALSE
for (currentorigin in seq_along(1:NROW(splitdt[[currentregion]]))){
# choose which row to use
# as the first part of the distance formula
LeadingRow <- workingregion[,c("longitude","latitude")] %>% slice(currentorigin) ## set LeadingRow city
workingregion <- workingregion[workingregion$remove!=TRUE,]
# calculate the distance from the specific row chosen
# and only keep ones which are further than 20km
workingregion<-workingregion %>% rowwise() %>% mutate(remove =
ifelse(distHaversine(c(longitude, latitude), LeadingRow) != 0 & # keep LeadingRow city
distHaversine(c(longitude, latitude), LeadingRow) < 20000,TRUE,workingregion$remove))
}
# save results
dat <- rbind(dat, workingregion)
}
now I just need to add loops and/or lapply I think...
# split up data into regions
splitdt<-split(geocitylist[, -1],geocitylist$airport_code)
# function to get the total number of regions in the list
NROW(splitdt)
# function to get the total number of rows in a given region
NROW(splitdt[[1]])
## loop here
workingregion <- splitdt[[1]] ## set loop iteration
workingregion$remove = FALSE
# choose which row to use as the first part of the distance formula
LeadingRow <- workingregion[,c("longitude","latitude")] %>% slice(1) ## set loop iteration
## another loop here
workingregion <- workingregion[workingregion$remove!=TRUE,]
# calculate the distance from the specific row chosen and only keep ones which are further than 20km
workingregion<-workingregion %>% rowwise() %>% mutate(remove =
ifelse(distHaversine(c(longitude, latitude), LeadingRow) != 0 & #keep LeadingRow city
distHaversine(c(longitude, latitude), LeadingRow) < 20000,TRUE,workingregion$remove))
Ah. I just realized this will undo the work of previous iterations because if it only appends then it will add rows back in which should have been removed so, rather than subsetting, I need to find a way to mark data as should be removed or perhaps even better, remove it from the list directly
I have half of an answer.
options("scipen"=100)
library(geosphere)
setDT(dt)
# split up data into regions
splitdt<-split(dt[, -1],dt$airport_code)
# function to get the total number of regions in the list
NROW(splitdt)
# function to get the total number of rows in a given region
NROW(splitdt[[1]])
# choose which row to use as the first part of the distance formula
LeadingRow = splitdt[[1]][,c("longitude","latitude")][1]
# calculate the distance from the specific row chosen and copy only ones which are further than 20km
df<-splitdt[[1]] %>% rowwise() %>% mutate(distanceFromLeadingRow = distHaversine(c(longitude, latitude), LeadingRow))
res<-subset(df,df$distanceFromLeadingRow > 20000)
now I just need to figure out how to iterate over each row and each region of all of the data and append it to a new dataframe.