I have multiple addresses I want to group together and create a tally for. However they have variation in the formats. I've geocoded the addresses and plan to group them using the geocodes however when grouping them I want to create a new variable which retains at least one version of the address (or multiple variables with each address in the group in a wide format, but I'd settle for just one variable for each group with one address retained).
Here's some example data.
address=c("big fake plaza, 12 this street,district, city",
"Green mansion, district, city",
"Block 7 of orange building district, city",
"98 main street block a blue plaza, city",
"blue red mansion, 46 pearl street, city",
"12 this street, big fake plaza, district, city",
"Green mansion, district, city",
"orange building Block 7 district, city",
"block a 98 main street blue plaza, city",
"blue red mansion, 46 pearl street, city"
"big fake plaza, district, city",
"Green mansion,city")
long =c("112.8838", "111.9154", "114.9318", "116.9318", "112.9320","111.9324",
"112.8838", "111.9154", "114.9318", "116.9318", "112.9320","111.9324",
"112.8838", "111.9154")
lat = c("21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177")
df<-cbind(address, lat, long)
Where I've got to is grouping and counting but dont know how to mutate and create a naming variable based off just one address in the group.
df_agg<- df %>%
group_by(long,lat) %>%
summarise(count = n()) %>%
mutate(bldg = ifelse(address[address==1],address, NA )) ???????
what i'd like for it to look like would be something like this
long lat count bldg
<dbl> <dbl> <int> <chr>
1 112. 21.2 3 "big fake plaza, 12 this street,district, city"
2 114. 12.2 3 "Green mansion, district, city"
3 116. 26.3 2 "98 main street block a blue plaza, city"
4 112. 23.5 2 "Block 7 of orange building district, city"
5 111. 23.5 2 "blue red mansion, 46 pearl street, city"
Obviously we cant group on the address names bcause there is varitaion between the strings. Happy to hear any other suggestions if there are better options. If we could create new variables bldg1 blgd2 ect. for each varying building name in each group that be great to but not a priority.
Thanks in advance.
You can select the first address in each location.
library(dplyr)
library(tidyr)
df %>%
group_by(long,lat) %>%
summarise(count = n(),
address = first(address)) %>%
ungroup
# long lat count address
# <chr> <chr> <int> <chr>
#1 111.9154 12.22177 3 Green mansion, district, city
#2 111.9324 23.24771 2 12 this street, big fake plaza, district, city
#3 112.8838 21.22177 3 big fake plaza, 12 this street,district, city
#4 112.9320 23.24769 2 blue red mansion, 46 pearl street, city
#5 114.9318 26.27743 2 Block 7 of orange building district, city
#6 116.9318 23.17651 2 98 main street block a blue plaza, city
If you want to create separate columns like bldg1
, bldg2
etc cast the data in wide format.
df %>%
group_by(long,lat) %>%
mutate(row = paste0('bldg', row_number()),
count = n()) %>%
ungroup %>%
pivot_wider(names_from = row, values_from = address)