I have long addresses, some of which contain general building names in various positions, and I am trying to extract those names. I have worked out how to extract the more standardised parts of the addresses but am stuck on getting the general names out.
Example data:
addresses <- c("big fake plaza, 12 this street, district, city",
"Green mansion, district, city",
"Block 7c of orange building district, city",
"98 main street block a blue plaza, city",
"tower 10, caribbean coast, district",
"block 3a, the latitude, city",
"blue red mansion, 46 pearl street, city",
"dorsett hotel, city",
"block 9, Willowland, disctrict, city",
"tower 2, the coronation, 1 fake street, district")
The goal is to extract the non-specific building names, and only those. The plan in the code was to extract words that are not followed by a generic building word (building, mansion, garden, house), as well as to ignore any block or tower names.
What I have:
df$add.gen <- str_extract(df$address, "^[^block|^tower](([a-z]+\\s+[a-z]*\\s*[a-z]*\\s*[a-z]*\\s*[a-z]*))(?!building)(?!mansion)(?!garden)(?!house)")
But it is clearly not working.
What I'm aiming to get:
df$add.gen <- c(NA,
NA,
NA,
NA,
"caribbean coast",
"the latitude",
NA,
"dorsett hotel",
"Willowland",
"the coronation")
Thanks in advance!!
You can use
df$add.gen <- trimws(str_extract(df$address, "(?i)(?<=,|^)(?:(?!\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b)[^,])*(?=,|$)"))
See the regex demo
Details:
(?i) - matching is case insensitive
(?<=,|^) - immediately on the left, there must be a comma or start of string
(?:(?!\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\b)[^,])* - any char but a comma, zero or more occurrences (as many as possible), that is not a starting char of any of the following whole words: city, disctrict, district, street, plaza, square, tower, block, mansion, garden, house, building
(?=,|$) - immediately on the right, there must be a comma or end of string.
The trimws call is necessary to remove leading/trailing spaces, since a match that starts right after a comma keeps the space that follows it.
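As a rough end-to-end check, the pattern can be run against the sample addresses from the question. This is only a sketch: the question never shows how df is built, so the data.frame construction below (with a single address column) is an assumption, and the pattern is stored in a variable named pattern purely for readability.

```r
library(stringr)

# Sample data from the question (the "disctrict" typo is kept on purpose,
# because the pattern handles it via disc?trict).
addresses <- c("big fake plaza, 12 this street, district, city",
               "Green mansion, district, city",
               "Block 7c of orange building district, city",
               "98 main street block a blue plaza, city",
               "tower 10, caribbean coast, district",
               "block 3a, the latitude, city",
               "blue red mansion, 46 pearl street, city",
               "dorsett hotel, city",
               "block 9, Willowland, disctrict, city",
               "tower 2, the coronation, 1 fake street, district")

# Assumption: df is a data frame with one character column named address.
df <- data.frame(address = addresses, stringsAsFactors = FALSE)

# A comma-delimited segment matches only if none of the listed generic
# words starts anywhere inside it.
pattern <- "(?i)(?<=,|^)(?:(?!\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b)[^,])*(?=,|$)"

# str_extract returns the first matching segment, or NA if none qualifies;
# trimws then drops the space left over after the comma.
df$add.gen <- trimws(str_extract(df$address, pattern))
print(df$add.gen)
```

Only the first qualifying segment of each address is returned, so an address with two non-generic segments would yield just the leftmost one. Rows in which every segment contains a listed word, such as "blue red mansion, 46 pearl street, city", should come back as NA.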