r regex string extract stringr

Extract non-specific names from address strings, ignoring specific patterns


I have long addresses, some of which contain general building names in various positions, and I am trying to extract those names. I have worked out how to extract the more standardised parts of the addresses, but I am stuck on getting the general names out.

Example data:

addresses <- c("big fake plaza, 12 this street, district, city",
               "Green mansion, district, city",
               "Block 7c of orange building  district, city",
               "98 main street block a blue plaza, city",
               "tower 10, caribbean coast, district",
               "block 3a, the latitude, city",
               "blue red mansion, 46 pearl street, city",
               "dorsett hotel, city",
               "block 9, Willowland, disctrict, city",
               "tower 2, the coronation, 1 fake street, district")

# the code below works on a data frame column
df <- data.frame(address = addresses)

The goal is to extract the non-specific building names, and only those. My plan was to extract words that are not followed by a generic building word (building, mansion, garden, house), and to ignore any block or tower names.

What I have:

df$add.gen<-str_extract(df$address,"^[^block|^tower](([a-z]+\\s+[a-z]*\\s*[a-z]*\\s*[a-z]*\\s*[a-z]*))(?!building)(?!mansion)(?!garden)(?!house)")

But it is clearly not working.

What I am aiming to get:

df$add.gen <- c(NA,
                NA,
                NA,
                NA,
                "caribbean coast",
                "the latitude",
                NA,
                "dorsett hotel",
                "Willowland",
                "the coronation")

Thanks in advance!!


Solution

  • You can use

    df$add.gen <- trimws(str_extract(df$address, "(?i)(?<=,|^)(?:(?!\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b)[^,])*(?=,|$)"))
    


    Details:

    • (?i) - matching is case insensitive
    • (?<=,|^) - immediately on the left, there must be a comma or start of string
    • (?:(?!\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\b)[^,])* - any character other than a comma, zero or more occurrences (as many as possible), as long as that character does not start one of the following whole words: city, disctrict, district, street, plaza, square, tower, block, mansion, garden, house, building
    • (?=,|$) - immediately on the right, there must be a comma or end of string.
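The same filtering idea can be spelled out as explicit steps in base R, which may make the pattern easier to read. This is only a sketch of an alternative technique (split at commas, then filter the segments), not the single-regex approach itself; `banned` and `first_generic_free` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: the regex logic written out step by step.
# `banned` holds the same whole-word list as the pattern above.
banned <- "\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b"

first_generic_free <- function(address) {
  # 1. Split the address at commas and trim each segment.
  parts <- trimws(strsplit(address, ",", fixed = TRUE)[[1]])
  # 2. Keep only segments that contain none of the banned whole words.
  keep <- parts[!grepl(banned, parts, ignore.case = TRUE, perl = TRUE)]
  # 3. Return the first surviving segment, or NA if nothing survives.
  if (length(keep) == 0) NA_character_ else keep[1]
}

first_generic_free("tower 10, caribbean coast, district")  # "caribbean coast"
first_generic_free("Green mansion, district, city")        # NA
```

To apply it to a whole column, something like `vapply(df$address, first_generic_free, character(1), USE.NAMES = FALSE)` should work.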

    The trimws() call is necessary to remove leading/trailing spaces: a match that starts right after a comma includes the space that follows it.
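For reference, a self-contained run of this solution on the ten sample addresses from the question might look like the sketch below. It assumes the stringr package is installed, and it writes the sample vector with every element quoted and comma-separated; `pattern` and `raw` are names introduced here for readability.

```r
library(stringr)

df <- data.frame(address = c(
  "big fake plaza, 12 this street, district, city",
  "Green mansion, district, city",
  "Block 7c of orange building  district, city",
  "98 main street block a blue plaza, city",
  "tower 10, caribbean coast, district",
  "block 3a, the latitude, city",
  "blue red mansion, 46 pearl street, city",
  "dorsett hotel, city",
  "block 9, Willowland, disctrict, city",
  "tower 2, the coronation, 1 fake street, district"
))

pattern <- "(?i)(?<=,|^)(?:(?!\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b)[^,])*(?=,|$)"

# Raw matches keep the space that follows the comma, e.g. " caribbean coast"
raw <- str_extract(df$address, pattern)

# trimws() strips that leading space; NA entries stay NA
df$add.gen <- trimws(raw)

df$add.gen
# NA NA NA NA "caribbean coast" "the latitude" NA
# "dorsett hotel" "Willowland" "the coronation"
```

Note that str_extract() returns only the first matching segment per address, so an address with two qualifying segments would yield just the earlier one.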