Search code examples
rdataframestringi

Splitting column with differing syntax in R


I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.

As seen below the address is in one string.

Head(DF, 3)

Address                            Price        m2        Rooms   
Petersvej   1772900  Hoersholm     10.000       210       5        
Annasvej    2B2900   Hoersholm     15.000       230       4        
Krænsvej    125800   Lyngby C      10.000       210       5               
A Mivs Alle 119800   Hjoerring      1.300        70       3        

The syntax for the address column is: road name, road no., followed by a 4 digit postalcode and the city name (sometimes two words).

Also need to extract the postalcode.. been looking at 'stringi' package haven't been able to find any examples..

Any pointers are very much appreciated


Solution

  • 1) Using separate in tidyr separate the subfields of Address into 3 fields merging anything left over into the last and then use separate again to split off the last 4 digits in the Number column that was generated in the first separate.

    library(dplyr)
    library(tidyr)
    
    DF %>% 
      separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
      separate(Number, into = c("StreetNo", "Postal"), sep = -4)
    

    giving:

           Road StreetNo Postal      City Price  m2 Rooms      CITY
    1 Petersvej       77   2900 Hoersholm    10 210     5 Hoersholm
    2  Annasvej     121B   2900 Hoersholm    15 230     4 Hoersholm
    3  Krænsvej       12   5800  Lyngby C    10 210     5         C
    

    2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.

    DF %>% 
      mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
      separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
    

    Note

    The input DF in reproducible form is:

    DF <- 
    structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm", 
    "Krænsvej 125800   Lyngby C", "Petersvej 772900  Hoersholm"), class = "factor"), 
        Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L, 
        4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C", 
        "Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA, 
    -3L))
    

    Update

    Added and fixed (2).