Search code examples
rformatdata-cleaning

How to reformat similar text for merging in R?


I am working with the NYC open data, and I am wanting tho merge two data frames based on community board. The issue is, the two data frames have slightly different ways of representing this. I have provided an example of the two different formats below.

CommunityBoards <- data.frame(FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
                              FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
                                            "BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))

Along with the issue of the placement of the numbers and the "#", the second data frame shortens "COMMUNITY BOARD" to "COMMUNITY BD" just on Staten Island. I don't have a strong preference of what string looks like, so long as I can discern borough and community board number. What would be the easiest way to reformat one or both of these strings so I could merge the two sets?

Thank you for any and all help!


Solution

  • You can use regex to get out just the district numbers. For the first format, the only thing that matters is the begining of the string before the space, hence you could do

    districtsNrs1 <- as.numeric(gsub("(\\d+) .*","\\1",CommunityBoards$FormatOne))
    

    For the second, I assume that the formats look like "something HASHTAG number", hence you could do

    districtsNrs2 <- as.numeric(gsub(".* #(\\d+)","\\1",CommunityBoards$FormatTwo))
    

    to get the pure district numbers.

    Now you know how to extract the district numbers. With that information, you can name/reformat the district-names how you want.

    To know which district number is which district, you can create a translation data.frame between the districts and numbers like

    districtNumberTranslations <- data.frame(
      districtNumber = districtsNrs2,
      districtName = sapply(strsplit(CommunityBoards$FormatTwo," COMMUNITY "),"[[",1)
    )
    

    giving

    #  districtNumber  districtName
    #1              1         BRONX
    #2              5        QUEENS
    #3             15      BROOKLYN
    #4              3 STATEN ISLAND