Search code examples
raggregategroupingcontains

R: If name includes specific text, then group it


I am using Energy Performance Certificate Data to identify the heating fuel types of buildings in an area however, they are split into over 60 different subsets of 9 main fuel types. I would like add another column for fuel type so that they can be grouped by the 9 main fuel types.

An example of the relevant columns of the data is:

BuildingID <- c(1,2,3,4,5,6,7,8,9,10)

MainHeatDesc <- c("Boiler and radiators, mains gas", "Boiler and radiators, oil", "Room heaters, electric", "Room heaters, LPG", "Air source heat pump, underfloor heating, electric", "Air source heat pump, fan coil units, electric", "Ground source heat pump, mains gas", "Electric storage heaters", "Room heaters, wood logs", "Boilers and radiators, wood chips")

data <- data.frame(BuildingID, MainHeatDesc)

This is a small example taking some of the subsets of the original data. In this example I would like to create another column for main fuel type grouping them as: Mains gas, Oil, Electric, LPG and wood.

The end result should look like this:

# BuildingID            MainHeatDesc                             MainFuelType
#     1       Boiler and radiators, mains gas                      Mains gas
#     2       Boiler and radiators, oil                              Oil
#     3       Room heaters, electric                               Electric
#     4       Room heaters, LPG                                      LPG 
#     5       Air source heat pump, underfloor heating, electric   Electric
#     6       Air source heat pump, fan coil units, electric       Electric
#     7       Ground source heat pump, mains gas                   Mains Gas
#     8       Electric storage heaters                             Electric
#     9       Room heaters, wood logs                                Wood
#    10       Boilers and radiators, wood chips                      Wood

If anyone is able to help me out it would be greatly appreciated. Let me know if you have any queries or need any more information.

Thanks!


Solution

  • Similar logic as @tmfmnk but in base R using sub.

    types <- c('Mains Gas', 'Oil', 'Electric', 'LPG', 'Wood')
    data$MainFuelType <- sub(paste0(".*(?i)(", paste0("\\b", types, "\\b", 
                            collapse = "|"), ").*"), "\\1", data$MainHeatDesc)
    
    data
    #   BuildingID                                       MainHeatDesc MainFuelType
    #1           1                    Boiler and radiators, mains gas    mains gas
    #2           2                          Boiler and radiators, oil          oil
    #3           3                             Room heaters, electric     electric
    #4           4                                  Room heaters, LPG          LPG
    #5           5 Air source heat pump, underfloor heating, electric     electric
    #6           6     Air source heat pump, fan coil units, electric     electric
    #7           7                 Ground source heat pump, mains gas    mains gas
    #8           8                           Electric storage heaters     Electric
    #9           9                            Room heaters, wood logs         wood
    #10         10                  Boilers and radiators, wood chips         wood
    

    where the regex which is generated dynamically looks like :

    paste0(".*(?i)(", paste0("\\b", types, "\\b", collapse = "|"), ").*")
    #[1] ".*(?i)(\\bMains Gas\\b|\\bOil\\b|\\bElectric\\b|\\bLPG\\b|\\bWood\\b).*"
    

    where (?i) is for case-insensitivity.