Extract variable names using stringr in R


I am trying to extract some variable names and numbers from the following vector and store them into two new variables:

unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg", 
  "PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg", 
  "PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg", 
  "PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg", 
  "PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg", 
  "PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg", 
  "PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg", 
  "PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)

For the first variable, I would like to extract every character before the PMS. This covers the strings that begin with PM or PNC, together with the underscores and digits that follow them. I would like to store these results in a variable called pollutant.

Desired output:

unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10_0"

I would like to extract everything after the PMS for the second variable.

For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string. However, it would be useful to include the trailing S_Avg, A_Avg, or Avg in the extraction as well.

Here's my first attempt:

library(stringr)
model_id <- str_extract(unique_strings, "[0-9]{4,}")

unique(model_id)
[1] "5003" "7003"

I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!


Solution

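  • One approach is to split each string at "PMS" with str_split. With simplify = TRUE the result is a character matrix, m: the first column holds the text before "PMS" (the first variable) and the second column holds the text after it (the second variable). The text before "PMS" still ends in an underscore, so str_replace with the pattern "_$" removes it from the first column.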

    library(stringr)
    m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
    m[, 1] <- str_replace(m[, 1], "_$", "")
    m
    #       [,1]       [,2]        
    #  [1,] "PM_1"     "5003_S_Avg"
    #  [2,] "PM_2_5"   "5003_S_Avg"
    #  [3,] "PM_10"    "5003_S_Avg"
    #  [4,] "PM_1"     "5003_A_Avg"
    #  [5,] "PM_2_5"   "5003_A_Avg"
    #  [6,] "PM_10"    "5003_A_Avg"
    #  [7,] "PNC_0_3"  "5003_Avg"  
    #  [8,] "PNC_0_5"  "5003_Avg"  
    #  [9,] "PNC_1_0"  "5003_Avg"  
    # [10,] "PNC_2_5"  "5003_Avg"  
    # [11,] "PNC_5_0"  "5003_Avg"  
    # [12,] "PNC_10_0" "5003_Avg"  
    # [13,] "PM_1"     "7003_S_Avg"
    # [14,] "PM_2_5"   "7003_S_Avg"
    # [15,] "PM_10"    "7003_S_Avg"
    # [16,] "PM_1"     "7003_A_Avg"
    # [17,] "PM_2_5"   "7003_A_Avg"
    # [18,] "PM_10"    "7003_A_Avg"
    # [19,] "PNC_0_3"  "7003_Avg"  
    # [20,] "PNC_0_5"  "7003_Avg"  
    # [21,] "PNC_1_0"  "7003_Avg"  
    # [22,] "PNC_2_5"  "7003_Avg"  
    # [23,] "PNC_5_0"  "7003_Avg"  
    # [24,] "PNC_10_0" "7003_Avg"
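
  • As an alternative sketch (not part of the original answer), the same two pieces can probably be pulled out in one step with str_match and two capture groups. This assumes every string contains "_PMS" exactly once; the names parts, pollutant and model_id are only illustrative, and the input is shortened to three of the strings from the question.

    ```r
    library(stringr)

    # A few of the strings from the question, shortened for illustration
    unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS7003_A_Avg",
                        "PNC_10_0_PMS5003_Avg")

    # Group 1 captures everything before "_PMS";
    # group 2 captures everything after "PMS"
    parts <- str_match(unique_strings, "^(.*)_PMS(.*)$")

    # Column 1 of the result is the full match, so the groups start at column 2
    pollutant <- parts[, 2]
    model_id  <- parts[, 3]

    pollutant
    # [1] "PM_1"     "PM_2_5"   "PNC_10_0"
    model_id
    # [1] "5003_S_Avg" "7003_A_Avg" "5003_Avg"
    ```

    Because "_PMS" sits outside both groups, no trailing underscore is left to clean up afterwards. Strings that do not match the pattern would come back as NA rather than raising an error, so it may be worth checking for NA if the input is less regular than the vector shown here.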