I am trying to extract some variable names and numbers from the following vector and store them into two new variables:
unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg",
"PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg",
"PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg",
"PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg",
"PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg",
"PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg",
"PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg",
"PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)
For the first variable, I would like to extract every character before PMS. This includes the strings that begin with PM or PNC, as well as the underscores and digits that follow them. I would like to store these results in a variable called pollutant.
Desired output:
unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10_0"
For the second variable, I would like to extract everything after PMS.
For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string; however, it would be useful to include the A_Avg or S_Avg suffix in the extraction as well.
Here's my first attempt:
library(stringr)
model_id <- str_extract(unique_strings, "[0-9]{4,}")
unique(model_id)
[1] "5003" "7003"
I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!
We can use str_split to split each string on "PMS". After that, use str_replace to remove the trailing "_" from the first column. The result is a character matrix, m: the first variable is in the first column, and the second variable is in the second column.
library(stringr)
m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
m[, 1] <- str_replace(m[, 1], "_$", "")
m
#       [,1]       [,2]
#  [1,] "PM_1"     "5003_S_Avg"
#  [2,] "PM_2_5"   "5003_S_Avg"
#  [3,] "PM_10"    "5003_S_Avg"
#  [4,] "PM_1"     "5003_A_Avg"
#  [5,] "PM_2_5"   "5003_A_Avg"
#  [6,] "PM_10"    "5003_A_Avg"
#  [7,] "PNC_0_3"  "5003_Avg"
#  [8,] "PNC_0_5"  "5003_Avg"
#  [9,] "PNC_1_0"  "5003_Avg"
# [10,] "PNC_2_5"  "5003_Avg"
# [11,] "PNC_5_0"  "5003_Avg"
# [12,] "PNC_10_0" "5003_Avg"
# [13,] "PM_1"     "7003_S_Avg"
# [14,] "PM_2_5"   "7003_S_Avg"
# [15,] "PM_10"    "7003_S_Avg"
# [16,] "PM_1"     "7003_A_Avg"
# [17,] "PM_2_5"   "7003_A_Avg"
# [18,] "PM_10"    "7003_A_Avg"
# [19,] "PNC_0_3"  "7003_Avg"
# [20,] "PNC_0_5"  "7003_Avg"
# [21,] "PNC_1_0"  "7003_Avg"
# [22,] "PNC_2_5"  "7003_Avg"
# [23,] "PNC_5_0"  "7003_Avg"
# [24,] "PNC_10_0" "7003_Avg"
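If you would rather get two plain vectors than a matrix, the same split can probably be written with str_extract and regex lookarounds. This is only a sketch of an alternative, not a replacement for the approach above: sample_strings is an illustrative subset of the original vector, model_number is a hypothetical extra variable, and the patterns assume that "PMS" appears exactly once in each string and is always preceded by an underscore.

```r
library(stringr)

# Illustrative subset of the original vector (assumption: the same naming
# scheme holds for every element of unique_strings)
sample_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg",
                    "PNC_0_3_PMS7003_Avg", "PNC_10_0_PMS7003_Avg")

# Everything before "_PMS": a lazy match from the start of the string,
# stopped by a lookahead so that "_PMS" itself is not included
pollutant <- str_extract(sample_strings, "^.*?(?=_PMS)")

# Everything after "PMS": a lookbehind anchors the match just past "PMS"
model_id <- str_extract(sample_strings, "(?<=PMS).*$")

# Only the four-digit model number at the start of model_id
# (model_number is a hypothetical name, not part of the question)
model_number <- str_extract(model_id, "^[0-9]{4}")

pollutant
# [1] "PM_1"     "PM_2_5"   "PNC_0_3"  "PNC_10_0"
model_id
# [1] "5003_S_Avg" "5003_A_Avg" "7003_Avg"   "7003_Avg"
```

The lookarounds (?=...) and (?<=...) check for the surrounding text without consuming it, which is why no follow-up str_replace is needed to strip the trailing underscore.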