I have a dataframe similar to the following reproducible one in which one column contains HTML code:
ID <- c(15, 25, 90, 1, 23, 543)
HTML <- c("[demography_form][1]<div></table<text-align>[demography_form_date][1]", "<text-ali>[geography_form][1]<div></table<text-align>[geography_form_date][1]", "[social_isolation][1]<div></table<div><text-align>[social_isolation_date][1]", "<text-align>[geography_form][1]<div></table<text-align>[geography_form_date][1]", "<div>[demography_form][1]<div></table<text-align>[demography_form_date][1]", "[geography_form][1]<div></table<text-align>[geography_form_date][1]</table")
df <- data.frame(ID, HTML)
I would like to update the integer within the square brackets of the HTML column to reflect how many times that form has appeared so far. For example, the second time that [demography_form] appears in the column, I would like the integer in the square brackets following it to be 2.
What's the best way to do this? I was thinking of creating an instance column, using it to update the value in the square brackets, and deleting it at the end. Thanks in advance.
Create a grouping column from the substring inside the first [] in the HTML column, then replace the digits inside each [] with the row's position within its group (row_number()) using str_replace_all:
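library(dplyr)
library(stringr)

df %>%
  # group rows by the first [name] token (the `group` argument needs stringr >= 1.5.0)
  group_by(grp = str_extract(HTML, "\\[(\\w+)\\]", group = 1)) %>%
  # overwrite every [digits] with the row's position within its group
  mutate(HTML = str_replace_all(HTML, "\\[(\\d+)\\]",
                                sprintf("[%d]", row_number()))) %>%
  ungroup() %>%
  select(-grp)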
-output
# A tibble: 6 × 2
ID HTML
<dbl> <chr>
1 15 [demography_form][1]<div></table<text-align>[demography_form_date][1]
2 25 <text-ali>[geography_form][1]<div></table<text-align>[geography_form_date][1]
3 90 [social_isolation][1]<div></table<div><text-align>[social_isolation_date][1]
4 1 <text-align>[geography_form][2]<div></table<text-align>[geography_form_date][2]
5 23 <div>[demography_form][2]<div></table<text-align>[demography_form_date][2]
6 543 [geography_form][3]<div></table<text-align>[geography_form_date][3]</table
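
For comparison, here is a rough base R sketch of the "instance column" idea from the question: build a key, number the repeats, rewrite the brackets, then drop the helper columns. The column names form and instance are made up for illustration, and the sketch assumes every row contains at least one [name] token and that all [digits] in a row should get the same number.

```r
# Sample data from the question.
ID <- c(15, 25, 90, 1, 23, 543)
HTML <- c("[demography_form][1]<div></table<text-align>[demography_form_date][1]",
          "<text-ali>[geography_form][1]<div></table<text-align>[geography_form_date][1]",
          "[social_isolation][1]<div></table<div><text-align>[social_isolation_date][1]",
          "<text-align>[geography_form][1]<div></table<text-align>[geography_form_date][1]",
          "<div>[demography_form][1]<div></table<text-align>[demography_form_date][1]",
          "[geography_form][1]<div></table<text-align>[geography_form_date][1]</table")
df <- data.frame(ID, HTML, stringsAsFactors = FALSE)

# Helper column 1 (hypothetical name): the first [name] token in each row.
m <- regexpr("\\[\\w+\\]", df$HTML, perl = TRUE)
stopifnot(all(m > 0))  # assumption: every row has a token
df$form <- regmatches(df$HTML, m)

# Helper column 2 (hypothetical name): running count of each token, in row order.
df$instance <- ave(seq_along(df$form), df$form, FUN = seq_along)

# Replace every [digits] in a row with that row's instance number.
df$HTML <- mapply(
  function(s, n) gsub("\\[\\d+\\]", paste0("[", n, "]"), s, perl = TRUE),
  df$HTML, df$instance, USE.NAMES = FALSE
)

# Drop the helper columns at the end.
df$form <- NULL
df$instance <- NULL
```

Keeping instance as a real column while debugging makes it easy to check the numbering before the HTML strings are overwritten.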