Updating integer within string row for each instance in R dataframe


I have a dataframe similar to the following reproducible one in which one column contains HTML code:

ID <- c(15, 25, 90, 1, 23, 543)

HTML <- c("[demography_form][1]<div></table<text-align>[demography_form_date][1]", "<text-ali>[geography_form][1]<div></table<text-align>[geography_form_date][1]", "[social_isolation][1]<div></table<div><text-align>[social_isolation_date][1]", "<text-align>[geography_form][1]<div></table<text-align>[geography_form_date][1]", "<div>[demography_form][1]<div></table<text-align>[demography_form_date][1]", "[geography_form][1]<div></table<text-align>[geography_form_date][1]</table")

df <- data.frame(ID, HTML)

I would like to update the integers within the square brackets of the HTML column to reflect each repeat instance. For example, the second time that [demography_form] appears in the column (row 5 above), I would like the bracketed integers in that row to be 2.

What's the best way to do this? I was thinking of creating an instance column, using it to update the value in the square brackets, and then deleting it at the end. Thanks in advance.


Solution

  • Create a grouping column from the first substring inside [] in the HTML column, then replace the digits inside every [] with each row's position within its group (row_number()) using str_replace_all

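    library(dplyr)
    library(stringr)  # the group argument of str_extract() needs stringr >= 1.5.0

    df %>%
      # group rows by the first bracketed name, e.g. "demography_form"
      group_by(grp = str_extract(HTML, "\\[(\\w+)\\]", group = 1)) %>%
      # replace every [digits] token with the row's position within its group
      mutate(HTML = str_replace_all(HTML, "\\[\\d+\\]",
                                    sprintf("[%d]", row_number()))) %>%
      ungroup() %>%
      select(-grp)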
    

    Output:

    # A tibble: 6 × 2
         ID HTML                                                                           
      <dbl> <chr>                                                                          
    1    15 [demography_form][1]<div></table<text-align>[demography_form_date][1]          
    2    25 <text-ali>[geography_form][1]<div></table<text-align>[geography_form_date][1]  
    3    90 [social_isolation][1]<div></table<div><text-align>[social_isolation_date][1]   
    4     1 <text-align>[geography_form][2]<div></table<text-align>[geography_form_date][2]
    5    23 <div>[demography_form][2]<div></table<text-align>[demography_form_date][2]     
    6   543 [geography_form][3]<div></table<text-align>[geography_form_date][3]</table
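
For comparison, here is a rough base R sketch of the "instance column" idea from the question: build a temporary counter column, use it to rewrite the bracketed integers, then drop it. It is not part of the answer above. The names form and instance are just illustrative, and it assumes every row contains at least one [name] token and that the first one identifies the form.

```r
# Same sample data as in the question
ID <- c(15, 25, 90, 1, 23, 543)
HTML <- c(
  "[demography_form][1]<div></table<text-align>[demography_form_date][1]",
  "<text-ali>[geography_form][1]<div></table<text-align>[geography_form_date][1]",
  "[social_isolation][1]<div></table<div><text-align>[social_isolation_date][1]",
  "<text-align>[geography_form][1]<div></table<text-align>[geography_form_date][1]",
  "<div>[demography_form][1]<div></table<text-align>[demography_form_date][1]",
  "[geography_form][1]<div></table<text-align>[geography_form_date][1]</table"
)
df <- data.frame(ID, HTML, stringsAsFactors = FALSE)

# Grouping key: the first bracketed name in each row, e.g. "demography_form"
# (assumption: every row contains at least one [name] token)
form <- sub("^.*?\\[(\\w+)\\].*$", "\\1", df$HTML, perl = TRUE)

# Temporary helper column: running count of each key, in row order
df$instance <- ave(seq_along(form), form, FUN = seq_along)

# Replace every [digits] token in a row with that row's instance number
df$HTML <- mapply(
  function(s, n) gsub("\\[[0-9]+\\]", sprintf("[%d]", n), s),
  df$HTML, df$instance,
  USE.NAMES = FALSE
)

# Drop the helper column again
df$instance <- NULL

df
```

gsub() is not vectorised over its replacement argument, which is why the sketch loops over rows with mapply(); str_replace_all() in the dplyr version handles that vectorisation for you.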