How to fill gap between two characters with regex


I have a data set like the one below. I would like to replace all dots between two 1's with 1's, as shown in desired.result. Can I do this with a regex in base R?

I tried:

regexpr("^1\\.1$", my.data$my.string, perl = TRUE)

Here is a solution in C#: Characters between two exact characters

Thank you for any suggestions.

my.data <- read.table(text='
     my.string                           state
     ................1...............1.    A
     ......1..........................1    A
     .............1.....2..............    B
     ......1.................1...2.....    B
     ....1....2........................    B
     1...2.............................    C
     ..........1....................1..    C
     .1............................1...    C
     .................1...........1....    C
     ........1....2....................    C
     ......1........................1..    C
     ....1....1...2....................    D
     ......1....................1......    D
     .................1...2............    D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

desired.result <- read.table(text='
     my.string                           state
     ................11111111111111111.    A
     ......1111111111111111111111111111    A
     .............1.....2..............    B
     ......1111111111111111111...2.....    B
     ....1....2........................    B
     1...2.............................    C
     ..........1111111111111111111111..    C
     .111111111111111111111111111111...    C
     .................1111111111111....    C
     ........1....2....................    C
     ......11111111111111111111111111..    C
     ....111111...2....................    D
     ......1111111111111111111111......    D
     .................1...2............    D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

Solution

  • Below is an option using gsub with the \G feature and lookaround assertions.

    > gsub('(?:\\G(?!^)|\\.*1(?=\\.+1))\\K\\.', '1', my.data$my.string, perl = TRUE)
    # [1] "................11111111111111111." "......1111111111111111111111111111"
    # [3] ".............1.....2.............." "......1111111111111111111...2....."
    # [5] "....1....2........................" "1...2............................."
    # [7] "..........1111111111111111111111.." ".111111111111111111111111111111..."
    # [9] ".................1111111111111...." "........1....2...................."
    # [11] "......11111111111111111111111111.." "....111111...2...................."
    # [13] "......1111111111111111111111......" ".................1...2............"
    

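    The \G feature is an anchor that can match at one of two positions: the start of the string or the position where the previous match ended. Since you do not want to replace the dots at the start of the string, the negative lookahead in \G(?!^) excludes the start-of-string position, so \G only continues from a previous match. The other alternative, \.*1(?=\.+1), starts a fill at a 1 only when the lookahead finds one or more dots followed by another 1.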

    The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included. Here that means only the single dot after \K is replaced, not the 1 or the dots matched before it.
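    To see the pieces in isolation, here is a small sketch on made-up strings (the strings and the variable names k_demo, g_demo and full are only for illustration and are not part of the data above). It builds the pattern up in three steps: \K alone, then \G(?!^) added, then the lookahead from the answer.

    ```r
    # \K: the "1" must be present for a match, but only the dot after it is replaced.
    k_demo <- sub("1\\K\\.", "1", "..1..2..", perl = TRUE)
    # "..11.2.."

    # \G(?!^): after the first dot is replaced, each following dot can match too,
    # because it starts exactly where the previous match ended.
    g_demo <- gsub("(?:\\G(?!^)|1)\\K\\.", "1", "..1..2..", perl = TRUE)
    # "..1112.."

    # The lookahead (?=\.+1) only lets a fill start when another 1 follows the dots.
    full <- gsub("(?:\\G(?!^)|\\.*1(?=\\.+1))\\K\\.", "1",
                 c("..1..1..", "..1..2.."), perl = TRUE)
    # "..1111.." "..1..2.."
    ```

    Without the lookahead, the fill runs from a 1 up to the next non-dot character, whatever it is; with it, rows where the 1 is followed by a 2 are left alone.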

    You can find an overall breakdown that explains the regular expression here.
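
If the \G construct feels too opaque, a different base R route is possible. This is a sketch of an alternative technique, not the approach from the answer above: it locates each run of "1 followed by dots" that is directly followed by another 1, using gregexpr, and overwrites the run with 1's of the same length through regmatches<-. The function name fill_between_ones and the variable names are made up for this example; the three sample rows are copied from my.data.

```r
fill_between_ones <- function(x) {
  # Find each "1" plus the dots after it, but only when another "1" follows.
  m <- gregexpr("1\\.+(?=1)", x, perl = TRUE)
  # Overwrite every such run with a run of 1s of the same length.
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(s) strrep("1", nchar(s)))
  x
}

sample_rows <- c("................1...............1.",
                 ".............1.....2..............",
                 "....1....1...2....................")
filled <- fill_between_ones(sample_rows)
print(filled)
```

Because the closing 1 is only checked in a lookahead and not consumed, it can also open the next run, so a row with three or more 1's separated by dots is filled across all of them. Note that strrep requires R 3.3.0 or later.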