Tags: r, regex, substring, backreference, capturing-group

R sub with back reference not replacing properly


I am attempting to extract a string from some file names to use as a variable later.

The file names look like this:

c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
> dput(sample(vote_files, size = 25))
c("./Vote/Лианозово vote 2.xls", "./Vote/Зюзино vote 1.xls", 
"./Vote/Восточное Дегунино vote 2.xls", "./Vote/Аэропорт vote 2.xls", 
"./Vote/Академический vote 1.xls", "./Vote/Замоскворечье в городе Москве vote 1.xls", 
"./Vote/Обручевский vote 2.xls", "./Vote/Даниловский vote 3.xls", 
"./Vote/Нагатино-Садовники vote 1.xls", "./Vote/Ново-Переделкино в городе Москве vote 1.xls", 
"./Vote/Кунцево vote 2.xls", "./Vote/Текстильщики в городе Москве vote 2.xls", 
"./Vote/Южное Медведково vote 1.xls", "./Vote/Западное Дегунино vote 2.xls", 
"./Vote/Хамовники vote 1.xls", "./Vote/Крюково vote 1.xls", "./Vote/Беговой vote 1.xls", 
"./Vote/Восточный vote 1.xls", "./Vote/Богородское vote 2.xls", 
"./Vote/Некрасовка vote 2.xls", "./Vote/Косино-Ухтомский vote 1.xls", 
"./Vote/Лосиноостровский vote 3.xls", "./Vote/Хорошевский vote 2.xls", 
"./Vote/Бирюлево Западное vote 2.xls", "./Vote/Гольяново vote 3.xls"
)

I am attempting to extract the Russian text between /Vote/ and vote #.xls, using sub as follows:

sub(x= string, pattern = ".*((?<=.//Vote//).*(?=vote)).*", replacement = "\\1", perl = T)

I have to use lookarounds because the string I want to extract is sometimes more than one word. However, despite the capturing group appearing to capture the right text when I verify on an online regex tester, the sub call just returns the exact same string I put in.

What's the issue here? Alternatively, is there a simpler way to do this?


Solution

  • As mentioned in the comments under the question, your regular expression would work if the double slashes were single slashes. Although not mentioned there, 'vote' should also be replaced with ' vote' (with a space before it); otherwise the captured name keeps a trailing space.
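
    The original call returns its input unchanged because the file names contain no double slashes, so the pattern never matches and sub has nothing to replace. A minimal sketch of the corrected call, assuming the file names are held in a character vector x (here a hypothetical three-element sample taken from the question):

    ```r
    # Hypothetical sample of the file names shown in the question
    x <- c("./Vote/Арбат vote 1.xls",
           "./Vote/Восточное Дегунино vote 2.xls",
           "./Vote/Замоскворечье в городе Москве vote 1.xls")

    # Single slashes in the lookbehind, and a space before "vote" in the lookahead
    res <- sub(pattern = ".*((?<=./Vote/).*(?= vote)).*",
               replacement = "\\1", x = x, perl = TRUE)
    print(res)
    ```

    The multi-word names come through whole because the inner .* is greedy and stops only where " vote" follows.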

    Regarding a simpler way to do it: basename extracts the file name part, after which we can replace the space followed by vote, and everything after it, with the empty string:

    sub(" vote.*", "", basename(x))
    

    giving:

     [1] "Лианозово"                        "Зюзино"                          
     [3] "Восточное Дегунино"               "Аэропорт"                        
     [5] "Академический"                    "Замоскворечье в городе Москве"   
     [7] "Обручевский"                      "Даниловский"                     
     [9] "Нагатино-Садовники"               "Ново-Переделкино в городе Москве"
    [11] "Кунцево"                          "Текстильщики в городе Москве"    
    [13] "Южное Медведково"                 "Западное Дегунино"               
    [15] "Хамовники"                        "Крюково"                         
    [17] "Беговой"                          "Восточный"                       
    [19] "Богородское"                      "Некрасовка"                      
    [21] "Косино-Ухтомский"                 "Лосиноостровский"                
    [23] "Хорошевский"                      "Бирюлево Западное"               
    [25] "Гольяново"                       
    

    Update: The answer now handles phrases with embedded spaces.
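
Lookarounds are not strictly required for multi-word names, since a greedy capture group bounded by literal text also spans spaces. A sketch of that variant, using a hypothetical sample vector x and assuming every file name ends in " vote", a number, and ".xls":

```r
# Hypothetical sample of the file names shown in the question
x <- c("./Vote/Арбат vote 1.xls",
       "./Vote/Нагатино-Садовники vote 1.xls",
       "./Vote/Текстильщики в городе Москве vote 2.xls")

# Capture everything between "/Vote/" and " vote <number>.xls"
res <- sub("^.*/Vote/(.*) vote [0-9]+\\.xls$", "\\1", x)
print(res)
```

Any element that does not fit the assumed naming scheme is returned unchanged, so checking the inputs with grepl and the same pattern first may be worthwhile.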