r regex substring backreference capturing-group

R sub with back reference not replacing properly

I am attempting to extract a string from some file names to use as a variable later.

The file names look like this:

c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
> dput(sample(vote_files, size = 25))
c("./Vote/Лианозово vote 2.xls", "./Vote/Зюзино vote 1.xls", 
"./Vote/Восточное Дегунино vote 2.xls", "./Vote/Аэропорт vote 2.xls", 
"./Vote/Академический vote 1.xls", "./Vote/Замоскворечье в городе Москве vote 1.xls", 
"./Vote/Обручевский vote 2.xls", "./Vote/Даниловский vote 3.xls", 
"./Vote/Нагатино-Садовники vote 1.xls", "./Vote/Ново-Переделкино в городе Москве vote 1.xls", 
"./Vote/Кунцево vote 2.xls", "./Vote/Текстильщики в городе Москве vote 2.xls", 
"./Vote/Южное Медведково vote 1.xls", "./Vote/Западное Дегунино vote 2.xls", 
"./Vote/Хамовники vote 1.xls", "./Vote/Крюково vote 1.xls", "./Vote/Беговой vote 1.xls", 
"./Vote/Восточный vote 1.xls", "./Vote/Богородское vote 2.xls", 
"./Vote/Некрасовка vote 2.xls", "./Vote/Косино-Ухтомский vote 1.xls", 
"./Vote/Лосиноостровский vote 3.xls", "./Vote/Хорошевский vote 2.xls", 
"./Vote/Бирюлево Западное vote 2.xls", "./Vote/Гольяново vote 3.xls"
)

I am trying to extract the Russian text between /Vote/ and vote #.xls using sub, as follows:

sub(x= string, pattern = ".*((?<=.//Vote//).*(?=vote)).*", replacement = "\\1", perl = T)

I have to use lookarounds because the string I want to extract is sometimes more than one word. However, despite the capturing group appearing to capture the right text when I verify on an online regex tester, the sub call just returns the exact same string I put in.

What's the issue here? Alternatively, is there a simpler way to do this?

Solution

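As mentioned in the comments under the question, your regular expression would work if the double slashes were single slashes. A / is not a special character in R regular expressions, so // matches two literal slashes, which never occur in your file names. Because the pattern then fails to match, sub returns each input string unchanged. Although not mentioned in the comments, 'vote' should also be replaced with ' vote' (with a space before it); otherwise the captured name keeps a trailing space.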
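A minimal sketch of that fix, assuming a UTF-8 R session. The vector `x` here is a hypothetical three-element sample built from file names in the question, not the full `vote_files` vector. The first call reproduces the behaviour described in the question, and the second applies the corrected pattern.

```r
# Hypothetical sample input: three of the file names from the question
x <- c("./Vote/Арбат vote 1.xls",
       "./Vote/Южное Медведково vote 1.xls",
       "./Vote/Алексеевский в городе Москве vote 2.xls")

# Original pattern: "//" requires two literal slashes, so nothing matches
# and sub() hands back the input untouched
original <- sub(".*((?<=.//Vote//).*(?=vote)).*", "\\1", x, perl = TRUE)
identical(original, x)

# Corrected pattern: single slashes, plus a space before "vote" so the
# captured name carries no trailing blank
fixed <- sub(".*((?<=./Vote/).*(?= vote)).*", "\\1", x, perl = TRUE)
fixed
```

The lookarounds are kept only to stay close to the original attempt; the capturing group alone would be enough to pick out the name.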

As for a simpler way to do it: basename extracts the file name part of each path, after which we can replace the space followed by vote, and everything after it, with the empty string:

sub(" vote.*", "", basename(x))

giving:

 [1] "Лианозово"                        "Зюзино"                          
 [3] "Восточное Дегунино"               "Аэропорт"                        
 [5] "Академический"                    "Замоскворечье в городе Москве"   
 [7] "Обручевский"                      "Даниловский"                     
 [9] "Нагатино-Садовники"               "Ново-Переделкино в городе Москве"
[11] "Кунцево"                          "Текстильщики в городе Москве"    
[13] "Южное Медведково"                 "Западное Дегунино"               
[15] "Хамовники"                        "Крюково"                         
[17] "Беговой"                          "Восточный"                       
[19] "Богородское"                      "Некрасовка"                      
[21] "Косино-Ухтомский"                 "Лосиноостровский"                
[23] "Хорошевский"                      "Бирюлево Западное"               
[25] "Гольяново"
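If some file names could contain " vote" somewhere other than the suffix, a stricter pattern may be safer. The sketch below is one possible variant, not part of the original answer. It assumes every file name ends in ` vote <number>.xls`, as in the names shown in the question, and `x` is again a small hypothetical sample.

```r
# Hypothetical sample input taken from the file names in the question
x <- c("./Vote/Нагатино-Садовники vote 1.xls",
       "./Vote/Бирюлево Западное vote 2.xls",
       "./Vote/Гольяново vote 3.xls")

# Drop the directory, then remove " vote <digits>.xls" only at the very end
districts <- sub(" vote [0-9]+\\.xls$", "", basename(x))
districts

# The vote number can be recovered the same way with a capturing group
vote_no <- as.integer(sub(".* vote ([0-9]+)\\.xls$", "\\1", basename(x)))
vote_no
```

Anchoring with `$` means a name that does not follow the expected suffix is left unchanged, which makes unexpected files easy to spot.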

Update: Revised to handle names with embedded spaces.