
Extract Cyrillic letters from string


The function below checks whether a string contains any letters.

is.letter <- function(x) grepl("[[:alpha:]]", x)

I want to build a similar function that does the same for Cyrillic letters only.

Update:

With the code provided by Wiktor Stribiżew, I get the following results:

grepl("\\p{Cyrillic}", x, perl=TRUE)

test[, c(2, 11)]
      MOBILE_NUMBER contain_cyrlic
    1  НЕМА ТЕЛЕФОН          FALSE
    2      НЕПОЗНАТ          FALSE
    3  НЕМА ТЕЛЕФОН          FALSE
    4  НЕМА ТЕЛЕФОН          FALSE

Any ideas?


Solution

  • stringi may give you more consistent results across platforms and systems, but both stri_detect_regex and grepl (with perl=TRUE) should do the trick:

    library(stringi)
    library(dplyr)
    
    # data_frame() is deprecated; tibble() (re-exported by dplyr) replaces it
    tst <- tibble(
      MOBILE_NUMBER = c("НЕМА ТЕЛЕФОН", "НЕПОЗНАТ", "НЕМА ТЕЛЕФОН", "НЕМА ТЕЛЕФОН")
    )
    
    tst
    ## # A tibble: 4 × 1
    ##   MOBILE_NUMBER
    ##           <chr>
    ## 1  НЕМА ТЕЛЕФОН
    ## 2      НЕПОЗНАТ
    ## 3  НЕМА ТЕЛЕФОН
    ## 4  НЕМА ТЕЛЕФОН
    
    (t1 <- mutate(tst, is_cyrillic = grepl("\\p{Cyrillic}", MOBILE_NUMBER, perl=TRUE)))
    ## # A tibble: 4 × 2
    ##   MOBILE_NUMBER is_cyrillic
    ##           <chr>       <lgl>
    ## 1  НЕМА ТЕЛЕФОН        TRUE
    ## 2      НЕПОЗНАТ        TRUE
    ## 3  НЕМА ТЕЛЕФОН        TRUE
    ## 4  НЕМА ТЕЛЕФОН        TRUE
    
    (t2 <- mutate(tst, is_cyrillic = stri_detect_regex(MOBILE_NUMBER, "\\p{Cyrillic}")))
    ## # A tibble: 4 × 2
    ##   MOBILE_NUMBER is_cyrillic
    ##           <chr>       <lgl>
    ## 1  НЕМА ТЕЛЕФОН        TRUE
    ## 2      НЕПОЗНАТ        TRUE
    ## 3  НЕМА ТЕЛЕФОН        TRUE
    ## 4  НЕМА ТЕЛЕФОН        TRUE
    
    identical(t1, t2)
    ## [1] TRUE
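
Both calls above only detect Cyrillic letters, whereas the title asks for extracting them. Here is a minimal base R sketch of the extraction, assuming the same `\\p{Cyrillic}` pattern. The helper name `extract_cyrillic` is hypothetical, and the `enc2utf8()` step rests on a guess the question does not confirm: that the FALSE results come from strings stored in a non-UTF-8 native encoding.

```r
# Hypothetical helper (not from the original answer): returns, for each
# element of x, the runs of consecutive Cyrillic letters it contains.
extract_cyrillic <- function(x) {
  # Assumption: normalising to UTF-8 first avoids spurious non-matches when
  # the input is stored in a native, non-UTF-8 encoding.
  x <- enc2utf8(x)
  # \p{Cyrillic} needs the PCRE engine, hence perl = TRUE.
  matches <- gregexpr("\\p{Cyrillic}+", x, perl = TRUE)
  # regmatches() returns a list with one character vector per input string.
  regmatches(x, matches)
}

# "НЕМА ТЕЛЕФОН 123" and "abc"; the \u escapes keep the literal UTF-8
# regardless of the source file's encoding.
x <- c("\u041d\u0415\u041c\u0410 \u0422\u0415\u041b\u0415\u0424\u041e\u041d 123", "abc")

res <- extract_cyrillic(x)
res[[1]]  # the two Cyrillic words, without the digits
res[[2]]  # empty character vector: no Cyrillic letters in "abc"
```

If a single string per input is preferred, the pieces can be joined afterwards, for example with `vapply(res, paste, character(1), collapse = "")`. With stringi, `stri_extract_all_regex(x, "\\p{Cyrillic}+")` should serve the same purpose.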