
How to count CAPSLOCK in string using R


Each row of src$Review contains text in Russian. I want to count the CAPSLOCK in each row. For example, in "My apple is GREEN" I want to count not every capital letter but only the letters written in CAPSLOCK (so "GREEN" counts and the "M" in "My" does not). In other words, a letter should count only if at least two consecutive characters are uppercase.

Now I have the following code in my script:

capscount <- str_count(src$Review, "[А-Я]")

It counts the total number of capital letters. I need only the number of characters that are in CAPSLOCK, meaning that characters are counted only when at least two consecutive letters in a word (e.g., "GR" in "GREEN") are uppercase.

Thank you in advance.


Solution

  • The pattern you are looking for is "\\b[A-Z]{2,}\\b". It matches runs of two or more consecutive capital letters with a word boundary, \\b, on each side, so only words written entirely in capitals are counted. That is the overall structure; substitute the Russian alphabet where necessary.

    #test string. A correct count should be 1 0 2
    x <- c("My GREEN", "My Green", "MY GREEN")
    
    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b")
    #[1] 1 0 2
    
    library(stringi)
    stri_count(x, regex="\\b[A-Z]{2,}\\b")
    #[1] 1 0 2
    
    #base R
    # gregexpr() returns -1 when there is no match, so count positive positions
    sapply(gregexpr("\\b[A-Z]{2,}\\b", x), function(m) sum(m > 0))
    #[1] 1 0 2
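
For Russian text, one way to fill in the alphabet is the Unicode property \p{Lu} (any uppercase letter), which the ICU regex engine behind stringr supports. Unlike the range [А-Я], it also covers Ё. This is a minimal sketch; the vector `reviews` and the name `caps_pattern` are made up for illustration and stand in for src$Review.

```r
library(stringr)

# Hypothetical sample data standing in for src$Review
reviews <- c("Моё яблоко ЗЕЛЁНОЕ", "Моё яблоко зелёное", "МОЁ ЯБЛОКО")

# \p{Lu} matches any uppercase letter, Cyrillic included;
# {2,} requires at least two of them, \b anchors at word boundaries
caps_pattern <- "\\b\\p{Lu}{2,}\\b"

caps_words <- str_count(reviews, caps_pattern)
caps_words
#[1] 1 0 2
```

With the question's data, the same call would presumably be str_count(src$Review, caps_pattern).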
    

    Update

    If you would like character counts for each matched word:

    sapply(str_match_all(x, "\\b[A-Z]{2,}\\b"), nchar)
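
Since the question asks for the total number of CAPSLOCK characters per row, a possible variant sums the lengths of the matches for each string. This sketch swaps in str_extract_all() for str_match_all(), because it returns plain character vectors that are easier to sum; `caps_chars` is an illustrative name.

```r
library(stringr)

x <- c("My GREEN", "My Green", "MY GREEN")

# Extract every all-caps word, then add up the word lengths per input string
caps_chars <- vapply(
  str_extract_all(x, "\\b[A-Z]{2,}\\b"),
  function(words) sum(nchar(words)),
  integer(1)
)
caps_chars
#[1] 5 0 7
```

Strings without any match yield 0, so the result has one number per row and could be stored as a new column.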