
Removing rows that contain above a certain share of upper-case letters in R


I have a large dataframe that consists of company identifiers and extracted phrases from newspapers. It is very messy, and I want to clean it by conditional row removing.


For this I want to remove rows in which more than 50% of the letters are upper-case.

I found this code in another post; it removes rows that consist entirely of upper-case letters:

data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$text), ]

How can I express it as a share of the total word or letter count?


Solution

  • You could do this with regular expressions, but the stringi function stri_count_charclass provides a highly optimized way to count characters that belong to a given category. The package manual documents the list of Unicode General Categories; here we use L for all letters and Lu for upper-case letters.

    Something like this should accomplish what you need:

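    library(stringi)
    
    data <- data.frame(text = c("Foo",
                                "BAr",
                                "BAZ"))
    
    # Share of upper-case letters (Lu) among all letters (L) in each row
    share_upper <- stri_count_charclass(data[["text"]], "[\\p{Lu}]") /
      stri_count_charclass(data[["text"]], "[\\p{L}]")
    
    # Keep rows where fewer than half of the letters are upper-case.
    # which() also drops rows without any letters (their share is NaN).
    data[which(share_upper < 0.5), ]
    # [1] "Foo"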
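
    If the threshold should match the question literally (drop only rows with more than 50% upper-case letters) and the data frame has more than one column, the same idea could be wrapped in a small helper. This is only a sketch: the name upper_case_share, the id column, and the choice to keep rows that contain no letters at all are assumptions made for illustration, not part of the original answer.

    ```r
    library(stringi)

    # Hypothetical helper: share of upper-case letters among all letters
    upper_case_share <- function(x) {
      n_upper  <- stri_count_charclass(x, "[\\p{Lu}]")
      n_letter <- stri_count_charclass(x, "[\\p{L}]")
      # Strings without any letter get NA instead of NaN from 0 / 0
      ifelse(n_letter == 0, NA_real_, n_upper / n_letter)
    }

    # Toy data; the id column is invented for this example
    df <- data.frame(id = 1:5,
                     text = c("Foo", "BAr", "BAZ", "ABcd", "123"),
                     stringsAsFactors = FALSE)

    share <- upper_case_share(df$text)

    # Drop rows with more than 50% upper-case letters.
    # Assumption: rows without letters (NA share) are kept.
    cleaned <- df[is.na(share) | share <= 0.5, , drop = FALSE]
    ```

    Here cleaned keeps the rows with "Foo", "ABcd" (exactly 50%) and "123". Use share < 0.5 instead if rows at exactly 50% should go too, and drop the is.na(share) term if rows without letters should be removed.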
    

    One note: I updated my answer here since I failed to point out a powerful feature of stringi in my original response. My instinctive reaction was to use [a-z] and [A-Z] to signify lower-case and upper-case characters, respectively. However, using Unicode general categories allows the solution to work for non-ASCII characters as well, as the comparison below shows.

    x = c("Foo",
          "BAr",
          "BAZ",
          "Ḟoo",
          "ḂÁr",
          "ḂÁẒ")
    stri_count_charclass(x, "[A-Z]") / stri_count_charclass(x, "[[a-z][A-Z]]")
    # [1] 0.3333333 0.6666667 1.0000000 0.0000000 0.0000000       NaN
    
    stri_count_charclass(x, "[\\p{Lu}]") / stri_count_charclass(x, "[\\p{L}]")
    # [1] 0.3333333 0.6666667 1.0000000 0.3333333 0.6666667 1.0000000
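
    For completeness, the regular-expression route mentioned at the top can be sketched in base R without stringi: delete every character outside the class and count what is left. The helper name count_class is made up for this example, and the sketch deliberately uses the ASCII-only classes [A-Z] and [A-Za-z], so it has the same limitation with accented letters as the first comparison above.

    ```r
    # Hypothetical helper: count the characters of x that match a bracket class
    # by deleting everything else and measuring the remaining length
    count_class <- function(x, class) {
      nchar(gsub(paste0("[^", class, "]"), "", x))
    }

    x_ascii <- c("Foo", "BAr", "BAZ")

    # ASCII-only share of upper-case letters among all letters
    ascii_share <- count_class(x_ascii, "A-Z") / count_class(x_ascii, "A-Za-z")

    # Keep strings where at most half of the letters are upper-case
    x_ascii[which(ascii_share <= 0.5)]
    ```

    This avoids the extra dependency, but stri_count_charclass with Unicode categories remains the more robust choice for newspaper text that may contain non-ASCII letters.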