Tags: r, regex, text, nlp

Apostrophes and regular expressions; Cleaning text in R


I am working on cleaning a large collection of text. My process thus far is:

  • Remove any non-ASCII characters
  • Remove URLs
  • Remove email addresses
  • Correct kerning (e.g., "B A D" becomes "BAD")
  • Correct elongated words (e.g., "baaaaaad" becomes "bad")
  • Ensure there is a space after every comma
  • Replace all numerals and punctuation with a space - except apostrophes
  • Remove any term 22 characters or longer (anything this size is likely garbage)
  • Remove any single letters that are leftover
  • Remove any blank lines
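For readers who want something concrete, here is a minimal sketch of several of the steps above. The function name clean_text and every pattern in it are illustrative assumptions, not the asker's actual code; it assumes UTF-8 input with one document line per vector element, and it leaves out the kerning and single-letter steps.

```r
# Hypothetical helper: a rough sketch of some of the listed cleaning steps.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")       # drop non-ASCII characters
  x <- gsub("(?:https?://|www\\.)\\S+", " ", x, perl = TRUE)  # remove URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)                  # remove email addresses
  x <- gsub("([[:alpha:]])\\1{2,}", "\\1", x, perl = TRUE)     # 3+ repeated letters -> 1
  x <- gsub(",(?=\\S)", ", ", x, perl = TRUE)                  # space after every comma
  x <- gsub("[^[:alpha:]'\\s]", " ", x, perl = TRUE)           # numerals/punctuation -> space, keep '
  x <- gsub("\\S{22,}", " ", x, perl = TRUE)                   # drop terms of 22+ characters
  x <- trimws(gsub("\\s+", " ", x, perl = TRUE))               # collapse runs of whitespace
  x[nzchar(x)]                                                 # drop blank lines
}

res <- clean_text(c("This is baaaaaad,really! See http://example.com now",
                    "",
                    "mail me@example.com it's 5pm"))
print(res)
```

The elongation rule here collapses any letter repeated three or more times to a single letter, which is only one possible reading of that step.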

My issue is in the next-to-last step. Originally, my code was:

gsub(pattern = "\\b\\S\\b", replacement = "", x = x, perl = TRUE)

but this wrecked the contractions that I had deliberately left in. Then I tried

gsub(pattern = "\\b(\\S^'\\s)\\b", replacement = "", x = x, perl = TRUE)

but this left a lot of single characters.
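A quick check on hypothetical sample strings suggests why both attempts misbehave. The variable names attempt1 and attempt2 are only for illustration.

```r
x <- "don't"   # hypothetical sample input

# Attempt 1: \S matches any non-space character, including the apostrophe,
# and \b holds on both sides of "'" and of the final "t", so both are deleted.
attempt1 <- gsub("\\b\\S\\b", "", x, perl = TRUE)
print(attempt1)   # "don"

# Attempt 2: outside a character class, ^ is a start-of-string anchor, so
# \S followed by ^ can never match and the input comes back unchanged.
attempt2 <- gsub("\\b(\\S^'\\s)\\b", "", "don't b c", perl = TRUE)
print(attempt2)   # "don't b c"
```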

Then I realized that I needed to keep three single-letter words: "A", "I", and "O" (either case).

Any suggestions?


Solution

  • You can use

    gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl=TRUE)
    

    Details:

    • (?i) - turns on case-insensitive matching
    • \b - a word boundary
    • (?<!') - no ' is allowed immediately on the left
    • (?![AOI]) - the next char cannot be A, I, or O (in either case, because of (?i))
    • \p{L} - any Unicode letter
    • \b - a word boundary
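As a rough illustration, the pattern can be tried on a few made-up strings. The sample input, the variable names, and the whitespace tidy-up afterwards are assumptions added here, not part of the answer.

```r
x <- c("I don't want a b or c", "O say can you see x", "it's a q")  # hypothetical sample

pattern <- "(?i)\\b(?<!')(?![AOI])\\p{L}\\b"
out <- gsub(pattern, "", x, perl = TRUE)
out <- trimws(gsub("[[:space:]]+", " ", out))   # tidy the spaces left behind
print(out)   # "I don't want a or" "O say can you see" "it's a"

# Possible variant (an assumption, not from the answer): also skip a single
# letter that is followed by an apostrophe, as in "y'all".
pattern_strict <- "(?i)\\b(?<!')(?![AOI])\\p{L}\\b(?!')"
before <- gsub(pattern, "", "y'all", perl = TRUE)          # "'all"
after  <- gsub(pattern_strict, "", "y'all", perl = TRUE)   # "y'all"
```

The lookbehind protects letters after an apostrophe ("don't", "it's"), but a single letter before one is still removed by the original pattern; the trailing (?!') is one way to guard that case if it matters for the corpus.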