r character gsub topic-modeling quanteda

removing special apostrophes from French article contractions when tokenizing

I am currently running an stm (structural topic model) on a series of articles from the French newspaper Le Monde. The model works fine, but I have a problem with the pre-processing of the text. I'm using the quanteda and tm packages for things like removing words, removing numbers, and so on. One thing doesn't seem to work, though. As some of you might know, in French the masculine definite article -le- contracts to -l'- before vowels. I've tried to remove -l'- (and similar forms like -d'-) as words with removeWords

lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))

but it only works with words that are separate from the rest of text, not with the articles that are attached to a word, such as in -l'arbre- (the tree). Frustrated, I've tried to give it a simple gsub

lmt67 <- gsub("l'","",lmt67)

but that doesn't seem to be working either. Now, what's a better way to do this, and possibly through a c(...) vector so that I can give it a series of expressions all together?

Just as context, lmt67 is a "large character" vector with 30,000 elements/articles, obtained by using the "texts" function on data imported from txt files.

Thanks to anyone that will want to help me.

Solution

I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.

txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche, 
                 n’en a pas dit mot devant la presse. En réalité, il s’agit d’une 
                 mesure essentiellement commerciale de ce pays qui l'importe.", 
         doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme 
                 successeur Jordi Sanchez, partisan de l’indépendance catalane, 
                 actuellement en prison pour sédition.")

The first method uses a pattern that matches the simple ASCII 39 apostrophe plus a set of Unicode variants, matched through the Unicode category "Pf" ("Punctuation: Final Quote"). However, quanteda does its best to normalize the quotes at the tokenization stage - see "l'indépendance" in the second document, for instance.

The second way below uses a French part-of-speech tagger integrated with quanteda. It allows a similar selection after recognizing and separating the prefixes, and then removing determiners (among other POS).

1. quanteda tokens

library(quanteda)

toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M"               "Trump"           "lors"            "d'une"           "réunion"        
# [6] "convoquée"       "d'urgence"       "à"               "la"              "Maison"         
# [11] "Blanche"         "n'en"            "a"               "pas"             "dit"            
# [16] "mot"             "devant"          "la"              "presse"          "En"             
# [21] "réalité"         "il"              "s'agit"          "d'une"           "mesure"         
# [26] "essentiellement" "commerciale"     "de"              "ce"              "pays"           
# [31] "qui"             "l'importe"      
# 
# doc2 :
# [1] "Réfugié"           "à"                 "Bruxelles"         "l'indépendantiste"
# [5] "catalan"           "a"                 "désigné"           "comme"            
# [9] "successeur"        "Jordi"             "Sanchez"           "partisan"         
# [13] "de"                "l'indépendance"    "catalane"          "actuellement"     
# [17] "en"                "prison"            "pour"              "sédition"

Then, we apply the pattern to match l', s', or d', using a regular expression replacement on the types (the unique tokens):

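# "^" anchors the match at the start of a type, so that an apostrophe
# inside a word (e.g. "aujourd'hui") is left alone
toks <- tokens_replace(
    toks, 
    types(toks), 
    stringi::stri_replace_all_regex(types(toks), "^[lsd]['\\p{Pf}]", "")
)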
# tokens from 2 documents.
# doc1 :
# [1] "M"               "Trump"           "lors"            "une"             "réunion"        
# [6] "convoquée"       "urgence"         "à"               "la"              "Maison"         
# [11] "Blanche"         "n'en"            "a"               "pas"             "dit"            
# [16] "mot"             "devant"          "la"              "presse"          "En"             
# [21] "réalité"         "il"              "agit"            "une"             "mesure"         
# [26] "essentiellement" "commerciale"     "de"              "ce"              "pays"           
# [31] "qui"             "importe"        
# 
# doc2 :
# [1] "Réfugié"         "à"               "Bruxelles"       "indépendantiste" "catalan"        
# [6] "a"               "désigné"         "comme"           "successeur"      "Jordi"          
# [11] "Sanchez"         "partisan"        "de"              "indépendance"    "catalane"       
# [16] "actuellement"    "en"              "prison"          "pour"            "sédition"
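
If you would rather clean the raw character vector before tokenizing, which is closer to the gsub attempt in the question, the same idea can be sketched in base R. This is only a minimal sketch: strip_elisions and its default prefix list are hypothetical names chosen for illustration, not part of quanteda or tm. It handles both the ASCII apostrophe and the typographic ’ (U+2019), and takes the prefixes as a c(...) vector.

```r
# Hypothetical helper (not from any package): remove elided prefixes such as
# l', d', qu' from a character vector, for either apostrophe variant.
strip_elisions <- function(x,
                           prefixes = c("l", "d", "s", "n", "j", "m", "t", "c", "qu")) {
  # Build one alternation from the prefix vector, e.g. (?:l|d|s|...|qu)
  alternation <- paste(prefixes, collapse = "|")
  # \\b requires the prefix to start a word, so the d' inside "aujourd'hui"
  # is not touched; ['\u2019] matches the ASCII or the typographic apostrophe.
  # Note: \\b treats only ASCII letters, digits and _ as word characters here.
  pattern <- paste0("\\b(?:", alternation, ")['\u2019]")
  gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
}

x <- c("l\u2019arbre et l'importe", "aujourd'hui, qu'il n\u2019en")
cleaned <- strip_elisions(x)
print(cleaned)
```

The prefix vector is the part to adjust: pass your own prefixes = c(...) if you want to keep some contractions (for example n') or remove others.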

From the resulting toks object you can form a dfm and then proceed to fit the STM.

2. using spacyr

This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)

library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)

toks <- spacy_parse(txt, lemma = FALSE) %>%
    as.tokens(include_pos = "pos") 
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN"                   "Trump/PROPN"               ",/PUNCT"                  
# [4] "lors/ADV"                  "d’/PUNCT"                  "une/DET"                  
# [7] "réunion/NOUN"              "convoquée/VERB"            "d’/ADP"                   
# [10] "urgence/NOUN"              "à/ADP"                     "la/DET"                   
# [13] "Maison/PROPN"              "Blanche/PROPN"             ",/PUNCT"                  
# [16] "\n                 /SPACE" "n’/VERB"                   "en/PRON"                  
# [19] "a/AUX"                     "pas/ADV"                   "dit/VERB"                 
# [22] "mot/ADV"                   "devant/ADP"                "la/DET"                   
# [25] "presse/NOUN"               "./PUNCT"                   "En/ADP"                   
# [28] "réalité/NOUN"              ",/PUNCT"                   "il/PRON"                  
# [31] "s’/AUX"                    "agit/VERB"                 "d’/ADP"                   
# [34] "une/DET"                   "\n                 /SPACE" "mesure/NOUN"              
# [37] "essentiellement/ADV"       "commerciale/ADJ"           "de/ADP"                   
# [40] "ce/DET"                    "pays/NOUN"                 "qui/PRON"                 
# [43] "l'/DET"                    "importe/NOUN"              "./PUNCT"                  
# 
# doc2 :
# [1] "Réfugié/VERB"              "à/ADP"                     "Bruxelles/PROPN"          
# [4] ",/PUNCT"                   "l’/PRON"                   "indépendantiste/ADJ"      
# [7] "catalan/VERB"              "a/AUX"                     "désigné/VERB"             
# [10] "comme/ADP"                 "\n                 /SPACE" "successeur/NOUN"          
# [13] "Jordi/PROPN"               "Sanchez/PROPN"             ",/PUNCT"                  
# [16] "partisan/VERB"             "de/ADP"                    "l’/DET"                   
# [19] "indépendance/ADJ"          "catalane/ADJ"              ",/PUNCT"                  
# [22] "\n                 /SPACE" "actuellement/ADV"          "en/ADP"                   
# [25] "prison/NOUN"               "pour/ADP"                  "sédition/NOUN"            
# [28] "./PUNCT"

Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:

toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN"             "Trump/PROPN"         "lors/ADV"            "réunion/NOUN"        "convoquée/VERB"     
# [6] "urgence/NOUN"        "Maison/PROPN"        "Blanche/PROPN"       "n’/VERB"             "pas/ADV"            
# [11] "dit/VERB"            "mot/ADV"             "presse/NOUN"         "réalité/NOUN"        "agit/VERB"          
# [16] "mesure/NOUN"         "essentiellement/ADV" "commerciale/ADJ"     "pays/NOUN"           "importe/NOUN"       
# 
# doc2 :
# [1] "Réfugié/VERB"        "Bruxelles/PROPN"     "indépendantiste/ADJ" "catalan/VERB"        "désigné/VERB"       
# [6] "successeur/NOUN"     "Jordi/PROPN"         "Sanchez/PROPN"       "partisan/VERB"       "indépendance/ADJ"   
# [11] "catalane/ADJ"        "actuellement/ADV"    "prison/NOUN"         "sédition/NOUN"

Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.

## remove the tags
toks <- tokens_replace(toks, types(toks), 
                       stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M."              "Trump"           "lors"            "réunion"         "convoquée"      
# [6] "urgence"         "Maison"          "Blanche"         "n’"              "pas"            
# [11] "dit"             "mot"             "presse"          "réalité"         "agit"           
# [16] "mesure"          "essentiellement" "commerciale"     "pays"            "importe"        
# 
# doc2 :
# [1] "Réfugié"         "Bruxelles"       "indépendantiste" "catalan"         "désigné"        
# [6] "successeur"      "Jordi"           "Sanchez"         "partisan"        "indépendance"   
# [11] "catalane"        "actuellement"    "prison"          "sédition"

From there, you can use the toks object to form your dfm and fit the model.
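
As a rough sketch of that last step, the tokens can be turned into a dfm and converted to the list format that the stm package expects. The two short documents below are made up for illustration and stand in for the cleaned toks object; the object names dfmat and stm_input are arbitrary.

```r
library(quanteda)

# Stand-in for the cleaned tokens object from either method above
toks_clean <- tokens(c(doc1 = "réunion urgence presse mesure commerciale",
                       doc2 = "indépendance catalane prison sédition"))

# Document-feature matrix (dfm() lowercases features by default)
dfmat <- dfm(toks_clean)

# Convert to the stm input format: a list holding documents, vocab and meta
stm_input <- convert(dfmat, to = "stm")
names(stm_input)
```

The documents and vocab elements of stm_input can then be passed to stm::stm() along with a number of topics K that you choose for your corpus.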