
R {quanteda}: remove accents in a dictionary


I want to remove accents and punctuation from a quanteda dictionary. For example, I want to transform "à l'épreuve" into "a l epreuve". The dictionary is this one (.cat format): https://www.poltext.org/fr/donnees-et-analyses/lexicoder. There are explanations for data frames (see "Remove accents from a dataframe column in R"), but I could not find a way to do the same for dictionaries.

My code so far:

dict_lg <- dictionary(file = "frlsd/frlsd.cat", encoding = "UTF-8")

Any suggestions?


Solution

  • This should work:

    library(quanteda)
    library(stringi)
    library(stringr)
    
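    ## rapply() visits every set of terms in the (possibly nested) dictionary;
    ## how = 'replace' keeps the key structure and swaps in the cleaned terms
    dict_lg_ascii <-
      dict_lg |>
      rapply(f = \(term) term |>
                  ## compose from string utilities as desired
                  stri_trans_general(id = 'Latin-ASCII') |>  # strip accents: é -> e
                  str_replace_all(pattern = '[[:punct:]]', replacement = ' '),  # punctuation -> space
             how = 'replace'
             )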
    

    Output:

    ## > dict_lg_ascii
    Dictionary object with 2 primary key entries and 2 nested levels.
    - [NEGATIVE]:
      - a cornes, a court de personnel , a l etroit, a peine , abais , 
    ## truncated
    

    From the docs:

    Dictionaries can be subsetted using [ and [[, operating the same as the equivalent list operators.

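    Since a dictionary behaves like a nested list, rapply (which recursively applies a function over nested lists) works on it. In this case, we apply stri_trans_general to strip the accents (as suggested in a linked answer) and then str_replace_all to replace punctuation with spaces.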
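
A self-contained sketch of the same idea on a plain nested list, in case it helps to see what rapply does without loading the dictionary file. The list toy and the helper clean_term are made up for illustration and are not part of quanteda. One assumption worth flagging: the trailing spaces in the output above ("a peine ", "abais ") suggest that the original entries end in the glob wildcard *, which [[:punct:]] also matches. If those wildcards matter for matching, a variant along these lines keeps them and also tidies the leftover spaces:

```r
library(stringi)

# Hypothetical stand-in for the dictionary: a plain nested list of terms.
# Accented characters are written as \u escapes so the script does not
# depend on the encoding of the source file.
toy <- list(
  NEGATIVE = c("\u00e0 l'\u00e9preuve", "\u00e0 peine*", "abais*"),
  POSITIVE = list(SUB = c("r\u00e9ussi*"))
)

# Hypothetical helper: clean one character vector of terms
clean_term <- function(term) {
  # transliterate accented Latin characters to plain ASCII
  term <- stri_trans_general(term, id = "Latin-ASCII")
  # replace punctuation with a space, except the glob wildcard '*'
  term <- gsub("(?!\\*)[[:punct:]]", " ", term, perl = TRUE)
  # collapse runs of spaces and trim the ends
  trimws(gsub(" +", " ", term))
}

# Apply the helper to every leaf of the nested list, keeping the structure
toy_ascii <- rapply(toy, clean_term, how = "replace")
print(toy_ascii)
```

On the real dictionary the same helper would presumably be used as rapply(dict_lg, clean_term, how = 'replace'); that call is not tested against frlsd.cat here. If the result ever comes back as a plain list, passing it through dictionary() should turn it into a dictionary again.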