Tags: r, spacyr, quanteda

How can I separate words in a corpus according to their POS?


I’m exploring a textual corpus and I would like to be able to separate words by their grammatical type, for example to consider only verbs and nouns.

I use spacyr for lemmatization via the spacy_parse() function, and I have seen in the quanteda reference (https://quanteda.io/reference/as.tokens.html) that there is an as.tokens() function that lets me build a tokens object from the result of spacy_parse().

as.tokens(
  x,
  concatenator = "/",
  include_pos = c("none", "pos", "tag"),
  use_lemma = FALSE,
  ...
)
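For context, the upstream spacyr step can be sketched like this (assumptions: a working spaCy installation behind spacyr, `fr_core_news_sm` as an example French model name, and `my_french_texts` as a placeholder for your corpus — substitute your own model and text):

```r
library(spacyr)
library(quanteda)

# Assumes spaCy is installed and a French model is available;
# "fr_core_news_sm" is an example model name, not from the question.
spacy_initialize(model = "fr_core_news_sm")

# Parse the corpus; pos = TRUE and lemma = TRUE request the
# part-of-speech and lemma columns used by as.tokens() below.
parsed <- spacy_parse(my_french_texts, pos = TRUE, lemma = TRUE)

# Build "lemma/POS" tokens such as "penser/VERB"
toks <- as.tokens(parsed, include_pos = "pos", use_lemma = TRUE)

spacy_finalize()
```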

This way, I can get back something that looks like this (text is in French):

etu1_repres_1 :
 [1] "OK/PROPN"        ",/PUNCT"         "déjà/ADV"        ",/PUNCT"         "je/PRON"         "pense/VERB"      "que/SCONJ"      
 [8] "je/PRON"         "être/AUX"        "influencer/VERB" "de/ADP"          "par/ADP"

Let’s say I would like to separate the tokens and keep only tokens of type PRON and VERB.

Q1: How can I separate them from the other tokens to keep only:

etu1_repres_1 :
[1] "je/PRON"         "pense/VERB"  "je/PRON"        "influencer/VERB"

Q2: How can I remove the "/PRON" or "/VERB" part of each token, so that I can build a document-feature matrix containing only the lemmas?

Thanks a lot for helping,

Gabriel


Solution

    library("quanteda")
    #> Package version: 3.2.1
    #> Unicode version: 14.0
    #> ICU version: 70.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    toks <- 
      as.tokens(list(etu1_repres_1 = c("OK/PROPN", ",/PUNCT", "déjà/ADV", ",/PUNCT", 
                                       "je/PRON", "pense/VERB", "que/SCONJ", "je/PRON", 
                                       "être/AUX", "influencer/VERB", "de/ADP", "par/ADP")))
    
    # part 1
    toks2 <- tokens_keep(toks, c("*/PRON", "*/VERB"))
    toks2
    #> Tokens consisting of 1 document.
    #> etu1_repres_1 :
    #> [1] "je/PRON"         "pense/VERB"      "je/PRON"         "influencer/VERB"
    
    # part 2
    toks3 <- tokens_split(toks2, "/") |>
      tokens_remove(c("PRON", "VERB"))
    toks3
    #> Tokens consisting of 1 document.
    #> etu1_repres_1 :
    #> [1] "je"         "pense"      "je"         "influencer"
    dfm(toks3)
    #> Document-feature matrix of: 1 document, 3 features (0.00% sparse) and 0 docvars.
    #>                features
    #> docs            je pense influencer
    #>   etu1_repres_1  2     1          1
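One caveat with the tokens_split() + tokens_remove() approach: if a document ever contained a literal token "PRON" or "VERB", it would be deleted too. A sketch of an alternative that strips the tag suffix from the token types directly instead (the regex `"/[A-Z]+$"` is my assumption about the tag format):

```r
library(quanteda)

toks2 <- as.tokens(list(etu1_repres_1 = c("je/PRON", "pense/VERB",
                                          "je/PRON", "influencer/VERB")))

# Rewrite each distinct token type, dropping the trailing "/TAG" part;
# tokens_replace() maps old types to new types in one pass.
toks_lemma <- tokens_replace(toks2,
                             pattern     = types(toks2),
                             replacement = sub("/[A-Z]+$", "", types(toks2)),
                             valuetype   = "fixed")
toks_lemma
```

This keeps the token order and document structure intact, and dfm(toks_lemma) then gives the same lemma-only matrix.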
    

    Created on 2022-08-19 by the reprex package (v2.0.1)