Tags: r, spacyr, quanteda

How can I separate words in a corpus according to their POS?


I’m exploring a textual corpus and I would like to be able to separate words by their grammatical type, for example to consider only verbs and nouns.

I use spacyr for lemmatization via the spacy_parse() function, and I have seen in the quanteda reference (https://quanteda.io/reference/as.tokens.html) that there is an as.tokens() function that lets me build a tokens object from the result of spacy_parse().

as.tokens(
  x,
  concatenator = "/",
  include_pos = c("none", "pos", "tag"),
  use_lemma = FALSE,
  ...
)
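For context, the upstream spacyr step can be sketched like this (assumptions: a working spaCy installation behind spacyr, `fr_core_news_sm` as an example French model name, and `my_french_texts` as a placeholder for your corpus — substitute your own model and text):

```r
library(spacyr)
library(quanteda)

# Assumes spaCy is installed and a French model is available;
# "fr_core_news_sm" is an example model name, not from the question.
spacy_initialize(model = "fr_core_news_sm")

# Parse the corpus; pos = TRUE and lemma = TRUE request the
# part-of-speech and lemma columns used by as.tokens() below.
parsed <- spacy_parse(my_french_texts, pos = TRUE, lemma = TRUE)

# Build "lemma/POS" tokens such as "penser/VERB"
toks <- as.tokens(parsed, include_pos = "pos", use_lemma = TRUE)

spacy_finalize()
```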

This way, I can get back something that looks like this (text is in French):

etu1_repres_1 :
 [1] "OK/PROPN"        ",/PUNCT"         "déjà/ADV"        ",/PUNCT"         "je/PRON"         "pense/VERB"      "que/SCONJ"      
 [8] "je/PRON"         "être/AUX"        "influencer/VERB" "de/ADP"          "par/ADP"

Let’s say I would like to separate the tokens and keep only tokens of type PRON and VERB.

Q1: How can I separate them from the other tokens to keep only:

etu1_repres_1 :
[1] "je/PRON"         "pense/VERB"  "je/PRON"        "influencer/VERB"

Q2: How can I remove the "/PRON" or "/VERB" part of each token, so that I can build a document-feature matrix containing only the lemmas?

Thanks a lot for helping,

Gabriel


Solution

    library("quanteda")
    #> Package version: 3.2.1
    #> Unicode version: 14.0
    #> ICU version: 70.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    toks <- 
      as.tokens(list(etu1_repres_1 = c("OK/PROPN", ",/PUNCT", "déjà/ADV", ",/PUNCT", 
                                       "je/PRON", "pense/VERB", "que/SCONJ", "je/PRON", 
                                       "être/AUX", "influencer/VERB", "de/ADP", "par/ADP")))
    
    # part 1
    toks2 <- tokens_keep(toks, c("*/PRON", "*/VERB"))
    toks2
    #> Tokens consisting of 1 document.
    #> etu1_repres_1 :
    #> [1] "je/PRON"         "pense/VERB"      "je/PRON"         "influencer/VERB"
    
    # part 2
    toks3 <- tokens_split(toks2, "/") |>
      tokens_remove(c("PRON", "VERB"))
    toks3
    #> Tokens consisting of 1 document.
    #> etu1_repres_1 :
    #> [1] "je"         "pense"      "je"         "influencer"
    dfm(toks3)
    #> Document-feature matrix of: 1 document, 3 features (0.00% sparse) and 0 docvars.
    #>                features
    #> docs            je pense influencer
    #>   etu1_repres_1  2     1          1
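One caveat with the tokens_split() + tokens_remove() approach: if a document ever contained a literal token "PRON" or "VERB", it would be deleted too. A sketch of an alternative that strips the tag suffix from the token types directly instead (the regex `"/[A-Z]+$"` is my assumption about the tag format):

```r
library(quanteda)

toks2 <- as.tokens(list(etu1_repres_1 = c("je/PRON", "pense/VERB",
                                          "je/PRON", "influencer/VERB")))

# Rewrite each distinct token type, dropping the trailing "/TAG" part;
# tokens_replace() maps old types to new types in one pass.
toks_lemma <- tokens_replace(toks2,
                             pattern     = types(toks2),
                             replacement = sub("/[A-Z]+$", "", types(toks2)),
                             valuetype   = "fixed")
toks_lemma
```

This keeps the token order and document structure intact, and dfm(toks_lemma) then gives the same lemma-only matrix.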
    

    Created on 2022-08-19 by the reprex package (v2.0.1)