I’m exploring a textual corpus and I would like to separate words by their grammatical type, for example to consider only verbs and nouns.
I use spacyr to do the lemmatization with the spacy_parse() function, and I have seen in the quanteda reference (https://quanteda.io/reference/as.tokens.html) that there is an as.tokens() function that lets me build a tokens object from the result of spacy_parse():
as.tokens(
  x,
  concatenator = "/",
  include_pos = c("none", "pos", "tag"),
  use_lemma = FALSE,
  ...
)
This way, I can get back something that looks like this (text is in French):
etu1_repres_1 :
[1] "OK/PROPN" ",/PUNCT" "déjà/ADV" ",/PUNCT" "je/PRON" "pense/VERB" "que/SCONJ"
[8] "je/PRON" "être/AUX" "influencer/VERB" "de/ADP" "par/ADP"
Let’s say I would like to keep only the tokens of type PRON and VERB.
Q1: How can I separate them from the other tokens to keep only:
etu1_repres_1 :
[1] "je/PRON" "pense/VERB" "je/PRON" "influencer/VERB"
Q2: How can I remove the "/PRON" or "/VERB" part of each token, so that I can build a document-feature matrix containing only the lemmas?
Thanks a lot for helping,
Gabriel
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <-
as.tokens(list(etu1_repres_1 = c("OK/PROPN", ",/PUNCT", "déjà/ADV", ",/PUNCT",
"je/PRON", "pense/VERB", "que/SCONJ", "je/PRON",
"être/AUX", "influencer/VERB", "de/ADP", "par/ADP")))
# part 1: keep only the PRON and VERB tokens
toks2 <- tokens_keep(toks, c("*/PRON", "*/VERB"))
toks2
#> Tokens consisting of 1 document.
#> etu1_repres_1 :
#> [1] "je/PRON" "pense/VERB" "je/PRON" "influencer/VERB"
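tokens_keep() matches "glob" patterns by default; if you prefer an anchored regular expression (so the tag can only match at the end of a token), you can pass valuetype = "regex" instead. A minimal sketch on the same toks object:

```r
library(quanteda)

# Same tagged tokens as above
toks <- as.tokens(list(etu1_repres_1 = c("OK/PROPN", ",/PUNCT", "déjà/ADV", ",/PUNCT",
                                         "je/PRON", "pense/VERB", "que/SCONJ", "je/PRON",
                                         "être/AUX", "influencer/VERB", "de/ADP", "par/ADP")))

# Keep tokens whose trailing tag is PRON or VERB, using a regex instead of a glob
toks2b <- tokens_keep(toks, "/(PRON|VERB)$", valuetype = "regex")
as.list(toks2b)$etu1_repres_1
#> [1] "je/PRON"        "pense/VERB"     "je/PRON"        "influencer/VERB"
```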
# part 2: strip the POS tags to keep only the lemmas
toks3 <- tokens_split(toks2, "/") |>
tokens_remove(c("PRON", "VERB"))
toks3
#> Tokens consisting of 1 document.
#> etu1_repres_1 :
#> [1] "je" "pense" "je" "influencer"
dfm(toks3)
#> Document-feature matrix of: 1 document, 3 features (0.00% sparse) and 0 docvars.
#> features
#> docs je pense influencer
#> etu1_repres_1 2 1 1
Created on 2022-08-19 by the reprex package (v2.0.1)
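A note on part 2: tokens_split() followed by tokens_remove() works here, but it would also drop a genuine token that happened to equal "PRON" or "VERB". An alternative sketch, using base R to strip the trailing "/TAG" from the underlying character vector before rebuilding the tokens object:

```r
library(quanteda)

# Tagged tokens kept after part 1
kept <- c("je/PRON", "pense/VERB", "je/PRON", "influencer/VERB")

# Drop the trailing "/TAG" part of each token with a regular expression
lemmas <- sub("/[A-Z]+$", "", kept)

# Rebuild a tokens object and the document-feature matrix
toks3b <- as.tokens(list(etu1_repres_1 = lemmas))
dfm(toks3b)
```

Because the regex is anchored at the end of the token, only the appended tag is removed, never part of the lemma itself.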