I am trying to replicate this paper.
In the tokens.R script, the corpus is cleaned up with the following command:
texts(corp) <- stri_replace_all_regex(texts(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")
Which yields the following error message:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [387896] must be the same length as the vector [4]
In addition: Warning message:
'texts.corpus' is deprecated.
Use 'as.character' instead.
See help("Deprecated")
So I naively applied the 'as.character' function like this:
as.character(corp) <- stri_replace_all_regex(as.character(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")
Which yields the same error:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [387896] must be the same length as the vector [4]
I tried some other things, like only addressing $documents within the corpus or turning the corpus into a vector, but none of that really worked.
How can I get around this?
Thank you in advance.
The "corpus" being loaded in the linked .R file tokens.R is a very old-format corpus object (from data/corpus_nytimes_summary.RDS).
You can convert this into a new format corpus using:
corp <- corpus(corp)
Then replace the texts using this approach:
corp[] <- stri_replace_all_regex(corp, "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")
The use of corp[] replaces the character part of corp without stripping the additional attributes (metadata and docvars) that make the character object corp a quanteda corpus.
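Putting the two steps together, a minimal sketch might look like this (assuming quanteda v3+ and stringi are installed; the RDS path is taken from the paper's repository and is an assumption here):

```r
library(quanteda)
library(stringi)

# Load the old-format corpus object shipped with the replication files
corp <- readRDS("data/corpus_nytimes_summary.RDS")

# Upgrade it to the current quanteda corpus format
corp <- corpus(corp)

# Replace the texts in place; corp[] <- keeps docvars and metadata intact,
# whereas corp <- ... would yield a plain character vector
corp[] <- stri_replace_all_regex(
  corp,
  "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)",
  ""
)
```

After this, corp is still a quanteda corpus and can be passed on to tokens() and the rest of the script unchanged.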