A related question can be found here, but it does not directly tackle the issue I discuss below.
My goal is to remove any digits that occur within a token. For instance, I want to be able to get rid of the numbers in cases like 13f, 408-k, 10-k, etc. I am using quanteda as the main tool. I have a classic corpus object which I tokenized using the function tokens(). The argument remove_numbers = TRUE does not seem to work in such cases: it simply ignores these tokens and leaves them where they are. If I use tokens_remove() with a specific regex, the whole tokens are removed, which is something I want to avoid since I am interested in the remaining textual content.
Here is a minimal example showing how I solved the issue with the function str_remove_all() from stringr. It works, but it can be very slow for big objects.
My question is: is there a way to achieve the same result without leaving quanteda (e.g., on an object of class tokens)?
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(stringr)
mytext = c( "This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101.")
# Tokenizing
mytokens = tokens(mytext,
remove_punct = TRUE,
remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> text2 :
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "123asd"
#> [11] "and" "well101"
# The tokens "123asd" and "well101" are still there.
# I can be more specific using a regex, but this removes the tokens altogether
#
mytokens_wrong = tokens_remove( mytokens, pattern = "[[:digit:]]", valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> text2 :
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "and"
# This is the workaround, which works but can be very slow.
# I am using the stringr::str_remove_all() function
#
mytokens_ok = lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) )
mytokens_ok
#> $text1
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> $text2
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "asd"
#> [11] "and" "well"
Created on 2021-02-15 by the reprex package (v0.3.0)
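One more note on the workaround: lapply() returns a plain list, not a tokens object. Assuming a tokens object is still needed downstream, I believe the list can be converted back with quanteda's as.tokens() (a sketch on my side, not part of the reprex above):
# Sketch: convert the list produced by lapply() back into a quanteda tokens object
# via as.tokens(); the digit-stripped tokens themselves are unchanged.
mytokens_ok2 = as.tokens( lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) ) )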
In this case you could (ab)use tokens_split. You split the tokens on the digits, and by default tokens_split removes the separator. In this way you can do everything within quanteda.
library(quanteda)
mytext = c( "This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101.")
# Tokenizing
mytokens = tokens(mytext,
remove_punct = TRUE,
remove_numbers = TRUE)
tokens_split(mytokens, separator = "[[:digit:]]", valuetype = "regex")
Tokens consisting of 2 documents.
text1 :
[1] "This" "is" "a" "sentence" "with" "correctly" "spaced" "digits" "like"
[10] "K"
text2 :
[1] "This" "is" "a" "sentence" "with" "uncorrectly" "spaced" "digits"
[9] "like" "asd" "and" "well"