Remove digits glued to words for quanteda objects of class tokens


A related question can be found here, but it does not directly address the issue I discuss below.

My goal is to remove any digits that are glued to a token. For instance, I want to strip the numbers from tokens like 13f, 408-k, 10-k, etc. I am using quanteda as the main tool. I have a classic corpus object which I tokenized using the function tokens(). The argument remove_numbers = TRUE does not help in such cases: it only drops tokens that consist entirely of numbers, so mixed tokens are left where they are. If I use tokens_remove() with a regex matching digits, the whole token is removed, which I want to avoid because I am interested in the remaining textual content.

Here is a minimal example showing how I solved the issue with the function str_remove_all() from stringr. It works, but it can be very slow for big objects.

My question is: is there a way to achieve the same result without leaving quanteda (e.g., on an object of class tokens)?

library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(stringr)

mytext = c( "This is a sentence with correctly spaced digits like K 16.",
            "This is a sentence with uncorrectly spaced digits like 123asd and well101.")

# Tokenizing
mytokens = tokens(mytext, 
                  remove_punct = TRUE,
                  remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "123asd"     
#> [11] "and"         "well101"

# the tokens "123asd" and "well101" are still there.
# I can be more specific using a regex but this removes the tokens altogether
# 
mytokens_wrong = tokens_remove( mytokens, pattern = "[[:digit:]]", valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "and"

# This is the workaround which seems to be working but can be very slow.
# I am using stringr::str_remove_all() function
# 
mytokens_ok = lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) )
mytokens_ok
#> $text1
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> $text2
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
#> [11] "and"         "well"

Created on 2021-02-15 by the reprex package (v0.3.0)


Solution

  • In this case you could (ab)use tokens_split(). You split the tokens on the digits, and by default tokens_split() removes the separator, so only the non-digit parts remain. This way you can do everything in quanteda.

    library(quanteda)
    
    mytext = c( "This is a sentence with correctly spaced digits like K 16.",
                "This is a sentence with uncorrectly spaced digits like 123asd and well101.")
    
    # Tokenizing
    mytokens = tokens(mytext, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE)
    
    tokens_split(mytokens, separator = "[[:digit:]]", valuetype = "regex")
    #> Tokens consisting of 2 documents.
    #> text1 :
    #>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly" "spaced"    "digits"    "like"     
    #> [10] "K"        
    #> 
    #> text2 :
    #>  [1] "This"        "is"          "a"           "sentence"    "with"        "uncorrectly" "spaced"      "digits"     
    #>  [9] "like"        "asd"         "and"         "well"
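
    One caveat with splitting: a token with digits in the middle (say a1b) would come out as two tokens, a and b, whereas str_remove_all() would give the single token ab. If that matters, a possible alternative is to rewrite the unique token types instead of splitting. The following is only a sketch, assuming types() and tokens_replace() behave as documented in quanteda; the names old_types, new_types and toks_clean are illustrative. The regex runs once per unique type rather than once per token, which is usually much cheaper on large objects than looping over documents.

```r
library(quanteda)

mytext <- c("This is a sentence with correctly spaced digits like K 16.",
            "This is a sentence with uncorrectly spaced digits like 123asd and well101.")

toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)

# Unique token types in the object
old_types <- types(toks)

# Strip digits from each type (base R; no stringr needed)
new_types <- gsub("[[:digit:]]", "", old_types)

# Map every old type to its digit-free version, matching exactly.
# Assumption: no type consists only of digits here, because
# remove_numbers = TRUE has already dropped such tokens; otherwise
# the replacement would be an empty string.
toks_clean <- tokens_replace(toks,
                             pattern = old_types,
                             replacement = new_types,
                             valuetype = "fixed",
                             case_insensitive = FALSE)
toks_clean
```

    The result is still an object of class tokens, so it can be passed on to dfm() or other quanteda functions as usual.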