
Is there any way to split quanteda tokens into n equal parts?


I'm performing text analysis using the quanteda package in R.

I have a set of text documents that I have already tokenized. Each consists of a different number of tokens. I want to split the tokens of each text into N equal chunks (e.g. 10 or 20 chunks, each containing the same number of tokens within a given text).

Assume my data is called text_docs and looks as follows:

Text  | Tokens
Text1 | "this" "is" "an" "example" "this" "is" "an" "example"
Text2 | "this" "is" "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" "an" "example" "this" "is" "an" "example"

The results that I would like to get should look like this (with two chunks instead of twenty):

Text  | Chunk1                                 | Chunk2
Text1 | "this" "is" "an" "example"             | "this" "is" "an" "example"
Text2 | "this" "is"                            | "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" | "an" "example" "this" "is" "an" "example"

I'm aware of the tokens_chunk() function in quanteda. However, that function only lets me create chunks of a fixed size (e.g. each chunk consists of two tokens), which leaves each text with a different number of chunks. Furthermore, the size argument of tokens_chunk() has to be a single integer, which is why I can't simply do chunks <- tokens_chunk(text_docs, size = ntoken(text_docs) / 20).
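
To illustrate the fixed-size behavior (text_docs here stands for my tokens object above):

chunks <- tokens_chunk(text_docs, size = 2)
# Fixed-size chunks: each text ends up with a different number of chunks,
# the opposite of what I need.
# Text1 (8 tokens)  -> 4 chunks
# Text2 (4 tokens)  -> 2 chunks
# Text3 (12 tokens) -> 6 chunks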

Any idea?

Thank you in advance.


Solution

    library("quanteda")
    ## Package version: 2.1.2
    
    toks <- c(
      Text1 = "this is an example this is an example",
      Text2 = "this is an example",
      Text3 = "this is an example this is an example this is an example"
    ) %>%
      tokens()
    
    toks
    ## Tokens consisting of 3 documents.
    ## Text1 :
    ## [1] "this"    "is"      "an"      "example" "this"    "is"      "an"     
    ## [8] "example"
    ## 
    ## Text2 :
    ## [1] "this"    "is"      "an"      "example"
    ## 
    ## Text3 :
    ##  [1] "this"    "is"      "an"      "example" "this"    "is"      "an"     
    ##  [8] "example" "this"    "is"      "an"      "example"
    

    Here's one way to do what you want: lapply() over the docnames to slice out each document, then split it using tokens_chunk() with a size equal to half its length. I use ceiling() so that if a document has an odd number of tokens, its first chunk gets one more token than its second. (Your example documents all have even token counts, but this handles the odd case too.)

    lis <- lapply(
      docnames(toks),
      function(x) tokens_chunk(toks[x], size = ceiling(ntoken(toks[x]) / 2))
    )
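
    Each element of lis is itself a tokens object holding one document's chunks; for instance:

    lis[[2]]  # the split of Text2: two documents, Text2.1 and Text2.2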
    

    That results in a list of split tokens objects. You can recombine them with c(), which concatenates tokens objects, by applying it across the list with do.call():

    do.call("c", lis)
    ## Tokens consisting of 6 documents.
    ## Text1.1 :
    ## [1] "this"    "is"      "an"      "example"
    ## 
    ## Text1.2 :
    ## [1] "this"    "is"      "an"      "example"
    ## 
    ## Text2.1 :
    ## [1] "this" "is"  
    ## 
    ## Text2.2 :
    ## [1] "an"      "example"
    ## 
    ## Text3.1 :
    ## [1] "this"    "is"      "an"      "example" "this"    "is"     
    ## 
    ## Text3.2 :
    ## [1] "an"      "example" "this"    "is"      "an"      "example"
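
    This generalizes to any number of chunks. A minimal sketch (the helper name split_tokens_n() is my own; note that with ceiling-based sizing, a document's final chunk can be smaller than the others, and some lengths yield one chunk fewer than n):

    split_tokens_n <- function(x, n) {
      # split each document into chunks of ceiling(length / n) tokens,
      # then concatenate the per-document results into one tokens object
      lis <- lapply(
        docnames(x),
        function(d) tokens_chunk(x[d], size = ceiling(ntoken(x[d]) / n))
      )
      do.call("c", lis)
    }

    split_tokens_n(toks, 2) # same six documents as above; use n = 10 or 20 on real data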