Corpus extraction with changing data type in R


I have a corpus of text files that contain only plain text. I want to extract the n-grams from each text and save them, under the original file name, as matrices with 3 columns.

    library(tm)
    library(tokenizers)

    # tokenize one document into n-grams and set up a 3-column matrix for them
    myTokenizer <- function(x, n, n_min) {
      tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
      M <- matrix(nrow = length(tok), ncol = 3,
                  dimnames = list(NULL, c("gram", "num.words", "words")))
    }

    corp <- VCorpus(VectorSource("this is a full text "))
    corp <- tm_map(corp, content_transformer(function(x) myTokenizer(x, n = 3, n_min = 1)))

    writeCorpus(corp)

Solution

  • I would recommend creating a document-term matrix (DTM). You will probably need one in your downstream tasks anyway, and you can also extract the information you want from it. However, it is probably not reasonable to assume that a term (including n-grams) comes from only a single document (at least this is what I understood from your question; please correct me if I am wrong). In practice, one term will usually have several documents associated with it, and this kind of information is what a DTM stores.

    Below is an example with text2vec. If you can elaborate on how you want to use your terms, I could adapt the code to your needs.

    library(text2vec)
    # I have set up two texts that do not share any term, just as an example;
    # in practice, this probably never happens
    docs = c(d1 = c("here a text"), d2 = c("and another one"))
    it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
    v = create_vocabulary(it, ngram = c(1,3))
    vectorizer = vocab_vectorizer(v)
    dtm = create_dtm(it, vectorizer)
    as.matrix(dtm)
    #    a a_text and and_another and_another_one another another_one here here_a here_a_text one text
    # d1 1      1   0           0               0       0           0    1      1           1   0    1
    # d2 0      0   1           1               1       1           1    0      0           0   1    0
    
    library(stringi)
    docs = c(d1 = c("here a text"), d2 = c("and another one"))
    it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
    v = create_vocabulary(it, ngram = c(1,3))
    vectorizer = vocab_vectorizer(v)
    dtm = create_dtm(it, vectorizer)
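    for (d in rownames(dtm)) {
      # keep only the terms that occur in document d
      counts = dtm[d, ]
      counts = counts[counts != 0]
      terms = data.frame(number = seq_along(counts),
                         term = names(counts))
      # n-gram size = number of "_" separators + 1
      terms$n = stri_count_fixed(terms$term, "_") + 1
      write.csv(terms, file = paste0("v_", d, ".csv"), row.names = FALSE)
    }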
    read.csv("v_d1.csv")
    #   number        term n
    # 1      1           a 1
    # 2      2      a_text 2
    # 3      3        here 1
    # 4      4      here_a 2
    # 5      5 here_a_text 3
    # 6      6        text 1
    read.csv("v_d2.csv")
    #   number            term n
    # 1      1             and 1
    # 2      2     and_another 2
    # 3      3 and_another_one 3
    # 4      4         another 1
    # 5      5     another_one 2
    # 6      6             one 1
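
The two example documents above share no terms, so every term maps to exactly one document. As a minimal sketch of the more typical case, the snippet below uses two overlapping texts and lists, for each term, the documents it occurs in. The example texts and the name `docs_per_term` are made up for illustration; the text2vec calls are the same ones used above.

```r
library(text2vec)

# Two hypothetical documents that share the words "here" and "text"
docs <- c(d1 = "here a text", d2 = "here another text")

it <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it, ngram = c(1L, 3L))
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Dense copy for easy inspection (fine for a toy example, avoid for large corpora)
m <- as.matrix(dtm)

# For each term, collect the names of the documents in which it occurs
docs_per_term <- lapply(colnames(m), function(term) rownames(m)[m[, term] > 0])
names(docs_per_term) <- colnames(m)

docs_per_term[["here"]]    # shared unigram: occurs in both documents
docs_per_term[["here_a"]]  # bigram that occurs only in d1
```

Here a shared unigram such as "here" is associated with both d1 and d2, while a bigram such as "here_a" belongs to d1 only. This is why a single "original file name" per term is usually not enough, and why the per-document export in the loop above writes one file per document rather than one per term.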