
Retaining unique identifiers (e.g., record id) when using tm functions - doesn't work with lots of data?


I am working with unstructured text (Facebook) data and pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) IDs while pre-processing. I have a solution that works on a subset of the data but fails with the full data set (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with my work-around of relying on row names: it appears to work with the first ~15K posts, but when I keep subsetting, it fails. I realize my code is less than elegant, so I am happy to learn better/completely different solutions; all I care about is keeping the IDs when I go to VCorpus and then back again. I'm new to the tm package and the readTabular function in particular. (Note: I ran tolower and removeWords before making the VCorpus, as I originally thought that was part of the issue.)

Working code is below:

Sample data

library(tm)     # removeWords(), stopwords(), VCorpus(), tm_map(), meta()
library(dplyr)  # %>% and select()

fb = data.frame(RecordContent = c("I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
                                  "Photo fails of this morning. Really Joe?",
                                  "This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"),
                FromRecordId = c(682245468452447, 737891849554475, 453178808037464),
                stringsAsFactors = F)

Remove punctuation & make lower case

fb$RC = tolower(gsub("[[:punct:]]", "", fb$RecordContent)) 
fb$RC2 = removeWords(fb$RC, stopwords("english"))

Step 1: Create special reader function to retain record IDs

myReader = readTabular(mapping=list(content="RC2", id="FromRecordId"))

Step 2: Make my corpus. Read in the data using DataframeSource and the custom reader function where each FB post is a "document"

corpus.test = VCorpus(DataframeSource(fb), readerControl = list(reader = myReader))

Step 3: Clean and stem

corpus.test2 = corpus.test %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument, language = "english") %>%
  as.VCorpus()

Step 4: Make the corpus back into a character vector. The row names are now the IDs

fb2 = data.frame(unlist(sapply(corpus.test2, `[`, "content")), stringsAsFactors = F)

Step 5: Make a new ID variable for the later merge, rename the variables, and prep for merging back onto the original dataset

fb2$ID = row.names(fb2)
fb2$RC.ID = gsub(".content", "", fb2$ID, fixed = TRUE)  # fixed = TRUE: "." is a regex metacharacter
colnames(fb2)[1] = "RC.stem"
fb3 = select(fb2, RC.ID, RC.stem)
row.names(fb3) = NULL
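
The merge back onto the original data would then be something like this (a sketch: fb.final is just an illustrative name, and the ids are coerced to character so the numeric FromRecordId matches the character RC.ID):

fb$FromRecordId = as.character(fb$FromRecordId)
fb.final = merge(fb, fb3, by.x = "FromRecordId", by.y = "RC.ID")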

Solution

  • I think the ids are stored and retained by default by the tm package. You can fetch them all (in a vectorized manner) with

    meta(corpus.test, "id")

    $`682245468452447`
    [1] "682245468452447"
    
    $`737891849554475`
    [1] "737891849554475"
    
    $`453178808037464`
    [1] "453178808037464"
    

    I'd recommend reading the documentation of the tm::meta() function, though it's not very good.
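
    Putting that together, your Steps 4 and 5 can be collapsed into one data.frame() call. A minimal sketch, assuming the content() accessor that tm provides for plain text documents, and reusing your RC.ID/RC.stem names:

    fb2 = data.frame(RC.ID   = unlist(meta(corpus.test2, "id")),
                     RC.stem = sapply(corpus.test2, content),
                     stringsAsFactors = FALSE)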

    You can also add arbitrary metadata (as key-value pairs) to each collection item in the corpus, as well as collection-level metadata.
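
    For example (the "platform" and "comment" tags and their values below are hypothetical, just to illustrate the API):

    # Document-level metadata, one value per document:
    meta(corpus.test, "platform") <- rep("facebook", length(corpus.test))

    # Metadata on a single document:
    meta(corpus.test[[1]], "comment") <- "first post"

    # Collection-level metadata:
    meta(corpus.test, tag = "description", type = "corpus") <- "FB post corpus"

    One caveat if you upgrade: in newer versions of tm (0.7 and later), readTabular() is no longer available. There, DataframeSource() expects a data frame whose first two columns are named doc_id and text, and the document ids are retained automatically:

    corpus.test = VCorpus(DataframeSource(
        data.frame(doc_id = as.character(fb$FromRecordId),
                   text = fb$RC2, stringsAsFactors = FALSE)))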