Tags: r, nlp, quanteda

Measuring co-occurrence patterns in media articles over time with Quanteda


I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using Quanteda and have written several R functions to run on each group of articles. My workflow is:

  1. Group the articles by quarter (a sketch of this step follows the list below).
  2. Produce a feature co-occurrence matrix (FCM) for the articles in each quarter (Function 1).
  3. Take the column from this matrix for the 'term' I am interested in and convert it to a data.frame (Function 2).
  4. Merge the data.frames for each quarter together, then produce a large CSV file with a column for each quarter and a row for each co-occurring term.
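
For step 1, a minimal sketch of the kind of grouping I mean, assuming the raw articles sit in a data frame articles with text and date columns (these names are placeholders, not my real data):

library("lubridate")

# Assumed input: a data frame `articles` with a character `text` column and a Date `date` column
articles$quarter <- paste0(year(articles$date), "q", quarter(articles$date))

# One data frame per quarter, e.g. data_14q4, data_15q1, ...
data_by_quarter <- split(articles, articles$quarter)
data_14q4 <- data_by_quarter[["2014q4"]]
data_15q1 <- data_by_quarter[["2015q1"]]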

This seems to work okay, but I wondered if anybody more skilled in R might be able to check that what I am doing is correct, or suggest a more efficient way of doing it?

Thanks for any help!

# Function 1 to produce the feature co-occurrence matrix (FCM)

library("quanteda")

get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")   # Chinese stopword list
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>% tokens_remove(ch_stop)
  fcm(toks, context = "window", window = 1, tri = FALSE)
}

fcm_14q4 <- get_fcm(data_14q4)
fcm_15q1 <- get_fcm(data_15q1)

# Function 2 to select the column for the 'term' of interest (such as China 中国) and make a data.frame

convert2df <- function(mat, term) {
  mat_term <- mat[, term]                      # keep only the column for the term of interest
  df <- convert(mat_term, to = "data.frame")
  colnames(df)[1] <- "Term"
  colnames(df)[2] <- "Freq"
  df[order(-df$Freq), ]                        # sort by descending co-occurrence count
}

CH14q4 <- convert2df(fcm_14q4, "中国")
CH15q1 <- convert2df(fcm_15q1, "中国")

# Merging the data.frames

df <- merge(x = CH14q4, y = CH15q1, by = "Term", all = TRUE)
df <- merge(x = df, y = CH15q2, by = "Term", all = TRUE) # etc. for all the data.frames...
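
For merging many quarters at once, a single Reduce() call over a list of the quarterly data.frames should give the same result as chaining merge() (a sketch, assuming the quarterly data.frames are all in the workspace):

# Collect the quarterly data.frames in a list, in chronological order
quarterly <- list(CH14q4, CH15q1, CH15q2)

# Full outer join of all quarters on "Term" in one call
df_all <- Reduce(function(x, y) merge(x, y, by = "Term", all = TRUE), quarterly)

write.csv(df_all, "cooccurrence_by_quarter.csv", row.names = FALSE)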

UPDATE: Following Ken's advice in the comments below, I have tried doing it a different way, using the window argument of tokens_select() and then a document-feature matrix. After labelling the corpus documents according to their quarter, the following R function should take the tokenized corpus toks and produce a data.frame of the number of times words co-occur within a specified window of a term.

COOCdfm <- function(toks, term, window) {
  ch_stop <- stopwords("zh", source = "misc")
  # keep only tokens within `window` tokens of the term of interest
  cooc_toks <- tokens_select(toks, term, window = window)
  cooc_toks <- tokens(cooc_toks, remove_punct = TRUE)
  cooc_toks <- tokens_remove(cooc_toks, ch_stop)
  dfmat <- dfm(cooc_toks)
  # sum the counts within each quarter, using the "quarter" document variable
  dfmat_grouped <- dfm_group(dfmat, groups = docvars(dfmat, "quarter"))
  counts <- convert(t(dfmat_grouped), to = "data.frame")
  colnames(counts)[1] <- "Feature"
  return(counts)
}
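
A usage sketch (assuming the articles data frame from the grouping step above, so that corpus() picks up its quarter column as a docvar):

# corpus() keeps the extra data frame columns (including "quarter") as docvars
corp <- corpus(articles, text_field = "text")
toks <- tokens(corp)

# Words co-occurring within 5 tokens of 中国, with one column of counts per quarter
counts_by_quarter <- COOCdfm(toks, "中国", window = 5)
head(counts_by_quarter)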

Solution

  • If you are interested in counting co-occurrences within a window for specific target terms, a better way is to use the window argument of tokens_select(), and then to count occurrences from a dfm on the window-selected tokens.

    library("quanteda")
    ## Package version: 3.0
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    toks <- tokens(data_corpus_inaugural)
    
    dfmat <- toks %>%
      tokens_select("nuclear", window = 5) %>%
      tokens(remove_punct = TRUE) %>%
      tokens_remove(stopwords("en")) %>%
      dfm()
    
    topfeatures(dfmat)[-1]
    ##     weapons      threat        work       earth elimination         day 
    ##           6           3           2           2           2           1 
    ##         one        free       world 
    ##           1           1           1
    

    Here I've first done a "conservative" tokenisation to keep everything, then performed the context selection. I then processed that further to remove punctuation and stopwords before tabulating the results in a dfm. This will be large and very sparse, but you can summarise the top co-occurring words using topfeatures() or quanteda.textstats::textstat_frequency().
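
    For per-period tables like the quarterly counts you want, textstat_frequency() also takes a groups argument (a sketch; the grouping variable would be whatever docvar marks the quarter in your corpus):

    library("quanteda.textstats")

    # overall top co-occurring features
    textstat_frequency(dfmat, n = 10)

    # or one frequency table per group, e.g. per quarter for the newspaper corpus
    # textstat_frequency(dfmat, n = 10, groups = docvars(dfmat, "quarter"))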