
Get the percentage of documents that contain a feature - quanteda


I'm trying to find out what percentage of documents contain a feature using quanteda. I know dfm_weight() is available, but I believe its 'prop' scheme gives a feature's relative frequency within each document, not the share of documents that contain it.

My goal is to avoid the ifelse() statement with hard-coded document counts and keep everything in quanteda, but I'm not sure this is possible. The output I'm looking for is a side-by-side bar chart grouped by year, with features along the y-axis and the % of documents containing each feature along the x-axis. The interpretation would then be "In 20% of all comments in 2018, people mention the word X, compared to 24% in 2019."

library(quanteda)
library(reshape2)
library(dplyr)

df$rownum = 1:nrow(df) # unique ID
# pass df once: piping df into corpus(df, ...) would supply it twice
dfCorp19 = corpus(df, text_field = 'WhatPromptedYourSearch', docid_field = 'rownum')

x = dfm(dfCorp19,
        remove=c(stopwords(), toRemove),
        remove_numbers = TRUE,
        remove_punct = TRUE) %>%
    textstat_frequency(groups ='year') 

x = x %>% group_by(group) %>% mutate(prop = ifelse(group=='2019', docfreq/802, docfreq/930))
x = dcast(x,feature ~ group, value.var='prop')

Solution

  • Here's an attempt using some demo data, where the group is decade.

    library("quanteda")
    #> Package version: 1.5.1
    
    docvars(data_corpus_inaugural, "decade") <-
        floor(docvars(data_corpus_inaugural, "Year") / 10) * 10
    
    dfmat <- dfm(corpus_subset(data_corpus_inaugural, decade >= 1970))
    
    target_word <- "nuclear"
    

    Now we can build a data.frame for the target feature. Note the rowSums() call: any slice of a dfm is still a dfm (not a vector), so rowSums() is needed to turn the single-column slice into one count per document.

    df <- data.frame(docname = docnames(dfmat),
                     decade = docvars(dfmat, c("decade")),
                     contains_target = rowSums(dfmat[, target_word]) > 0,
                     row.names = NULL)
    df
    #>         docname decade contains_target
    #> 1    1973-Nixon   1970            TRUE
    #> 2   1977-Carter   1970            TRUE
    #> 3   1981-Reagan   1980           FALSE
    #> 4   1985-Reagan   1980            TRUE
    #> 5     1989-Bush   1980           FALSE
    #> 6  1993-Clinton   1990           FALSE
    #> 7  1997-Clinton   1990            TRUE
    #> 8     2001-Bush   2000           FALSE
    #> 9     2005-Bush   2000           FALSE
    #> 10   2009-Obama   2000            TRUE
    #> 11   2013-Obama   2010           FALSE
    #> 12   2017-Trump   2010           FALSE
    

    With that, it's a simple matter to summarize proportions and plot them using some dplyr and ggplot2.

    library("dplyr")
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    df2 <- df %>%
        group_by(decade) %>%
        # share of documents in each decade that contain the target word
        summarise(freq = mean(contains_target))
    
    library("ggplot2")
    g <- ggplot(df2, aes(y = freq, x = decade)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        xlab("") + ylab("Proportion of documents containing target word")
    g
    

    Created on 2019-10-21 by the reprex package (v0.3.0)
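
The answer above handles a single target word. For the original goal (every feature, by year, without hard-coded document counts), one possible approach is to weight the dfm with the "boolean" scheme so each cell records presence or absence, sum those cells within each group, and divide by the group sizes. The following is only a minimal sketch under assumptions: the `toy` data frame, its `text` and `year` columns, and the object names are hypothetical stand-ins for the survey data, and the grouping is done with base R's rowsum() rather than a quanteda grouping function.

```r
library("quanteda")

# Hypothetical toy data standing in for the survey comments
toy <- data.frame(
    text = c("price was too high", "the price and the service",
             "great service", "price price price", "slow delivery"),
    year = c(2018, 2018, 2018, 2019, 2019),
    stringsAsFactors = FALSE
)

corp <- corpus(toy, text_field = "text")
dfmat <- dfm(tokens(corp))

# 1 if a feature occurs in a document at all, 0 otherwise
dfmat_bool <- dfm_weight(dfmat, scheme = "boolean")

# Number of documents containing each feature, summed within year
year <- as.character(docvars(dfmat_bool, "year"))
doc_counts <- rowsum(as.matrix(dfmat_bool), group = year)

# Divide each year's row by the number of documents in that year
n_docs <- table(year)
doc_prop <- doc_counts / as.vector(n_docs[rownames(doc_counts)])

# "price" appears in 2 of 3 comments from 2018 and 1 of 2 from 2019
doc_prop[, c("price", "service")]
```

The result is a year-by-feature matrix of document proportions. Transposing it and reshaping to long format (for example with reshape2::melt()) would give one row per feature and year, which suits a side-by-side bar chart. Converting to a dense matrix is fine for a few hundred comments but may be costly for a large dfm.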