Tags: r, text-mining, feature-selection, quanteda, fselector

Feature selection in a document-feature matrix using a chi-squared test


I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test. I know that many people have already asked this question, but I couldn't find the relevant code. (The answers only gave a brief outline of the concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)

I learned that I could use chi.squared in the FSelector package, but I don't know how to apply this function to a dfm class object (trainingtfidf below). (According to the manual, it is applied to predictor variables.)

Could anyone give me a hint? I appreciate it!

Example code:

description <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",  "M6 is 13 days out of the visit window")
code <- c(4,3,6)
example <- data.frame(description, code)

library(quanteda)
trainingcorpus <- corpus(example$description)

trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem = TRUE, toLower = TRUE,
                   removePunct = TRUE, removeSeparators = TRUE, language = "english",
                   ignoredFeatures = stopwords("english"), removeNumbers = TRUE,
                   ngrams = 2)

# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize=TRUE)

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

Solution

  • Here's a general method for computing Chi-squared values for features. It requires that you have some variable against which to form the associations, which here could be some classification variable you are using for training your classifier.

    Note that I am showing how to do this in the quanteda package, but the approach should be general enough to work for matrix objects from other text packages. Here, I am using data from the auxiliary quanteda.corpora package, which contains all of the State of the Union addresses of US presidents.

    data(data_corpus_sotu, package = "quanteda.corpora")
    table(docvars(data_corpus_sotu, "party"))
    ## Democratic Democratic-Republican            Federalist           Independent 
    ##         90                    28                     4                     8 
    ## Republican                  Whig 
    ##         9                     8 
    sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))
    
    # make the document-feature matrix for just Reps and Dems
    sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))
    
    # compute chi-squared values for each feature:
    # each feature's per-document counts are cross-tabulated against party
    chi2vals <- apply(sotuDfm, 2, function(x) { 
        chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
    })
    
    head(sort(chi2vals, decreasing = TRUE), 10)
    ## government       will     united     states       year     public   congress       upon 
    ##   85.19783   74.55845   68.62642   66.57434   64.30859   63.19322   59.49949   57.83603 
    ##        war     people 
    ##   57.43142   57.38697 
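
To make the computation easier to inspect, here is a minimal base-R sketch of the same idea on made-up data. It is not the code from the answer above: it assumes a plain count matrix (`counts`, hypothetical) and a two-level class factor (`party_toy`, hypothetical), reduces each feature to presence/absence, and tests the resulting 2 x 2 table without continuity correction. That is one common variant of chi-squared feature scoring, offered here only as an illustration.

```r
# Toy data (made up for illustration): 8 documents x 3 features.
counts <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    3, 1, 1,
    1, 0, 0,
    0, 2, 0,
    0, 1, 1,
    0, 3, 1,
    0, 0, 0),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("doc", 1:8), c("tax", "health", "year"))
)
# Hypothetical class labels, one per document.
party_toy <- factor(rep(c("A", "B"), each = 4))

# For each feature, cross-tabulate presence/absence against the class
# and keep the chi-squared statistic of that 2 x 2 table.
chi2_toy <- apply(counts > 0, 2, function(present) {
  tab <- table(factor(present, levels = c(FALSE, TRUE)), party_toy)
  # Small toy counts trigger an approximation warning; silence it here.
  suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
})

# Features ordered from most to least associated with the class.
sort(chi2_toy, decreasing = TRUE)
```

In this sketch a feature that is present in every document, or absent from every document, yields `NaN`, so such columns would need to be dropped before ranking.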
    

    The top-scoring features can now be selected using the dfm_select() function. (Note that column indexing by name would also work.)

    # select just 100 top Chi^2 vals from dfm
    dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
    ## kept 100 features, from 100 supplied (glob) feature types
    
    head(dfmTop100cs)
    ## Document-feature matrix of: 182 documents, 100 features.
    ## (showing first 6 documents and first 6 features)
    ##               features
    ## docs           citizens government upon duties constitution present
    ##   Jackson-1830       14         68   67     12           17      23
    ##   Jackson-1831       21         26   13      7            5      22
    ##   Jackson-1832       17         36   23     11           11      18
    ##   Jackson-1829       17         58   37     16            7      17
    ##   Jackson-1833       14         43   27     18            1      17
    ##   Jackson-1834       24         74   67     11           11      29
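
The remark about column indexing by name can be sketched in base R as follows. This assumes an ordinary matrix (`m`, hypothetical) and a named score vector (`scores`, hypothetical) rather than a dfm, and all values are made up; it only illustrates the top-k selection step.

```r
# Hypothetical scores, named by feature (NA marks a feature with no valid test).
scores <- c(tax = 8, health = 2, year = 0, misc = NA)
# Toy count matrix with one column per scored feature (made-up data).
m <- matrix(1:12, nrow = 3,
            dimnames = list(paste0("doc", 1:3), names(scores)))

k <- 2
# sort() drops NA scores by default, so "misc" can never be selected.
top_features <- head(names(sort(scores, decreasing = TRUE)), k)
# drop = FALSE keeps a matrix even when k is 1.
m_top <- m[, top_features, drop = FALSE]
colnames(m_top)
```

Unlike pattern-based selection, indexing by name returns the columns in ranked order rather than in their original order.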
    

    Added: With quanteda >= v0.9.9, this can be done using the textstat_keyness() function.

    # convert to character to avoid empty factor levels
    docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))
    
    # make the document-feature matrix for just Reps and Dems
    sotuDfm <- data_corpus_sotu %>%
        corpus_subset(party %in% c("Democratic", "Republican")) %>%
        dfm(remove = stopwords("english"))
    
    chi2vals <- dfm_group(sotuDfm, "party") %>%
        textstat_keyness(measure = "chi2")
    head(chi2vals)
    #   feature     chi2 p n_target n_reference
    # 1       - 221.6249 0     2418        1645
    # 2  mexico 181.0586 0      505         182
    # 3    bank 164.9412 0      283          60
    # 4       " 148.6333 0     1265         800
    # 5 million 132.3267 0      366         131
    # 6   texas 101.1991 0      174          37
    

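    This information can then be used to select the most discriminating features. The chi^2 score returned by textstat_keyness() is signed (negative when a feature occurs less often in the target group than expected), so the sign has to be removed before ranking.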

    # remove sign
    chi2vals$chi2 <- abs(chi2vals$chi2)
    # sort
    chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
    head(chi2vals)
    #          feature     chi2 p n_target n_reference
    # 1              - 221.6249 0     2418        1645
    # 29044 commission 190.3010 0      175         588
    # 2         mexico 181.0586 0      505         182
    # 3           bank 164.9412 0      283          60
    # 4              " 148.6333 0     1265         800
    # 29043        law 137.8330 0      607        1178
    
    
    # keep only the 100 highest-scoring features
    dfmTop100cs <- dfm_select(sotuDfm, head(chi2vals$feature, 100))
    ## kept 100 features, from 100 supplied (glob) feature types
    
    head(dfmTop100cs, nf = 6)
    ## Document-feature matrix of: 6 documents, 6 features (0% sparse).
    ## 6 x 6 sparse Matrix of class "dfm"
    ##               features
    ## docs           fellow citizens senate house representatives :
    ##   Jackson-1829      5       17      2     3               5 1
    ##   Jackson-1830      6       14      4     6               9 3
    ##   Jackson-1831      9       21      3     1               4 1
    ##   Jackson-1832      6       17      4     1               2 1
    ##   Jackson-1833      2       14      7     4               6 1
    ##   Jackson-1834      3       24      5     1               3 5
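
To illustrate why the sign is removed before sorting, here is a rough base-R sketch of a signed chi-squared score for a single feature. The helper `signed_chi2()` and all counts are hypothetical, and this is not claimed to reproduce the exact textstat_keyness() computation (for instance, no continuity correction is applied here). It only shows the idea that the score is negative when the feature is under-represented in the target group.

```r
# Hypothetical helper: chi-squared statistic of the 2 x 2 table
# (feature vs. all other tokens) x (target vs. reference), with a sign
# showing whether the feature is over- or under-represented in the target.
signed_chi2 <- function(n_target, n_reference, total_target, total_reference) {
  tab <- matrix(c(n_target, n_reference,
                  total_target - n_target, total_reference - n_reference),
                nrow = 2, byrow = TRUE)
  test <- suppressWarnings(chisq.test(tab, correct = FALSE))
  # Negative when the observed target count is below its expected value.
  direction <- if (tab[1, 1] < test$expected[1, 1]) -1 else 1
  unname(direction * test$statistic)
}

# Made-up counts: the same imbalance, once in each direction.
over  <- signed_chi2(30, 10, total_target = 1000, total_reference = 1000)
under <- signed_chi2(10, 30, total_target = 1000, total_reference = 1000)

# Both features discriminate equally well, so rank by absolute value.
abs(c(over = over, under = under))
```

Ranking on the raw signed score would place strongly under-represented features last, even though they separate the two groups just as well.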