Search code examples
rquanteda

how to combine/collapse rows with similar words and sum up their values?


I did text mining using Quanteda package. Using kwic feature, I generated this output that identifies keywords in my dictionary and identifies which keywords matched within each key. The data looks like this:

docname     keyword        my_dict          
   <chr>       <chr>          <chr>            
 1 avan-21.pdf sustainable    transition       
 2 avan-21.pdf electricity    low_carbon_energy
 3 avan-21.pdf electricity    low_carbon_energy
 4 avan-21.pdf renewable      low_carbon_energy
 5 avan-21.pdf electricity    low_carbon_energy
 6 avan-21.pdf wind           low_carbon_energy
 7 avan-21.pdf wind           low_carbon_energy
 8 avan-21.pdf solar          low_carbon_energy
 9 avan-21.pdf emissions      emissions        
10 avan-21.pdf emissions-free emissions        
11 avan-21.pdf sustainable    transition       
12 avan-21.pdf renewable      low_carbon_energy
13 avan-21.pdf wind           low_carbon_energy
14 avan-21.pdf solar          low_carbon_energy
15 avan-21.pdf biomass        low_carbon_energy
16 avan-21.pdf sustainability transition       
17 avan-21.pdf sustainability transition       
18 avan-21.pdf sustainability transition       
19 avan-21.pdf sustainability transition       
20 avan-21.pdf sustainability transition  

I filtered this data by dictionary keys (my_dict) to create sub-categories like this:

climate_change <- kwic2filter %>% 
  filter(my_dict == "climate_change") %>%
  select(docname, keyword) %>% 
  group_by(docname, keyword) %>% 
  count(keyword, sort = TRUE) %>% 
  arrange(keyword, desc(n))%>%
  write.csv("energy-output/climate_change.csv")

Results look like this:

 X     docname          keyword   n
1   1 enel-22.pdf              1.5  97
2   2 enel-21.pdf              1.5  66
3   3  nrg-21.pdf              1.5   7
4   4  nrg-22.pdf              1.5   4
5   5  nee-21.pdf              1.5   2
6   6 avan-22.pdf              1.5   1
7   7  nee-22.pdf              1.5   1
8   8  nrg-21.pdf       1.5 degree   2
9   9  nrg-22.pdf       1.5 degree   2
10 10  nrg-21.pdf      1.5 degrees   3
11 11  nee-21.pdf      1.5 degrees   1
12 12  nee-22.pdf      1.5 degrees   1
13 13  nee-21.pdf       1.5-degree   1
14 14 enel-22.pdf             1.50   1
15 15 enel-22.pdf          1.52030   1
16 16 enel-22.pdf          1.52040   1
17 17 enel-22.pdf             1.53   1
18 18 enel-21.pdf            1.558   1
19 19  nee-21.pdf             1.56   2
20 20 enel-22.pdf             1.56   1
21 21 enel-21.pdf            1.565   1
22 22 enel-22.pdf             1.58   1
23 23 enel-21.pdf            1.580   1
24 24 enel-22.pdf            1.5is   1
25 25 enel-21.pdf          CLIMATE   1
26 26 enel-22.pdf          CLIMATE   1
27 27  nrg-21.pdf          CLIMATE   1
28 28 enel-22.pdf             IPCC  14
29 29 avan-21.pdf             IPCC   8
30 30 avan-22.pdf             IPCC   8
31 31 enel-21.pdf             IPCC   8
32 32  nee-21.pdf             IPCC   2
33 33 enel-22.pdf           UNFCCC   2
34 34 enel-22.pdf          climate 553
35 35 enel-21.pdf          climate 421
36 36  nee-21.pdf          climate 128
37 37  nee-22.pdf          climate 111
38 38  nrg-22.pdf          climate  54
39 39 avan-22.pdf          climate  49
40 40  nrg-21.pdf          climate  45
41 41 avan-21.pdf          climate  29
42 42  nee-21.pdf         climate-   2
43 43  nrg-21.pdf         climate-   2
44 44 enel-21.pdf         climate-   1
45 45  nee-22.pdf         climate-   1
46 46  nrg-22.pdf         climate-   1
47 47 enel-22.pdf    climate-aware   1
48 48  nee-21.pdf   climate-change   3
49 49  nee-22.pdf   climate-change   2
50 50 enel-22.pdf climate-changing   3

I'd like to arrange this data by docname and combine values for keywords 1.5, 1.5 degree, and 1.5 degrees and remove/ exclude other rows that have 1.55, 1.56 and other numbers. Similarly, I'd like to combine all rows and their values that begin with climate or climate-. Using tidyr::pivot_wider or some other function, I'd like the final data to look like this:

keyword avan-21.pdf avan-22.pdf enel-21.pdf enel-22.pdf nee-21.pdf
1.5 0       0       66      99      2
climate 53      71      669     967     1

My ultimate aim is to calculate term frequency for each dictionary category.


Solution

  • I'm not sure about your actual expected values (I can't find 53, for instance), but perhaps this will give you enough to finesse the last few steps.

    library(dplyr)
    library(tidyr) # pivot_wider
    climate_change %>%
      mutate(keyword = sub("[- ]+degrees?$", "", keyword)) %>%
      filter(keyword %in% c("1.5", "climate")) %>%
      summarize(n = sum(n), .by = c(docname, keyword)) %>%
      pivot_wider(id_cols = keyword, names_from = docname, values_from = n)
    # # A tibble: 2 × 9
    #   keyword `enel-22.pdf` `enel-21.pdf` `nrg-21.pdf` `nrg-22.pdf` `nee-21.pdf` `avan-22.pdf` `nee-22.pdf` `avan-21.pdf`
    #   <chr>           <int>         <int>        <int>        <int>        <int>         <int>        <int>         <int>
    # 1 1.5                97            66           12            6            4             1            2            NA
    # 2 climate           553           421           45           54          128            49          111            29
    

    If you want to combine all keywords that start with climate including climate-changing, then adjust the sub inside the mutate to the follow (remaining lines unchanged):

    climate_change %>%
      mutate(keyword = sub("([- ]+degrees?|-[a-z]*)$", "", keyword)) %>%
      filter(keyword %in% c("1.5", "climate")) %>%
      summarize(n = sum(n), .by = c(docname, keyword)) %>%
      pivot_wider(id_cols = keyword, names_from = docname, values_from = n)
    # # A tibble: 2 × 9
    #   keyword `enel-22.pdf` `enel-21.pdf` `nrg-21.pdf` `nrg-22.pdf` `nee-21.pdf` `avan-22.pdf` `nee-22.pdf` `avan-21.pdf`
    #   <chr>           <int>         <int>        <int>        <int>        <int>         <int>        <int>         <int>
    # 1 1.5                97            66           12            6            4             1            2            NA
    # 2 climate           557           422           47           55          133            49          114            29
    

    Data

    climate_change <- structure(list(X = 1:50, docname = c("enel-22.pdf", "enel-21.pdf", "nrg-21.pdf", "nrg-22.pdf", "nee-21.pdf", "avan-22.pdf", "nee-22.pdf", "nrg-21.pdf", "nrg-22.pdf", "nrg-21.pdf", "nee-21.pdf", "nee-22.pdf", "nee-21.pdf", "enel-22.pdf", "enel-22.pdf", "enel-22.pdf", "enel-22.pdf", "enel-21.pdf", "nee-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "nrg-21.pdf", "enel-22.pdf", "avan-21.pdf", "avan-22.pdf", "enel-21.pdf", "nee-21.pdf",  "enel-22.pdf", "enel-22.pdf", "enel-21.pdf", "nee-21.pdf", "nee-22.pdf", "nrg-22.pdf", "avan-22.pdf", "nrg-21.pdf", "avan-21.pdf", "nee-21.pdf", "nrg-21.pdf", "enel-21.pdf", "nee-22.pdf", "nrg-22.pdf", "enel-22.pdf", "nee-21.pdf", "nee-22.pdf", "enel-22.pdf"), keyword = c("1.5", "1.5", "1.5", "1.5", "1.5", "1.5", "1.5", "1.5 degree", "1.5 degree", "1.5 degrees", "1.5 degrees", "1.5 degrees", "1.5-degree", "1.50", "1.52030", "1.52040", "1.53", "1.558", "1.56", "1.56", "1.565", "1.58", "1.580", "1.5is",  "CLIMATE", "CLIMATE", "CLIMATE", "IPCC", "IPCC", "IPCC", "IPCC", "IPCC", "UNFCCC", "climate", "climate", "climate", "climate", "climate", "climate", "climate", "climate", "climate-", "climate-", "climate-", "climate-", "climate-", "climate-aware", "climate-change", "climate-change", "climate-changing"), n = c(97L, 66L, 7L, 4L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 14L, 8L, 8L, 8L, 2L, 2L, 553L, 421L, 128L, 111L, 54L, 49L, 45L, 29L, 2L, 2L, 1L,  1L, 1L, 1L, 3L, 2L, 3L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50"))