I did text mining using Quanteda package. Using kwic feature, I generated this output that identifies keyword
s in my dictionary and identifies which keywords matched within each key. The data looks like this:
docname keyword my_dict
<chr> <chr> <chr>
1 avan-21.pdf sustainable transition
2 avan-21.pdf electricity low_carbon_energy
3 avan-21.pdf electricity low_carbon_energy
4 avan-21.pdf renewable low_carbon_energy
5 avan-21.pdf electricity low_carbon_energy
6 avan-21.pdf wind low_carbon_energy
7 avan-21.pdf wind low_carbon_energy
8 avan-21.pdf solar low_carbon_energy
9 avan-21.pdf emissions emissions
10 avan-21.pdf emissions-free emissions
11 avan-21.pdf sustainable transition
12 avan-21.pdf renewable low_carbon_energy
13 avan-21.pdf wind low_carbon_energy
14 avan-21.pdf solar low_carbon_energy
15 avan-21.pdf biomass low_carbon_energy
16 avan-21.pdf sustainability transition
17 avan-21.pdf sustainability transition
18 avan-21.pdf sustainability transition
19 avan-21.pdf sustainability transition
20 avan-21.pdf sustainability transition
I filtered this data by dictionary keys (my_dict) to create sub-categories like this:
climate_change <- kwic2filter %>%
filter(my_dict == "climate_change") %>%
select(docname, keyword) %>%
group_by(docname, keyword) %>%
count(keyword, sort = TRUE) %>%
arrange(keyword, desc(n))%>%
write.csv("energy-output/climate_change.csv")
Results look like this:
X docname keyword n
1 1 enel-22.pdf 1.5 97
2 2 enel-21.pdf 1.5 66
3 3 nrg-21.pdf 1.5 7
4 4 nrg-22.pdf 1.5 4
5 5 nee-21.pdf 1.5 2
6 6 avan-22.pdf 1.5 1
7 7 nee-22.pdf 1.5 1
8 8 nrg-21.pdf 1.5 degree 2
9 9 nrg-22.pdf 1.5 degree 2
10 10 nrg-21.pdf 1.5 degrees 3
11 11 nee-21.pdf 1.5 degrees 1
12 12 nee-22.pdf 1.5 degrees 1
13 13 nee-21.pdf 1.5-degree 1
14 14 enel-22.pdf 1.50 1
15 15 enel-22.pdf 1.52030 1
16 16 enel-22.pdf 1.52040 1
17 17 enel-22.pdf 1.53 1
18 18 enel-21.pdf 1.558 1
19 19 nee-21.pdf 1.56 2
20 20 enel-22.pdf 1.56 1
21 21 enel-21.pdf 1.565 1
22 22 enel-22.pdf 1.58 1
23 23 enel-21.pdf 1.580 1
24 24 enel-22.pdf 1.5is 1
25 25 enel-21.pdf CLIMATE 1
26 26 enel-22.pdf CLIMATE 1
27 27 nrg-21.pdf CLIMATE 1
28 28 enel-22.pdf IPCC 14
29 29 avan-21.pdf IPCC 8
30 30 avan-22.pdf IPCC 8
31 31 enel-21.pdf IPCC 8
32 32 nee-21.pdf IPCC 2
33 33 enel-22.pdf UNFCCC 2
34 34 enel-22.pdf climate 553
35 35 enel-21.pdf climate 421
36 36 nee-21.pdf climate 128
37 37 nee-22.pdf climate 111
38 38 nrg-22.pdf climate 54
39 39 avan-22.pdf climate 49
40 40 nrg-21.pdf climate 45
41 41 avan-21.pdf climate 29
42 42 nee-21.pdf climate- 2
43 43 nrg-21.pdf climate- 2
44 44 enel-21.pdf climate- 1
45 45 nee-22.pdf climate- 1
46 46 nrg-22.pdf climate- 1
47 47 enel-22.pdf climate-aware 1
48 48 nee-21.pdf climate-change 3
49 49 nee-22.pdf climate-change 2
50 50 enel-22.pdf climate-changing 3
I'd like to arrange this data by docname
and combine values for keyword
s 1.5, 1.5 degree, and 1.5 degrees and remove/ exclude other rows that have 1.55, 1.56 and other numbers. Similarly, I'd like to combine all rows and their values that begin with climate
or climate-
. Using tidyr::pivot_wider
or some other function, I'd like the final data to look like this:
keyword avan-21.pdf avan-22.pdf enel-21.pdf enel-22.pdf nee-21.pdf
1.5 0 0 66 99 2
climate 53 71 669 967 1
My ultimate aim is to calculate term frequency for each dictionary category.
I'm not sure about your actual expected values (I can't find 53
, for instance), but perhaps this will give you enough to finesse the last few steps.
library(dplyr)
library(tidyr) # pivot_wider
climate_change %>%
mutate(keyword = sub("[- ]+degrees?$", "", keyword)) %>%
filter(keyword %in% c("1.5", "climate")) %>%
summarize(n = sum(n), .by = c(docname, keyword)) %>%
pivot_wider(id_cols = keyword, names_from = docname, values_from = n)
# # A tibble: 2 × 9
# keyword `enel-22.pdf` `enel-21.pdf` `nrg-21.pdf` `nrg-22.pdf` `nee-21.pdf` `avan-22.pdf` `nee-22.pdf` `avan-21.pdf`
# <chr> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1.5 97 66 12 6 4 1 2 NA
# 2 climate 553 421 45 54 128 49 111 29
If you want to combine all keywords that start with climate
including climate-changing
, then adjust the sub
inside the mutate to the follow (remaining lines unchanged):
climate_change %>%
mutate(keyword = sub("([- ]+degrees?|-[a-z]*)$", "", keyword)) %>%
filter(keyword %in% c("1.5", "climate")) %>%
summarize(n = sum(n), .by = c(docname, keyword)) %>%
pivot_wider(id_cols = keyword, names_from = docname, values_from = n)
# # A tibble: 2 × 9
# keyword `enel-22.pdf` `enel-21.pdf` `nrg-21.pdf` `nrg-22.pdf` `nee-21.pdf` `avan-22.pdf` `nee-22.pdf` `avan-21.pdf`
# <chr> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1.5 97 66 12 6 4 1 2 NA
# 2 climate 557 422 47 55 133 49 114 29
Data
climate_change <- structure(list(X = 1:50, docname = c("enel-22.pdf", "enel-21.pdf", "nrg-21.pdf", "nrg-22.pdf", "nee-21.pdf", "avan-22.pdf", "nee-22.pdf", "nrg-21.pdf", "nrg-22.pdf", "nrg-21.pdf", "nee-21.pdf", "nee-22.pdf", "nee-21.pdf", "enel-22.pdf", "enel-22.pdf", "enel-22.pdf", "enel-22.pdf", "enel-21.pdf", "nee-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "nrg-21.pdf", "enel-22.pdf", "avan-21.pdf", "avan-22.pdf", "enel-21.pdf", "nee-21.pdf", "enel-22.pdf", "enel-22.pdf", "enel-21.pdf", "nee-21.pdf", "nee-22.pdf", "nrg-22.pdf", "avan-22.pdf", "nrg-21.pdf", "avan-21.pdf", "nee-21.pdf", "nrg-21.pdf", "enel-21.pdf", "nee-22.pdf", "nrg-22.pdf", "enel-22.pdf", "nee-21.pdf", "nee-22.pdf", "enel-22.pdf"), keyword = c("1.5", "1.5", "1.5", "1.5", "1.5", "1.5", "1.5", "1.5 degree", "1.5 degree", "1.5 degrees", "1.5 degrees", "1.5 degrees", "1.5-degree", "1.50", "1.52030", "1.52040", "1.53", "1.558", "1.56", "1.56", "1.565", "1.58", "1.580", "1.5is", "CLIMATE", "CLIMATE", "CLIMATE", "IPCC", "IPCC", "IPCC", "IPCC", "IPCC", "UNFCCC", "climate", "climate", "climate", "climate", "climate", "climate", "climate", "climate", "climate-", "climate-", "climate-", "climate-", "climate-", "climate-aware", "climate-change", "climate-change", "climate-changing"), n = c(97L, 66L, 7L, 4L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 14L, 8L, 8L, 8L, 2L, 2L, 553L, 421L, 128L, 111L, 54L, 49L, 45L, 29L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 2L, 3L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50"))