I need a function that finds words within a certain 'word distance' of each other. For example, the words 'bag' and 'tool' are of interest in the sentence "He had a bag of tools in his car."
With the quanteda kwic() function I can find 'bag' and 'tool' individually, but this often gives me an overload of results. I need, for example, 'bag' and 'tools' within five words of each other.
You can use the fcm() function to count co-occurrences within a fixed window, for instance 5 words. This creates a "feature co-occurrence matrix". The context can be a token window of any size, or an entire document.
For your example, or at least an example based on my interpretation of your question, this would look like:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
d1 = "He had a bag of tools in his car",
d2 = "bag other other other other tools other"
)
fcm(txt, context = "window", window = 5)
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
##         features
## features He had a bag of tools in his car other
##   He      0   1 1   1  1     1  0   0   0     0
##   had     0   0 1   1  1     1  1   0   0     0
##   a       0   0 0   1  1     1  1   1   0     0
##   bag     0   0 0   0  1     2  1   1   1     4
##   of      0   0 0   0  0     1  1   1   1     0
##   tools   0   0 0   0  0     0  1   1   1     5
##   in      0   0 0   0  0     0  0   1   1     0
##   his     0   0 0   0  0     0  0   0   1     0
##   car     0   0 0   0  0     0  0   0   0     0
##   other   0   0 0   0  0     0  0   0   0    10
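Here, the bag row and tools column show a count of 2: the two terms co-occur within 5 tokens once in the first document, where they are two tokens apart, and once in the second document, where they are exactly five tokens apart, which still falls inside the window. Pairs that are more than 5 tokens apart are not counted. For instance, bag and the final other in the second document are six tokens apart, which is why the bag/other count is 4 rather than 5.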
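The matrix gives counts per word pair, but not which documents or positions produced them. If you also need that, here is a minimal base R sketch. The function within_distance() is a hypothetical helper, not part of quanteda. It assumes a naive tokenization (lowercase, split on anything that is not a letter, digit or apostrophe) and measures distance as the difference in token positions, so its notion of a token may differ from quanteda's.

```r
# Hypothetical helper (not a quanteda function): list every place where
# `word1` and `word2` occur within `window` tokens of each other.
# Assumes `word1` and `word2` are given in lowercase.
within_distance <- function(texts, word1, word2, window = 5) {
  hits <- lapply(seq_along(texts), function(i) {
    # Naive tokenization: lowercase, then split on non-word characters
    tokens <- strsplit(tolower(texts[[i]]), "[^[:alnum:]']+")[[1]]
    tokens <- tokens[nzchar(tokens)]

    pos1 <- which(tokens == word1)
    pos2 <- which(tokens == word2)
    if (length(pos1) == 0 || length(pos2) == 0) return(NULL)

    # All position pairs, keeping only those no more than `window` apart
    pairs <- expand.grid(pos1 = pos1, pos2 = pos2)
    pairs$distance <- abs(pairs$pos1 - pairs$pos2)
    pairs <- pairs[pairs$distance <= window, , drop = FALSE]
    if (nrow(pairs) == 0) return(NULL)

    # Use the document name if there is one, otherwise its index
    doc <- if (is.null(names(texts))) as.character(i) else names(texts)[i]
    data.frame(doc = doc, pairs, stringsAsFactors = FALSE, row.names = NULL)
  })
  # One row per match; NULL if nothing matched in any document
  do.call(rbind, hits)
}

txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other"
)
within_distance(txt, "bag", "tools", window = 5)
within_distance(txt, "bag", "tools", window = 4)
```

Under this helper's definition, the first call returns one row each for d1 (distance 2) and d2 (distance 5), while narrowing the window to 4 leaves only d1. For real corpora you would probably want to swap the naive strsplit() step for quanteda's own tokenizer so that both approaches agree on what a token is.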