Can Elasticsearch's edge_ngram token filter
be set up so that it builds multi-word phrases as ES indexes crawled data?
I'd like to use those multi-word phrases as search suggestions for a small search app that I'm building.
I'm using Nutch to crawl some sites and using ES to index the crawled data.
I figured that since ES can split on whitespace, this shouldn't be that hard. However, I'm not getting the results I expected, so now I'm asking whether this is even possible.
My ES index is set up like this:
PUT /_template/autocomplete_1
{
  "template": "auto*",
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": "1",
          "max_gram": "30",
          "token_chars": ["letter", "digit", "whitespace"]
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "anchor": {
          "type": "string"
        },
        "boost": {
          "type": "string"
        },
        "content": {
          "type": "string",
          "index_analyzer": "autocomplete_analyzer",
          "search_analyzer": "standard"
        },...
"content"
is the html body field per Nutch. I'm using 'content' as I figured it would generate the most phrases.
For creating multi-word phrases you need shingles: specifically, the shingle token filter, which combines adjacent tokens into multi-word tokens. An edge_ngram token filter on its own only produces prefixes of each individual token, which is why the setup above never yields phrases.
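As a minimal sketch (the filter and analyzer names here are illustrative, not from the question), you would run the shingle filter before the edge_ngram filter, so that each shingle, being a single token containing spaces, gets its phrase prefixes generated:

```json
PUT /_template/autocomplete_1
{
  "template": "auto*",
  "settings": {
    "analysis": {
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        },
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 30
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingle_filter",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
```

With this chain, an input like "search suggestions" first becomes shingles ("search", "suggestions", "search suggestions"), and the edge_ngram filter then emits prefixes of each, including multi-word prefixes such as "search s", "search su", and so on. Note that token_chars is a parameter of the edge_ngram tokenizer, not the token filter, so it has been dropped here.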