I'm using ElasticSearch 0.90.7, so I don't think the answer to "What exactly does the Standard tokenfilter do in Elasticsearch?" applies (although what I'm seeing is similar).
I build the index as follows:
curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
"mappings" : {
"article" : {
"properties" : {
"text" : {
"type" : "string"
}
}
}
}
}'
I index the following two documents:
curl -XPUT "http://localhost:9200/testindex/article/1" -d'{
"text": "file name. pdf"
}'
curl -XPUT "http://localhost:9200/testindex/article/2" -d'{
"text": "file name.pdf"
}'
Searching for the phrase "file name" returns only document #1:
curl -XPOST "http://localhost:9200/testindex/_search" -d '{
"fields": [],
"query": {
"query_string": {
"default_field": "text",
"query": "\"file name\""
}
}
}'
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "testindex",
        "_type": "article",
        "_id": "1",
        "_score": 0.30685282
      }
    ]
  }
}
... given this, I'm guessing that the standard tokenizer is changing document #2 from "file name.pdf" into "file namepdf".
My questions are: is that actually what the standard tokenizer is doing, and if so, how can I get the phrase search to match both documents?
You can check for yourself using the Analyze API.
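For example, something along these lines (assuming the node runs on localhost:9200; pretty=true just formats the response):

# analyze doc #1's text, then doc #2's text, with the standard analyzer
curl -XGET "http://localhost:9200/_analyze?analyzer=standard&pretty=true" -d 'file name. pdf'
curl -XGET "http://localhost:9200/_analyze?analyzer=standard&pretty=true" -d 'file name.pdf'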
This yields the tokens file, name, and pdf for "file name. pdf", and the tokens file and name.pdf for "file name.pdf".
The StandardAnalyzer, or rather the StandardTokenizer, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29, which says:
Do not break within sequences, such as “3.2”
So, "name.pdf"
is considered a full word by the StandardTokenizer.
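You can see that rule with the Analyze API too; a quick check against the same local node, which should return "3.2" as a single token but the spaced variant as the separate tokens "3" and "2":

# "3.2" stays whole; "3. 2" is split at the space
curl -XGET "http://localhost:9200/_analyze?analyzer=standard&pretty=true" -d '3.2 3. 2'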
For your query, the SimpleAnalyzer would work. You can use the Analyze API, as well as the elasticsearch-inquisitor plugin, to test the available analyzers.
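For instance, a minimal sketch (same local node assumed; the simple analyzer lowercases and splits on anything that is not a letter, so it should yield file, name, and pdf for both of your documents):

# check how the simple analyzer tokenizes doc #2's text
curl -XGET "http://localhost:9200/_analyze?analyzer=simple&pretty=true" -d 'file name.pdf'

To apply it, you could recreate the index with the analyzer set on the field, reusing your mapping:

curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
  "mappings" : {
    "article" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "analyzer" : "simple"
        }
      }
    }
  }
}'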