I have ingested a PDF into Elastic with this command:
curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd"
-d "@$json_file" "$host/$index/_doc/$entree?pipeline=attachment"
The PDF has this pdfinfo output:
Title: t416. Urbanisme : la loi ELAN
Subject:
Keywords: ELAN, construction, marchand de sommeil, lutte contre les recours abusifs
Author: Marc Le Bihan
Creator: LaTeX via pandoc
Producer: pdfTeX-1.40.24
CreationDate: Fri Nov 10 04:56:26 2023 CET
ModDate: Fri Nov 10 04:56:26 2023 CET
Custom Metadata: yes
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 612 x 792 pts (letter)
Page rot: 0
File size: 90038 bytes
Optimized: no
PDF version: 1.5
When I query my index with the word abusifs, the plural of abusif in French:
GET apprentissage/_search
{
"query": {
"query_string": {
"query": "abusifs"
}
},
"_source": {
"includes": [ "attachment.modified", "attachment.title", "attachment.content"]
}
}
it finds the entry:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 7.6852083,
"hits": [
{
"_index": "apprentissage",
"_id": "t416-urbanisme-la_loi_ELAN",
"_score": 7.6852083,
"_ignored": [
"attachment.content.keyword",
"data.keyword"
],
"_source": {
"attachment": {
"modified": "2023-11-10T03:56:26Z",
"title": "t416. Urbanisme : la loi ELAN",
"content": """t416. Urbanisme : la loi ELAN
Loi portant Évolution du Logement, de l’Aménagement et du Numérique
Marc Le Bihan
23/11/2018 : Loi portant évolution du logement, de l’aménagement et du numérique
(ELAN) :
[...]
2) Lutte contre les recours abusifs
[...]"""
}
}
}
]
}
}
But if I attempt to query only its singular form, abusif, it finds nothing:
GET apprentissage/_search
{
"query": {
"query_string": {
"query": "abusif"
}
},
"_source": {
"includes": [ "attachment.modified", "attachment.title", "attachment.content"]
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
I was expecting the ingest pipeline to detect the language by itself; did it fail?
Should I set that language more explicitly, either in my ingest command or inside the PDF, since my document doesn't seem to be indexed in French?
Or is it my query that isn't the right one for this search?
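Since the mapping below shows an attachment.language field created by the attachment processor, a query along these lines (just a sketch) shows what language was actually detected for the document:

GET apprentissage/_search
{
  "_source": {
    "includes": [ "attachment.language" ]
  }
}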
The /apprentissage index where the documents are ingested:
{
"apprentissage": {
"aliases": {},
"mappings": {
"properties": {
"attachment": {
"properties": {
"author": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content_length": {
"type": "long"
},
"content_type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"creator_tool": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date": {
"type": "date"
},
"format": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"keywords": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"language": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"modified": {
"type": "date"
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"data": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"number_of_shards": "1",
"provided_name": "apprentissage",
"creation_date": "1694840235250",
"number_of_replicas": "1",
"uuid": "yMn4iKJxT42s5gOX2rFZYw",
"version": {
"created": "8100099"
}
}
}
}
}
My ingest script:
#!/bin/bash
export source=$1

# The source parameter must be provided
if [ -z "$source" ]; then
   echo "The name of the PDF file to index into Elastic is expected as a parameter." >&2
   exit 1
fi

# If the source file has no extension, append .pdf
if [[ "$source" != *"."* ]]; then
   source=$source.pdf
fi

# It must have the .pdf extension
if [[ "$source" != *".pdf" ]]; then
   echo "The file to index into Elastic must have the .pdf extension" >&2
   exit 1
fi

host="http://localhost:9200"
user="elastic"
pwd="...."
index=apprentissage

entree=$(basename "${source%.*}")
json_file=$(mktemp)
cur_url="$host/$index/_doc/$entree?pipeline=attachment"

# Base64-encode the PDF into the "data" field expected by the attachment pipeline
echo '{"data" : "'"$( base64 "$source" -w 0 )"'"}' >"$json_file"
# echo "transfer via $json_file to $cur_url"

if ! ingest=$(curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd" -d "@$json_file" "$cur_url"); then
   echo "Ingestion of $source into Elastic failed: $ingest" >&2
   exit 1
fi

rm "$json_file"
echo "$source indexed into Elastic"
According to your mapping, the attachment.content field is analyzed by the standard analyzer, since no other analyzer is specified. The standard analyzer is not French-aware and hence doesn't do any French stemming, so abusif and abusifs are two different words. Hence the results you're seeing.
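You can verify this with the _analyze API; the first call below analyzes the text the way your attachment.content field currently does, and the second uses the french analyzer for comparison (a sketch, with sample text taken from your document):

GET apprentissage/_analyze
{
  "field": "attachment.content",
  "text": "recours abusifs abusif"
}

GET _analyze
{
  "analyzer": "french",
  "text": "recours abusifs abusif"
}

The first call should return abusifs and abusif as two distinct tokens, while the french analyzer should reduce both to the same stem.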
If you know you'll be indexing only French content, you can make your content field French-aware by using a French analyzer that does proper stemming.
You need to recreate your index with the following mapping:
"content": {
"type": "text",
"analyzer": "french", <--- add this analyzer
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
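For illustration, if you want to keep the current index around and copy its documents over with _reindex, you could instead create a new index with that analyzer under a different name; apprentissage-fr below is a hypothetical name, and the other attachment sub-fields are left to dynamic mapping:

PUT apprentissage-fr
{
  "mappings": {
    "properties": {
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "french",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}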
Then you need to reindex your content, and when done, your search query will work as you expect and will find the document when searching for both abusifs and abusif.
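A minimal sketch of that reindex step, assuming the French-aware index was created as apprentissage-fr as above (the extracted attachment fields are already in _source, so no ingest pipeline is needed here):

POST _reindex
{
  "source": {
    "index": "apprentissage"
  },
  "dest": {
    "index": "apprentissage-fr"
  }
}

Alternatively, you could delete and recreate apprentissage with the new mapping and simply re-run your ingest script on the PDFs.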
QED ;-)