I have a file of artists' names. I'm trying to search the Art Institute of Chicago's REST API looking for works by those artists. One of the names in the file was 'Romare Beardon'. My Elasticsearch query found nothing.
criteria = {
"query": {"match_phrase" : {"artist_title": "romare beardon"}}
}
There are two problems: the file misspells the last name (it should be 'Bearden'), and the Institute lists the artist's name as 'Romare Howard Bearden'.
So my query needs to forgive minor misspellings and account for middle names.
I have experimented with slop:
criteria = {
"query" : {"match_phrase": {"artist_title": {"query":"Romare Bearden", "slop":1 }}},
}
This correctly finds Romare Howard Bearden!
And I've experimented with fuzziness:
criteria = {
"query": {
"fuzzy" : {"artist_title": {"value": "Beardon", "fuzziness": "AUTO"}}},
}
This finds 'Bearden', but also finds 'Pearson'. Not acceptable.
Changing AUTO to 1 returns nothing. Changing AUTO to 2 returns 'Bearden' and 'Pearson'. Not acceptable.
Can anyone help me write a query that can take 'Romare Beardon' and find 'Romare Howard Bearden'? It needs to be general, so that any first and last name, even if slightly misspelled, will find the first/middle/last form with high precision.
One possible solution is to split the artist_title text into words at index time and then apply fuzziness to each word of the search text.
The whitespace analyzer can be specified when creating the mapping for the Elasticsearch index:
{
"mappings": {
"properties": {
"artist_title": {
"type": "text",
"analyzer": "whitespace"
}
}
}
}
As mentioned above, the artist_title values present in the index are Romare Howard Bearden and Pearson.
For the search text Romar Beardon, the query with fuzziness would be:
{
"query": {
"bool": {
"must": [
{
"fuzzy": {
"artist_title": {
"value": "Romar",
"fuzziness": "AUTO"
}
}
},
{
"fuzzy": {
"artist_title": {
"value": "Beardon",
"fuzziness": "AUTO"
}
}
}
]
}
}
}
This returns the intended result, Romare Howard Bearden, even though the search text is misspelled and contains only the first and last name.
Explanation:
Because the mapping uses the whitespace analyzer, the text is split on whitespace and each word is indexed as a separate term, so each word can later be matched by its own fuzzy query.
However, the search text must be split as well, with one fuzzy query added per word. The fuzziness value can be changed from AUTO to an integer that specifies the maximum number of edits allowed.
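Splitting the search text by hand gets tedious for a whole file of names, so here is a minimal Python sketch of a helper that builds the bool/must query shown above from a name string. The function name build_fuzzy_name_query is my own and not part of any Elasticsearch client. The optional prefix_length is a standard fuzzy query parameter that I am adding as a suggestion: setting it to 1 requires the first character to match exactly, which is one way to keep 'Pearson' from matching 'Beardon', at the cost of missing typos in the first letter.

```python
def build_fuzzy_name_query(name, field="artist_title", fuzziness="AUTO", prefix_length=0):
    """Build a bool/must query with one fuzzy clause per whitespace-separated word.

    Hypothetical helper for illustration; it only assembles the request body.
    """
    words = name.split()
    if not words:
        raise ValueError("name must contain at least one word")

    clauses = []
    for word in words:
        params = {"value": word, "fuzziness": fuzziness}
        # Only send prefix_length when it differs from the Elasticsearch default of 0.
        if prefix_length > 0:
            params["prefix_length"] = prefix_length
        clauses.append({"fuzzy": {field: params}})

    return {"query": {"bool": {"must": clauses}}}


# Same request body as the hand-written query for "Romar Beardon".
criteria = build_fuzzy_name_query("Romar Beardon")
```

Every word in the search text must match some term in artist_title, and extra terms in the indexed name (such as a middle name) do not prevent a match.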
For AUTO, according to the docs:
Generates an edit distance based on the length of the term. Low and high distance arguments may be optionally provided AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6 that make for lengths:
0..2 Must match exactly
3..5 One edit allowed
>5 Two edits allowed
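To make the AUTO rule concrete, here is a small Python sketch of the thresholds quoted above. The function auto_fuzziness is only an illustration of the documented rule, not Elasticsearch code, and it assumes the default AUTO:3,6.

```python
def auto_fuzziness(term, low=3, high=6):
    """Return the number of edits AUTO:[low],[high] allows for a term of this length."""
    length = len(term)
    if length < low:
        return 0  # lengths 0..2 must match exactly
    if length < high:
        return 1  # lengths 3..5 allow one edit
    return 2      # lengths above 5 allow two edits


for term in ("Roma", "Romar", "Beardon"):
    print(term, auto_fuzziness(term))
# Roma 1
# Romar 1
# Beardon 2
```

So with AUTO, Beardon (7 characters) is allowed two edits, while Romar (5 characters) is allowed only one.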
Alternatively, with "fuzziness": "2", a term matches if it is at most 2 edits away from the search value. For instance, Roma would also return the result, because only the 2 characters r and e are missing.
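If you want to check how many edits separate two terms, here is a rough Python sketch of plain Levenshtein distance. Note the assumption: Elasticsearch by default also counts a swap of two adjacent characters as a single edit, which this sketch does not do, but none of the examples below involve a swap. The comparison is case-sensitive, which matters here because the whitespace analyzer does not lowercase terms.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("Roma", "Romare"))      # 2
print(edit_distance("Beardon", "Bearden"))  # 1
print(edit_distance("Beardon", "Pearson"))  # 2
print(edit_distance("romar", "Romare"))     # 2 (the case difference counts as an edit)
```

This also shows why a fuzziness of 2 on Beardon lets Pearson through: it is exactly two substitutions away.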
Hope this helps.