How to calculate relevance in Elasticsearch based on associated documents

Main question:

I have one data type, let's call them People, and another associated data type, let's say Reports. In our relational database, a person can have many reports associated with them, and vice versa. These reports can be pretty long, sometimes over 1,000 words, mostly in English.

We want to be able to search people by a keyword, where the results are the people whose reports are most relevant to the keyword. For example, if Person A's reports mention "art director" a lot more than any other person's, we want Person A to show up high in the search results when someone searches for "art director".

More details:

The key thing here is that we don't want to combine all the reports together and add them as a field on the Person model. I think that with 100,000s of People records and 1,000,000s of long reports, this would make the index super big. (And I think there might be limits on how long the text of a field can be.)

The reports are also indexed on their own, so that people can do full text searches of all the reports without considering the People records. This works great already.

To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.

Is this possible, and if so how?

P.S. I am using the Searchkick Ruby gem to generate the Elasticsearch queries through an API. But I can also use the Elasticsearch DSL directly if necessary.

Solution

Answering your questions in order:

1. (...) we want Person A to show up high in the search results if someone searched "art director".

That's exactly what Elasticsearch does, so I would recommend starting with a simple match query:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

From there you can start adding more complexity.
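As a starting point, here is a minimal sketch of the request body for such a match query, built as a plain Python dict so it can be sent with any client. The field name "body" is an assumption for wherever the report text lives in your reports index; it is not taken from your mapping.

```python
import json


def match_query(field, text):
    """Build a request body for a basic full-text match query."""
    return {"query": {"match": {field: {"query": text}}}}


# "body" is a hypothetical field name holding the report text.
request_body = match_query("body", "art director")
print(json.dumps(request_body, indent=2))
```

The same structure can be passed through Searchkick or sent directly to the _search endpoint of the reports index.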

Elasticsearch scores relevance with BM25 by default (since version 5.0), which builds on the same two ideas as TF-IDF:

TF (Term Frequency): The more often a term appears within a document, the more relevant that document is.

IDF (Inverse Document Frequency): The more often a term appears across the entire dataset, the less weight it carries.
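To make the two ideas concrete, here is a toy calculation using the textbook TF-IDF formula. This is only an illustration: it is not the formula Elasticsearch or Lucene actually applies, and the tiny corpus is made up.

```python
import math


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: (share of the doc made up by the term) * log(N / df)."""
    if not doc:
        return 0.0
    tf = doc.count(term) / len(doc)
    # Number of documents that contain the term at least once.
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf


# Made-up, already tokenized documents.
corpus = [
    ["art", "director", "art", "gallery"],
    ["sales", "director", "report"],
    ["sales", "report", "summary"],
]

# "art" is frequent in the first document and rare elsewhere,
# so it scores higher there than "director", which is more common overall.
print(tf_idf("art", corpus[0], corpus))
print(tf_idf("director", corpus[0], corpus))
```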

2. (...) To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.

You are right. The recommendation is not to index a whole book as a single field, but to index its chapters, pages, etc. as separate documents.

https://www.elastic.co/guide/en/elasticsearch/reference/current/general-recommendations.html

There are several structures you can use. Which one to choose depends on the scale of your data and on how you want to present it to your users.

The structures are:

Join field type (parent = author, child = report pages)
Nested field type (array of report pages within an author)
Collapsed results (each doc being a report page, collapsed by author)
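The third option can be sketched as a search over the existing reports index that collapses hits by person. The field names "body" and "person_id" are assumptions. Collapsing requires a single-valued keyword or numeric field, so this sketch assumes each indexed report document carries exactly one person id, which may not hold if a report belongs to several people.

```python
def reports_collapsed_by_person(text, text_field="body", person_field="person_id"):
    """Search report documents and keep only the best-scoring hit per person."""
    return {
        "query": {"match": {text_field: text}},
        # Hypothetical single-valued keyword/numeric field on each report.
        "collapse": {"field": person_field},
    }


request_body = reports_collapsed_by_person("art director")
print(request_body)
```

Note that collapsing ranks each person by their single best-matching report rather than by a combined score across all of their reports.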

We could debate at length which one is best, but I invite you to try them yourself.

Some guidelines:

If reports greatly outnumber authors, you can use the join field type.
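A minimal sketch of the join approach follows, assuming a single index that holds both kinds of documents. The relation names "person" and "report" and the field names "relation" and "body" are hypothetical. The has_child query returns parent documents, and score_mode "sum" adds up the scores of all matching children, so a person with many relevant reports ranks higher.

```python
def join_mapping():
    """Index settings declaring a parent/child relation in one index."""
    return {
        "mappings": {
            "properties": {
                "body": {"type": "text"},
                # Hypothetical relation: each person is the parent of its reports.
                "relation": {"type": "join", "relations": {"person": "report"}},
            }
        }
    }


def people_by_report_text(text):
    """Find parent (person) documents scored by their matching child reports."""
    return {
        "query": {
            "has_child": {
                "type": "report",
                "score_mode": "sum",  # other options: none, avg, max, min
                "query": {"match": {"body": text}},
            }
        }
    }


print(join_mapping())
print(people_by_report_text("art director"))
```

Keep in mind that with a join field each child has exactly one parent, and children must be indexed with routing that places them on the same shard as their parent, so a report shared by several people would need to be indexed once per person.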