Tags: ruby-on-rails, elasticsearch, elasticsearch-rails, elasticsearch-5, elasticsearch-model

How do you use the ingest-attachment plugin with elasticsearch-rails?


I was previously using the now-deprecated mapper-attachments plugin, which was fairly easy to use alongside normal indexing. Now that ingest-attachment has replaced it and requires a pipeline, it is unclear to me how to use it properly.

Let's say I have a model named Media that has a file field containing the Base64-encoded file. I have the following mappings in the model file:

mapping '_source' => { :excludes => ['file'] } do
  indexes :id, type: :long, index: :not_analyzed
  indexes :name, type: :text
  indexes :visibility, type: :integer, index: :not_analyzed
  indexes :created_at, type: :date, include_in_all: false
  indexes :updated_at, type: :date, include_in_all: false

  # attachment specific mappings
  indexes 'attachment.title', type: :text, store: 'yes'
  indexes 'attachment.author', type: :text, store: 'yes'
  indexes 'attachment.name', type: :text, store: 'yes'
  indexes 'attachment.date', type: :date, store: 'yes'
  indexes 'attachment.content_type', type: :text, store: 'yes'
  indexes 'attachment.content_length', type: :integer, store: 'yes'
  indexes 'attachment.content', term_vector: 'with_positions_offsets', type: :text, store: 'yes'
end

I have created an attachment pipeline via curl:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -d'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "file"
      }
    }
  ]
}'
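
The same pipeline could also be registered from Ruby instead of curl. This is a minimal sketch, assuming the installed elasticsearch gem exposes the ingest API as `client.ingest.put_pipeline`; the helper name `create_attachment_pipeline` and the constant are hypothetical:

```ruby
# Same pipeline definition as in the curl request above.
ATTACHMENT_PIPELINE = {
  description: 'Extract attachment information',
  processors: [
    { attachment: { field: 'file' } }
  ]
}.freeze

# Hypothetical helper: registers the pipeline under the id "attachment".
# `client` is assumed to be an Elasticsearch::Client (for example
# Media.__elasticsearch__.client) from a gem version that ships the ingest API.
def create_attachment_pipeline(client)
  client.ingest.put_pipeline(id: 'attachment', body: ATTACHMENT_PIPELINE)
end
```

Calling `create_attachment_pipeline(Media.__elasticsearch__.client)` once, for instance from a setup task, should be equivalent to the curl request.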

Previously, a simple Media.last.__elasticsearch__.index_document was enough to index a record along with the actual file via the mapper-attachments plugin.

I'm not sure how to do this with ingest-attachment using a pipeline and the elasticsearch-rails gem.

I can do the following PUT via curl:

curl -XPUT 'localhost:9200/assets/media/68?pipeline=attachment' -d'
{ "file" : "my_really_long_encoded_file_string" }'

This indexes the encoded file, but obviously it doesn't index the rest of the model's data (or it overwrites the document completely if it was previously indexed). I don't want to have to include every single model attribute along with the file in a curl command. Is there a better or simpler way of doing this? Am I completely off about how pipelines and ingest are supposed to work?
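
The overwrite happens because an index request replaces the entire stored document, so the request body has to carry the model's attributes and the encoded file together. Here is a minimal sketch of building such a body in plain Ruby. The helper `build_indexed_document` and the sample attributes are hypothetical, and it assumes the file bytes are not yet encoded; in a model this logic would presumably live in an `as_indexed_json` override:

```ruby
require 'base64'
require 'json'

# Hypothetical helper: merges the model's attributes with the Base64-encoded
# file under the "file" key, which is the field the attachment processor reads.
def build_indexed_document(attributes, file_bytes)
  attributes.merge('file' => Base64.strict_encode64(file_bytes))
end

# Sample values are made up for illustration.
doc = build_indexed_document(
  { 'id' => 68, 'name' => 'report.pdf', 'visibility' => 1 },
  'raw file bytes'
)
puts JSON.generate(doc)
```

Sending one body like this through the pipeline indexes the regular fields and the extracted attachment fields in a single request.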


Solution

  • Finally figured this out. I needed to update the ES gems, specifically elasticsearch-api.

    With the mappings and pipeline set up as shown in the question, you can simply do:

    Media.last.__elasticsearch__.index_document pipeline: :attachment

    or

    Media.last.__elasticsearch__.update_document pipeline: :attachment

    This will index everything correctly and your file will be properly parsed and indexed via the ingest pipeline.
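
For context, here is a rough sketch of what passing `pipeline:` appears to amount to. It assumes that `index_document` merges its options into the underlying `client.index` call, which would explain why an older elasticsearch-api that did not yet know the `pipeline` parameter had to be updated. The helper name `index_with_pipeline` is hypothetical, and the `type:` argument applies only to Elasticsearch versions that still use mapping types:

```ruby
# Hypothetical helper mirroring what index_document is assumed to do:
# send the full document to the index API with the pipeline as a parameter.
# `client` is assumed to be an Elasticsearch::Client.
def index_with_pipeline(client, index:, type:, id:, body:, pipeline: :attachment)
  client.index(index: index, type: type, id: id, body: body, pipeline: pipeline)
end
```

With the index, type, and id from the curl example, a call would look like `index_with_pipeline(client, index: 'assets', type: 'media', id: 68, body: document)`, where `document` holds all model attributes plus the encoded file.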