elasticsearch-rails, elasticsearch-model, elasticsearch-ruby

custom mapping for mapper attachment type with elasticsearch-persistence ruby


In my project I store data in an ActiveRecord model and index HTML documents in Elasticsearch using the mapper-attachments plugin. My document mapping looks like this:

include Elasticsearch::Model

settings index: { number_of_shards: 5 } do
  mappings do
    indexes :alerted
    indexes :title, analyzer: 'english', index_options: 'offsets'
    indexes :summary, analyzer: 'english', index_options: 'offsets'
    indexes :content, type: 'attachment', fields: { 
                                                    author: { index: "no"},
                                                    date: { index: "no"},
                                                    content: { store: "yes",
                                                               type: "string",
                                                               term_vector: "with_positions_offsets"
                                                            }
                                                  }
  end
end

I ran a query to double-check the document mapping, and this is the result:

    "mappings": {
          "feed_entry": {
              "properties": {
                  "content": {
                      "type": "attachment",
                      "path": "full",
                      "fields": {
                          "content": {
                              "type": "string",
                              "store": true,
                              "term_vector": "with_positions_offsets"
                          },

It works great (note the type: 'attachment' above), and I can search through the HTML documents perfectly.

I have a performance problem with ActiveRecord (backed by MySQL), and I don't really need to keep this data in the database, so I decided to migrate it to Elasticsearch.

I am experimenting with the elasticsearch-persistence gem.

I configured the mapping as below:

include Elasticsearch::Persistence::Model
attribute :alert_id, Integer
attribute :title, String, mapping: { analyzer: 'english' }
attribute :url, String, mapping: { analyzer: 'english' }
attribute :summary, String, mapping: { analyzer: 'english' }
attribute :alerted, Boolean, default: false, mapping: { analyzer: 'english' }
attribute :fingerprint, String, mapping: { analyzer: 'english' }
attribute :feed_id, Integer
attribute :keywords

attribute :content, nil, mapping: { type: 'attachment', fields: { 
                                                      author: { index: "no"},
                                                      date: { index: "no"},
                                                      content: { store: "yes",
                                                                 type: "string",
                                                                 term_vector: "with_positions_offsets"
                                                              }
                                                    } }

But when I query the mapping, I get something like this:

"mappings": {
        "entry": {
            "properties": {
                "content": {
                    "properties": {
                        "_content": {
                            "type": "string"
                        },
                        "_content_type": {
                            "type": "string"
                        },
                        "_detect_language": {
                            "type": "boolean"
                        },

which is wrong. Can anyone tell me how to define a mapping with the attachment type?

Really appreciate your help.


Solution

  • In the meantime, I have to hard-code it this way:

      def self.recreate_index!
        mappings = {}
        mappings[FeedEntry::ELASTIC_TYPE_NAME] = {
          "properties": {
            "alerted": {
              "type": "boolean"
            },
            "title": {
              # for exact match
              "index": "not_analyzed",
              "type": "string"
            },
            "url": {
              "index": "not_analyzed",
              "type": "string"
            },
            "summary": {
              "analyzer": "english",
              "index_options": "offsets",
              "type": "string"
            },
            "content": {
              "type": "attachment",
              "fields": {
                "author": {
                  "index": "no"
                },
                "date": {
                  "index": "no"
                },
                "content": {
                  "store": "yes",
                  "type": "string",
                  "term_vector": "with_positions_offsets"
                }
              }
            }
          }
        }
        options = {
          index: FeedEntry::ELASTIC_INDEX_NAME
        }
        self.gateway.client.indices.delete(options) rescue nil
        self.gateway.client.indices.create(options.merge(body: { mappings: mappings }))
      end
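
To see what the create call above actually sends, here is a minimal sketch that builds only the attachment part of the request body and prints it as JSON, without talking to a cluster. The type name 'feed_entry' and the helper names attachment_mapping and create_index_body are assumptions made for illustration; the original code uses FeedEntry::ELASTIC_TYPE_NAME and builds the hash inline.

```ruby
require 'json'

# Hypothetical helper: the attachment field mapping from the answer above.
def attachment_mapping
  {
    type: 'attachment',
    fields: {
      author:  { index: 'no' },
      date:    { index: 'no' },
      content: { store: 'yes', type: 'string', term_vector: 'with_positions_offsets' }
    }
  }
end

# Hypothetical helper: wraps the field mapping in the body shape that is
# passed to indices.create as `body:` (mappings keyed by type name).
def create_index_body(type_name)
  { mappings: { type_name => { properties: { content: attachment_mapping } } } }
end

# 'feed_entry' is an assumed type name standing in for FeedEntry::ELASTIC_TYPE_NAME.
body = create_index_body('feed_entry')
puts JSON.pretty_generate(body)
```

Printing the body this way makes it easy to compare against the mapping Elasticsearch reports back after the index is created.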
    

    And then override the to_hash method

      def to_hash(options={})
        hash = self.as_json
        map_attachment(hash) if !self.alerted
        hash
      end
    
      # Encode the content in Base64 format
      def map_attachment(hash)
        hash["content"] = {
          "_detect_language": false,
          "_language": "en",
          "_indexed_chars": -1 ,
          "_content_type": "text/html",
          "_content": Base64.encode64(self.content)
        }
        hash
      end
    

    Then I have to call

    FeedEntry.recreate_index! 
    

    beforehand to create the mapping in Elasticsearch. Be careful when you update a document: you might end up Base64-encoding the content field twice. In my scenario, I check the alerted field to avoid that.
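
As an alternative to keying off the alerted field, the double-encoding problem can also be avoided by checking the shape of the value itself. This is only a sketch under one assumption: that content is a plain String before it has been wrapped and a Hash afterwards. The helper name attachment_payload is hypothetical and not part of elasticsearch-persistence.

```ruby
require 'base64'

# Hypothetical helper: wraps raw HTML in the structure the attachment type
# expects, but returns the value untouched if it has already been wrapped.
def attachment_payload(content)
  # Assumption: an already-wrapped value is a Hash, raw content is a String.
  return content if content.is_a?(Hash)

  {
    '_content_type' => 'text/html',
    '_content'      => Base64.encode64(content.to_s)
  }
end

once  = attachment_payload('<p>hello</p>')
twice = attachment_payload(once) # second call is a no-op, so no double encoding
```

Calling the helper from to_hash in place of the alerted check would make a repeated save harmless, provided the assumption about the value's type holds for your model.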