mysql search elasticsearch diskspace overhead

Overhead of an elasticsearch river?

I am new to elasticsearch but don't really know how to think about the disk space & memory usage involved in setting up a river (a mysql river in my case).

What is the overhead involved in a river?; especially regarding diskspace & memory usage? This has been asked but not answered.

In other words, assume I have a table with 3 columns: primary_key (integer), url (varchar) and document_text (text). Also, assume I am currently doing full-text search 100% in mysql (stupid, I know, but just for argument sakes). Each of the 3 columns has an index on it, with "document_text" index being a full-text index. This is a very large table and I want to minimize duplicate data.

How should I think about what is going on w/ a mysql river? With a river, would I simply remove the full-text index from the "document_text" column & move that over to elasticsearch (along with the primary_key from mysql)? Elasticsearch would not need to index the "url", since we aren't searching on that, correct? The data for document_text is stored in mysql but the index stored in elasticsearch so the there is effectively a zero increase in the disk-space used?

EDIT:

I guess my main question is will I be storing the underlying data twice or does elasticsearch just store the index?

Solution

Elasticsearch by default creates a field named "_source" that contains all of the source data in json format that was submitted for indexing. This is where document data will be retrieved from when you query Elasticsearch:

_source

The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. It is not indexed (searchable), just stored. When executing "fetch" requests, like get or search, the _source field is returned by default.

Though very handy to have around, the source field does incur storage overhead within the index. For this reason, it can be disabled.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html

You can turn this off if you want to but that means unless you then store the individual fields queries will not return the full document.

I'd need to know more about your specific use case to know if you should remove your full text index in mysql or not, but in general the primary reason to use the Elasticsearch mysql River is to facilitate text searches, so why do both?

Two other thoughts for you. First, other than for testing and development I'd avoid running ES on the same server as MySQL - they will compete with each other for cpu, disk space and especially memory and having both on the same server makes it's much harder to tune them well. Second, no matter what you do you will take up some significant space if you are indexing a large set of documents. You can limit this by playing the the _source field but in the end the ES index needs to index every token and it's location - it's not as much as storing the full document but it does take a non trivial amount of space. How much will depend on your documents - you would need to run a test indexing with your specific settings (field types, analysis chains and a reasonable sample of data) to nail down rough disk usage.