
Elasticsearch "get by index" returns the document, while "match_all" returns no results


I am trying to mock elasticsearch data for hosted CI unit-testing purposes.

I have prepared some fixtures that I can successfully load with bulk(), but then, for some unknown reason, I cannot match anything, even though test_index seemingly contains the data (I can get() items by their IDs).

fixtures.json is a subset of ES documents that I fetched from a real production index. With the real-world index, everything works as expected and all tests pass.

An artificial example of the strange behaviour follows:

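import json
from unittest import TestCase  # or your framework's TestCase

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


class MyTestCase(TestCase):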
    es = Elasticsearch()

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.es.indices.create('test_index', SOME_SCHEMA)

        with open('fixtures.json') as fixtures:
            bulk(cls.es, json.load(fixtures))

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()
        cls.es.indices.delete('test_index')

    def test_something(self):
        # check all documents are there:
        with open('fixtures.json') as fixtures:
            for f in json.load(fixtures):
                print(self.es.get(index='test_index', id=f['_id']))
                # yes they are!

        # BUT:
        match_all = {"query": {"match_all": {}}}
        print('hits:', self.es.search(index='test_index', body=match_all)['hits']['hits'])
        # prints `hits: []`, as if the index were empty

        print('count:', self.es.count(index='test_index', body=match_all)['count'])
        # prints `count: 0`

Solution

  • While I can completely understand your pain (everything works except for the tests), the answer is actually quite simple: the tests, in contrast to your experiments, are too quick.

    • Elasticsearch is a near real-time search engine: with the default refresh interval, there is a delay of up to 1 s between indexing a document and it becoming searchable. A get() by ID is real-time, which is why it already finds the documents.
    • There is also an unpredictable delay (depending on the actual overhead) between creating an index and it being ready.

    So the fix would be time.sleep(), to give ES enough time to work all the sorcery it needs before it can give you results. I would do this:

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.es.indices.create('test_index', SOME_SCHEMA)
    
        with open('fixtures.json') as fixtures:
            bulk(cls.es, json.load(fixtures))
    
        cls.wait_until_index_ready()
    
    @classmethod
    def wait_until_index_ready(cls, timeout=10):
        for sec in range(timeout):
            time.sleep(1)
            if cls.es.cluster.health().get('status') in ('green', 'yellow'):
                break
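
As a possible alternative to sleeping for a fixed number of seconds, the first point above can be addressed directly: forcing a refresh makes the documents indexed so far visible to search. The sketch below is not part of the original answer. The helper name wait_until_searchable and its parameters are hypothetical, and it only assumes that the client exposes indices.refresh(index=...) and count(index=...), as elasticsearch-py does. Unlike the loop above, it raises if the documents never show up instead of continuing silently.

```python
import time


def wait_until_searchable(es, index, expected_count, timeout=10.0, interval=0.2):
    """Force a refresh, then poll until `expected_count` docs are searchable.

    `es` is assumed to be an elasticsearch-py client (or any object that
    exposes `indices.refresh(index=...)` and `count(index=...)`).
    Hypothetical helper, not part of elasticsearch-py itself.
    """
    # A refresh makes the documents indexed so far visible to search.
    es.indices.refresh(index=index)

    deadline = time.monotonic() + timeout
    while True:
        count = es.count(index=index)["count"]
        if count >= expected_count:
            return count
        if time.monotonic() >= deadline:
            # Fail loudly rather than let the tests run against an empty index.
            raise TimeoutError(
                f"only {count} of {expected_count} documents searchable in {index!r}"
            )
        time.sleep(interval)
```

In setUpClass this could be called right after bulk(), for example wait_until_searchable(cls.es, 'test_index', len(actions)), where actions is the list loaded from fixtures.json. Passing refresh=True to bulk() should have a similar effect, since the helper forwards extra keyword arguments to the bulk API, but check that against the client version you use.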