architecture lucene distributed cqrs event-sourcing

Duplicating all data in Lucene index

We build Lucene indexes from data that is stored in an event store as a stream of events. The indexes give us efficient paging, sorting, and search over that data.

It turns out that we have to duplicate all of the data in the indexes to meet our requirements. Conceptually, what is the best way to query data in this situation?

I see 2 options:

1. Query all the data needed to build view models directly from the index.
2. Query only a list of ids from the index and use those ids to query the data from the event store.

We are also concerned about scalability and fault tolerance, so I have to take those into account too. Any suggestions?

Solution

I think option #1 is better. Store in the index only the fields you need to build the model for the paged/filtered table, and fetch them from there. It's very fast, since the query never has to leave the index.
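To make option #1 concrete, here is a minimal sketch in plain Python. It is not Lucene code: `ReadIndex`, `IndexedRow`, and the order fields are hypothetical names, and the dictionary stands in for an index with stored fields. The point is only that the page of view models comes straight out of the index, with no second lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedRow:
    # Hypothetical view-model row: only the fields the paged table shows.
    order_id: str
    customer: str
    total: float


class ReadIndex:
    """In-memory stand-in for a search index with stored fields (assumption)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, row):
        # Called by the projection whenever an event changes the row.
        self._rows[row.order_id] = row

    def search(self, customer_prefix, sort_key, page, page_size):
        # Filter, sort, and page entirely inside the index.
        hits = [r for r in self._rows.values()
                if r.customer.startswith(customer_prefix)]
        hits.sort(key=lambda r: getattr(r, sort_key))
        start = page * page_size
        return hits[start:start + page_size]


index = ReadIndex()
index.upsert(IndexedRow("o-1", "acme", 120.0))
index.upsert(IndexedRow("o-2", "acme", 40.0))
index.upsert(IndexedRow("o-3", "globex", 75.0))

# The returned rows are already the view models; the event store is not touched.
page = index.search("acme", sort_key="total", page=0, page_size=10)
```

With a real index the same shape applies: the fields shown in the table are stored alongside the searchable ones, so a query result is enough to render the page.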

Hibernate Search uses an approach similar to option #2: it stores the id and the Java class in the index, finds the matches there, and then fetches the entities from the DB. This can be circumvented when the fetch is too costly. I recently had a case where I did exactly that, because the default behaviour was killing my DB. It works like a charm.
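For comparison, a rough sketch of the option #2 shape, again in plain Python rather than any real Hibernate Search or Lucene API. `IdIndex` and `PrimaryStore` are hypothetical stand-ins; the `reads` counter is there only to show that every hit costs one extra read against the primary store.

```python
class IdIndex:
    """Stand-in for an index that keeps only ids and searchable text (assumption)."""

    def __init__(self):
        self._names = {}  # entity id -> searchable name

    def add(self, entity_id, name):
        self._names[entity_id] = name

    def search_ids(self, term):
        # The index answers with ids only, never with full records.
        return sorted(i for i, name in self._names.items() if term in name)


class PrimaryStore:
    """Stand-in for the DB or event store that holds the full records."""

    def __init__(self, records):
        self._records = records
        self.reads = 0

    def load(self, entity_id):
        self.reads += 1  # each hit triggers one read against the store
        return self._records[entity_id]


records = {
    1: {"name": "red chair"},
    2: {"name": "blue chair"},
    3: {"name": "red table"},
}
store = PrimaryStore(records)
id_index = IdIndex()
for entity_id, record in records.items():
    id_index.add(entity_id, record["name"])

# Step 1: the index yields ids. Step 2: each id is resolved in the store.
results = [store.load(i) for i in id_index.search_ids("chair")]
```

The number of store reads grows with the page size, which is the kind of load that can hurt a DB under heavy search traffic.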

I have never experienced index corruption (across 4 projects), but reindexing should definitely be possible in the application.
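Since the index is derived from the event stream, reindexing can be as simple as replaying the events into an empty index. A minimal sketch, assuming a hypothetical event shape of `(kind, entity_id, data)` and a plain dictionary in place of the real index:

```python
def apply(index, event):
    # Hypothetical event shape: (kind, entity_id, data).
    kind, entity_id, data = event
    if kind == "deleted":
        index.pop(entity_id, None)
    else:
        # "created" and "updated" both merge their fields into the document.
        index[entity_id] = {**index.get(entity_id, {}), **data}


def rebuild_index(events):
    # Start from an empty index and replay the whole stream in order.
    index = {}
    for event in events:
        apply(index, event)
    return index


events = [
    ("created", "u-1", {"name": "Ann", "city": "Oslo"}),
    ("created", "u-2", {"name": "Bob", "city": "Rome"}),
    ("updated", "u-1", {"city": "Bergen"}),
    ("deleted", "u-2", {}),
]

rebuilt = rebuild_index(events)
```

Because the same `apply` function can serve both live updates and the rebuild, a lost or corrupted index can be recreated from the event store alone.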

Do you use event snapshots? They could be indexed as well.