Working on Solr incremental data import from an existing normalized mssql database. I'm unable to decide on the strategy I need to implement, or not knowing whether there are existing tools to do the same, so I don't need to reinvent the wheel.
I need to import a document into Solr 3.6 to build a Solr Data, which is saved in MSSQL in heavily normalized fashion. To retrieve the data for single document, there are many joins required which is killing performance. I have appx. 1 million such document in db. So full import into Solr is not an option for me.
While deciding the approach I have two issues to consider:
I am looking after your help in deciding the strategy and tool for incremental data import into Solr. I think, I have following options:
Custom develop application to fetch data from MSSQL and pass it to Solr. I need to keep track of data as what all records are inserted into Solr and what are pending. Again, 2% data records in MSSQL keeps updated on daily basis, so need to track what data has changed since then, and then update them again at some point of time into Solr.
Use any existing tool or utility in Solr to do the same, like DIH. I'm not sure how this will address both of the issue of incremental data retrieval and how it will track what data has change in SQL server? Again, not sure how DIH will handle complex joins requires to fetch data from db.
Or use something like Lusql with DIH, bust still not sure about how it will address both the issues. Although Lusql will give ability to do complex joins in db, so I hope this might fit my purpose.
I'm in favor of using LuSQL with DIH in Solr, if it can fit the purpose, but still not sure how it keep track of what data has change? Or for this part I have to manage manually by maintaining the document id where the change is made, and then supplying it to LuSQL to fetch data from SQL and import into Solr.
I am also looking forward for your suggestions beyond this to handle this kind of situations.
I will share with you the way i do this.
Mainly I have the same requierements and until this week I used solr dataimport with delta imports. I have a program that regularly updates a status for the new items from 0 to 1 and then calls solr data import to get all the documents with status 1. Solrdataimport uses a stored procedure to join and get the documents with status 1 from db. If the import finish successfully I then update the status to 2 and I know that this documents are in solr. If a documents get changed I simply change from status 2 to status 0 and then the import process updates the document in solr.
Everything works fine for me using this process. I always get the new documents in solr without having to fetch all the data form the database.
Now my requirements have changed because we decide to keep the date archived in the database, as we only need it in solr. So I need to have a program that deserialize the data and then have it sent to solr.
My approach now is to add all the new/updated documents via update handler and after I added all the documents to commit them, and if the commit is successful then I update the status in the database. With this approach I have no experience yet so I don't know if it will work or not but I will just try and see what happens.
I researched in the past a better way to do this but I couldn't find anything so if you find a better solution please share it with me.
Good luck :)