Parallel updating of one RT index


Is it possible to update one Real-time Sphinx index in parallel?

To clarify: I have one RT index, named e.g. RT1, and I want two or more updaters writing to RT1.

For instance, if I have 100 files in the queue, I want to add 2 files in parallel to the index. Is Sphinx capable of multi-threading or is Sphinx not thread-safe?

The main question is, will Sphinx corrupt itself when multiple files are being added to the same index at the same time? I wasn't able to find the answer in the documentation.

Keep in mind that the script adding files to the Sphinx RT index is multithreaded, so multiple files will be added to one index at the same time (in parallel).

Version:

Sphinx 2.2.9-id64-release (rel22-r5006)

Config:

index_name
{
  type            = rt
  path            = /mnt/data001/index_name
  rt_field        = FileName
  rt_field        = FileExtension
  rt_field        = FileContent
  rt_field        = FileTags
  rt_attr_uint    = FileReference
  rt_attr_uint    = FileSize
  rt_attr_uint    = LastModified
  rt_attr_uint    = LastModifiedYear
  rt_attr_uint    = LastModifiedMonth
  rt_attr_uint    = LastModifiedDay
  rt_attr_string  = FileContent
  rt_mem_limit    = 1024M
  charset_table   = A..Z, a..z, 0..9, U+E1, U+E9, U+FA
  ondisk_attrs    = pool
}

searchd
{
  listen                = 9306:mysql41
  log                   = /var/log/sphinxsearch/searchd.log
  read_timeout          = 5
  max_children          = 30
  pid_file              = /var/run/searchd.pid
  max_packet_size       = 128M
  binlog_path           = /mnt/data001
}

Note that the strings can consist only of the characters A..Z, a..z, 0..9, U+E1, U+E9 and U+FA. (I have verified this.)

Test: I used a C++ application on Ubuntu that communicates with Sphinx through a MySQL connector.


Solution

  • I have verified this issue, and be warned: updating the index in parallel is not possible! My index partially corrupted itself, although neither the index nor the daemon crashed. You will not see this issue at first glance. I verified it by selecting each value directly after inserting it, and the returned value did not always match the inserted value.

    For example, I inserted test but got back t^463t from the select performed directly after the insertion.

    For this test I inserted 1,000,000 documents from a two-threaded application, and 43,372 of them showed the issue described above. The exact number of course depends on the rate at which documents are inserted in parallel, but Sphinx doesn't seem to be thread-safe. (I assume even more documents would be corrupted with more threads inserting in parallel.)

    Sometimes I also noticed that words from multiple documents were concatenated (those documents were inserted at the exact same moment).
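
The read-back check described above could be sketched roughly as follows. This is not the original C++ test, only a minimal Python illustration of the same idea. The `execute(sql, params)` callable, the helper names and the `FakeIndex` class are all hypothetical: `execute` stands in for whatever MySQL-protocol client talks to searchd on port 9306, and `FakeIndex` exists only so the sketch runs without a Sphinx instance. The SQL assumes the `index_name` index and `FileContent` column from the config in the question.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# SphinxQL statements, assuming the index and column names from the config above.
INSERT_SQL = "INSERT INTO index_name (id, FileContent) VALUES (%s, %s)"
SELECT_SQL = "SELECT FileContent FROM index_name WHERE id = %s"


def insert_and_verify(execute, doc_id, content):
    """Insert one document, select it straight back, and report whether it matches."""
    execute(INSERT_SQL, (doc_id, content))
    rows = execute(SELECT_SQL, (doc_id,))
    return len(rows) == 1 and rows[0][0] == content


def count_mismatches(execute, documents, threads=2):
    """Run the check from several threads and return the ids that did not round-trip."""
    def check(doc):
        doc_id, content = doc
        return doc_id, insert_and_verify(execute, doc_id, content)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return [doc_id for doc_id, ok in pool.map(check, documents) if not ok]


class FakeIndex:
    """Hypothetical in-memory stand-in for a SphinxQL connection (for the sketch only)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def execute(self, sql, params):
        with self._lock:
            if sql.startswith("INSERT"):
                doc_id, content = params
                self._rows[doc_id] = content
                return []
            (doc_id,) = params
            return [(self._rows[doc_id],)] if doc_id in self._rows else []
```

Against a real searchd, `execute` would wrap a cursor from a MySQL client library, and each worker thread would most likely need its own connection, since client connections are generally not safe to share between threads. The list returned by `count_mismatches` then gives the documents whose stored value differs from what was inserted.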