In this article we look at the worker modes currently implemented in Manticore Search and how to tune their parameters.
Manticore Search currently offers two multi-processing modes, controlled by the workers directive. The default MPM is currently thread_pool, with threads as the alternative.
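As a minimal sketch, switching between the two modes is just a matter of setting the workers directive in the searchd section of the configuration (the section is abbreviated here and the value shown is simply the default):

    searchd
    {
        # thread_pool is the default; set to 'threads' for a dedicated thread per connection
        workers = thread_pool
    }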
Threads
In the multi-processing mode ‘threads’ a new dedicated thread is created for every incoming network connection. The thread stays active until the client disconnects - that is, as long as the connection is alive. During this time, the thread executes the incoming queries from that connection.
By default, the engine will spawn an unlimited number of threads - if the operating system allows it. While this might look good, in reality allowing the creation of unlimited threads comes with some costs. First, creating and destroying a thread has a CPU cost. While this is low compared to forking a process, it still exists.
Spawning hundreds or thousands of threads per second can affect the performance of the system, depending on the CPU cores available. Having a much larger number of threads than available cores leads to these threads fighting over resources - the CPU cores first of all, but also the storage.
For example, if we have a 16-core system but allow 160 search threads to spawn, that means 10 threads per CPU core. The threads will not run simultaneously; they will have to wait in line for the CPU core to allocate cycles to them. On the storage side, we can have 160 potential requesters reading data. If the storage cannot keep serving the threads without delays (latencies), those delays also affect CPU scheduling: a thread may get the green light to execute its calculations on a CPU core, but still has to wait for the data.
The max_children setting allows putting a limit on how many threads can be active at once. When the limit is reached, the engine starts dismissing incoming connections with a ‘maxed out’ error. When sizing max_children, keep in mind that a thread is killed only when the client closes the connection. If the connection is idling, the thread remains active and is counted against max_children. max_children can be increased up to a limit that depends on the server’s capability to handle a certain number of active queries, both in terms of processing (CPU) and IO operations (storage). One frequent mistake is raising max_children to very high values. There is a point, depending on system capabilities, where spawning more threads only slows the queries down. If there is nothing more to be gained from optimizing indexes and queries, adding a new search server to split the load should be considered.
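As an illustration only (the number is not a recommendation and depends entirely on the hardware and the query mix), a threads-mode setup that refuses connections beyond a chosen limit might look like this:

    searchd
    {
        workers      = threads
        # at most 64 client threads may be active at once;
        # further connections are rejected with a 'maxed out' error
        max_children = 64
    }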
Thread_pool
In the previous section we saw that threads have a cost and that connections can keep threads alive even when they are not used.
A different strategy for managing threads is not to create a thread per connection, but to use a pool of threads for processing the work. In this case, the worker threads are not tied to incoming connections; the connections are handled by a different class of threads. When the daemon starts, it creates the pool of worker threads. Incoming network connections are handled by so-called network threads, whose job is to receive the queries from the connections and route them to the threads of the pool.
If the pool is too busy (all workers are processing queries), new queries are sent to a pending queue. The server will ‘max out’ if the pending queue gets filled. By default, no limit is set on the queue; however, letting queries wait in the queue for too long only results in queries that take a long time to complete. It’s a good idea to set a limit on the queue to signal that the server is receiving more queries than it can handle. This can be set with the queue_max_length directive. Compared with max_children in the threads mode, where maxing out starts after a certain number of open connections, in the thread_pool mode the daemon starts ‘maxing out’ when the number of active queries exceeds max_children (the number of threads in the pool) + queue_max_length.
In the thread_pool mode, the max_children directive defines the number of threads in the pool, which are created at daemon start. By default, a value of 1.5x the number of CPU cores is used. Setting max_children to high values will not increase the performance or capacity of the server. As these threads do nothing but the actual processing, they need access to the CPU as soon as possible.
Setting high values only leads, as in the previous section, to CPU contention and slower performance, plus a slower start of the daemon. Compared with the threads mode, where max_children can be set to several times the number of CPU cores and values of even several hundred can make sense, in the thread_pool mode it makes little sense to go beyond 2-3x the number of cores.
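Putting the last two directives together, here is a hedged sketch for a hypothetical 16-core server; the figures are only an example of the 2-3x guideline and of the queue limit discussed above:

    searchd
    {
        workers          = thread_pool
        # pool of worker threads created at start, ~2x the 16 CPU cores
        max_children     = 32
        # pending queries allowed before the daemon answers 'maxed out';
        # with these values, maxing out starts at roughly 32 + 1000 active queries
        queue_max_length = 1000
    }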
By default, a single dedicated thread is used to handle the network connections. Its job is generally light, as it only acts as a proxy to the thread pool, and it can sustain a good amount of traffic. The number of network threads can be increased with the net_workers directive. Raising the number of network threads has shown gains for setups with extreme rates of queries per second.
The network thread(s) also have several fine-tuning settings. A network thread might not always be busy, as there can be times (in sub-second terms) when it doesn’t receive queries. The thread will go to sleep and lose its priority for the CPU. The time spent waking up and getting back its place on the CPU is small, but for high performance every millisecond matters.
To overcome this, a busy loop can be activated with the net_wait_tm directive. If the net_wait_tm value is positive, the thread will ‘tick’ the CPU every 10*net_wait_tm milliseconds (net_wait_tm is expressed in milliseconds). If net_wait_tm is zero, the busy loop runs continuously - note that this generates extra CPU usage even if the network thread doesn’t receive much traffic. To disable the busy loop, the negative value -1 should be used. By default, net_wait_tm has a value of 1, meaning the busy loop ticks every 10ms.
The network thread can also be throttled. The net_throttle_accept option enforces a limit on how many clients the network thread accepts at once, and net_throttle_action defines how many requests it processes per iteration. By default, no limit is enforced. Throttling the network thread makes sense in high-load scenarios.
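If a single network thread becomes a bottleneck at very high query rates, it can be scaled out as in this sketch (the value 4 is just an assumption for the example):

    searchd
    {
        workers     = thread_pool
        # use 4 network threads instead of the default single one
        net_workers = 4
    }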
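The three behaviours can be summarised in configuration terms as follows (only one of the lines would be used in a real setup; the commented alternatives are shown for comparison):

    searchd
    {
        net_wait_tm = 1    # default: busy loop ticks every 10 * 1 = 10 ms
        # net_wait_tm = 0    # continuous busy loop, extra CPU usage even when idle
        # net_wait_tm = -1   # busy loop disabled
    }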
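A hedged example of throttling for a high-load scenario might look like this (the values are arbitrary illustrations, as no limits are enforced by default):

    searchd
    {
        # accept at most 100 new clients per iteration of the network thread
        net_throttle_accept = 100
        # process at most 500 requests per iteration
        net_throttle_action = 500
    }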
An aspect that is sometimes misunderstood is the relation between max_children and dist_threads. dist_threads is responsible for multi-threading certain operations, such as searching a local distributed index, building snippets with load_files, or the CALL PQ command. However, these are sub-processing threads and they do not count against the max_children directive, which counts only the main query threads.
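For example, in a configuration like the sketch below (values chosen only for illustration), max_children limits the main query threads, while dist_threads controls how many additional sub-threads a single query may use for those parallelizable operations:

    searchd
    {
        workers      = thread_pool
        max_children = 32   # main query threads only
        dist_threads = 4    # sub-threads per query for local distributed indexes,
                            # snippets with load_files or CALL PQ; not counted by max_children
    }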