Recently we’ve reworked our system threads. Since the fork from Sphinx, searchd had all its system tasks implemented in a ‘wake up often and check’ style: each service ran in a dedicated thread that woke up every 50ms to check whether it had anything to do. That means that even an idle daemon woke up 20 times a second per task ‘just to check’, and with 6 such tasks that adds up to 120 wakeups a second, which became noticeable especially for clients on Amazon AWS, where CPU usage is metered. Those internal tasks were:
- plain index rotation
- pinging an agent
- index preload
- flushing index attributes
- flushing a RealTime index
- and flushing binary logs
All of them are quite rare (an RT flush, for example, may happen once every 10 hours), so checking for them 20 times a second is just a waste of CPU. Also, none of these tasks is continuous by nature (like appending text to a huge log); they are just periodic actions, repeating the same job over and over. With this kind of behaviour in mind, we’ve completely rewritten the whole thing:
- First, we’ve added a single thread that serves timers, on which every service action is scheduled.
- Second, we’ve added a thread pool to perform the actions themselves.
- So, finally, a service action (task) is scheduled; when its timer fires, it is moved to the thread pool and executed there.
- On finishing, it is deleted, so periodic tasks simply reschedule themselves at the end, producing a brand-new task each time.
With this approach there is no bunch of dedicated service threads anymore, just a single timer thread. And it, in turn, does not wake up at a fixed interval: it keeps a binary heap of timeouts and wakes up only when the earliest timer in the queue expires. The thread pool, in turn, may run up to 32 threads in parallel, but in practice only one, sometimes two, are actually in use. Each worker thread has a predefined idle period (10 minutes) after which it simply terminates, so in the ‘no tasks’ case nothing runs idle at all (even the timer thread is initialized lazily, i.e. it is started only when an actual task needs to be scheduled). If tasks are very infrequent (periods longer than 10 minutes), the worker pool is also torn down, so worker threads exist only when there is something to do. All the dedicated service threads are thus gone and no longer poke the CPU over a hundred times a second.
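To make this concrete, here is a minimal sketch of the scheme under the assumptions above; the names (`TaskScheduler`, `Schedule` and so on) are illustrative and this is not the actual Manticore code. One lazily started timer thread sleeps on a min-heap of deadlines and hands due tasks over to workers; a detached thread stands in for the real pool:

```cpp
// Minimal sketch (illustrative names, not the actual Manticore sources):
// one lazily started timer thread sleeps on a min-heap of deadlines and
// hands due tasks over to workers.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct ScheduledTask {
    Clock::time_point m_tmDeadline;
    std::function<void()> m_fnAction;
    bool operator> (const ScheduledTask& tOther) const { return m_tmDeadline > tOther.m_tmDeadline; }
};

class TaskScheduler {
    // min-heap by deadline: the top element is always the next timeout
    std::priority_queue<ScheduledTask, std::vector<ScheduledTask>, std::greater<ScheduledTask>> m_dTimers;
    std::mutex m_tLock;
    std::condition_variable m_tWake;
    std::thread m_tTimerThread;     // started lazily, on the first Schedule() call
    bool m_bStarted = false;
    bool m_bShutdown = false;

    void TimerLoop() {
        std::unique_lock<std::mutex> tLk(m_tLock);
        while (!m_bShutdown) {
            if (m_dTimers.empty()) {
                m_tWake.wait(tLk);  // nothing scheduled: sleep with no timeout at all
                continue;
            }
            // sleep exactly until the earliest deadline (or until a new task arrives)
            if (m_tWake.wait_until(tLk, m_dTimers.top().m_tmDeadline) == std::cv_status::timeout) {
                auto fnAction = m_dTimers.top().m_fnAction;
                m_dTimers.pop();
                tLk.unlock();
                std::thread(fnAction).detach(); // hand the job to a worker
                tLk.lock();
            }
        }
    }

public:
    void Schedule(Clock::duration tmDelay, std::function<void()> fnAction) {
        std::lock_guard<std::mutex> tLk(m_tLock);
        m_dTimers.push({ Clock::now() + tmDelay, std::move(fnAction) });
        if (!m_bStarted) {          // lazy start: no timer thread until the first task
            m_tTimerThread = std::thread([this] { TimerLoop(); });
            m_bStarted = true;
        }
        m_tWake.notify_one();       // the earliest deadline may have changed
    }

    ~TaskScheduler() {
        { std::lock_guard<std::mutex> tLk(m_tLock); m_bShutdown = true; }
        m_tWake.notify_one();
        if (m_bStarted)
            m_tTimerThread.join();
    }
};
```

A real periodic task would simply call `Schedule()` on itself again at the end of its run, and a real pool would keep a few reusable worker threads (with the 10-minute idle timeout) instead of detaching one per job.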
The most promising change with this approach is the new behaviour of the ‘ping’ task. In the past we collected all agent hosts and, on every ping interval, issued the ‘ping’ command to them as one bunch. So, if one host was slow, the whole bunch was slow too. It also had nothing to do with the actual state of the hosts: when you load a host with queries, there is no need to ping it separately, since the queries themselves already provide comprehensive statistics about the host’s state. Now the ping is per host: it is planned for each host separately and is tied to that host’s actual last_answer_time. If a host is slow, only its own ping task will wait for it; the others will run normally at their own time. If a host is under load, its last_answer_time is continuously updated, so no actual ping is sent as long as some query has already answered within the last ping_interval.
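Here is a rough sketch of that per-host logic, building on the hypothetical `TaskScheduler` above; `AgentHost`, `SendPing` and `SchedulePing` are again illustrative names, not the actual internals:

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct AgentHost {
    Clock::time_point m_tmLastAnswer = Clock::now(); // bumped by every successful query as well
};

// Stand-in for the real network ping; here it only refreshes the timestamp.
void SendPing(AgentHost& tHost) {
    tHost.m_tmLastAnswer = Clock::now();
}

void SchedulePing(TaskScheduler& tSched, AgentHost& tHost, Clock::duration tmInterval) {
    // wake up exactly when this host's next ping becomes due
    auto tmDelay = tHost.m_tmLastAnswer + tmInterval - Clock::now();
    tSched.Schedule(tmDelay, [&tSched, &tHost, tmInterval] {
        // if a regular query has answered within the interval, skip the real ping
        if (Clock::now() - tHost.m_tmLastAnswer >= tmInterval)
            SendPing(tHost);
        SchedulePing(tSched, tHost, tmInterval); // each host reschedules only itself
    });
}
```

A slow host therefore only delays its own ping lambda, while the timers of the other hosts keep firing on schedule.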
Another new feature is that tasks can work in parallel. For example, index flushing may be performed on any number of indexes at the same time, not serially but in parallel. For now this number is set to 2 jobs, but that is just a matter of tuning. Also, when several similar tasks are scheduled, we can now limit their number. Say, it makes no sense to schedule ‘malloc_trim’ more than once, so it is a kind of ‘singleton’: if one instance is already scheduled, another attempt to schedule it will be dropped.
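A small sketch of how such per-task limits might look (illustrative names only, not the real code): each task type declares how many instances may run at once and how many may sit in the queue, and anything beyond that is dropped and counted:

```cpp
#include <atomic>
#include <cstdio>

struct TaskProps {
    int m_iMaxRunners = 1;            // e.g. 2 for parallel index flushes
    int m_iMaxQueued = 1;             // 1 makes the task a 'singleton' like malloc_trim
    std::atomic<int> m_iQueued {0};
    std::atomic<int> m_iDropped {0};  // the kind of counter 'debug tasks' reports as dropped
};

// Returns false (and counts a drop) when the queue for this task type is already full.
bool TryEnqueue(TaskProps& tProps) {
    int iQueued = tProps.m_iQueued.load();
    while (iQueued < tProps.m_iMaxQueued)
        if (tProps.m_iQueued.compare_exchange_weak(iQueued, iQueued + 1))
            return true;              // slot reserved, the task will run later
    tProps.m_iDropped.fetch_add(1);
    return false;
}

int main() {
    TaskProps tMallocTrim;            // defaults: at most one queued instance
    std::printf("first:  %d\n", TryEnqueue(tMallocTrim));  // 1 - accepted
    std::printf("second: %d\n", TryEnqueue(tMallocTrim));  // 0 - dropped, one is already queued
}
```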
The next feature that follows from this task management is that all the tasks are now scheduled in a single queue (rather than in a bunch of different threads), and we know exactly when each of them will run. So these statistics can now be displayed, which is done with the new ‘debug sched’, ‘debug tasks’ and ‘debug systhreads’ commands.
The first one shows the timer’s binary heap: the topmost value indicates the next timeout and the task associated with it; the other values follow (for now they are displayed as the raw binary heap; they may be sorted if such a need arises).
‘debug tasks’ shows all the tasks registered in the daemon, with their statistics and properties: how many such tasks may run in parallel, how many may be scheduled, how many are currently being executed, how much time has been spent on the task, when it last finished, how many times it was executed, how many were dropped because of queue overflow, and how many are currently enqueued.
Finally, ‘debug systhreads’ displays the state of the worker threads: internal id, thread id, last run time, total CPU time, total number of ticks and jobs done, how long the last job took, and how long the worker has been idle. As mentioned, idling for 10 minutes causes a worker to stop.
Upgrade to Manticore 3.1.0 or a newer version to benefit from these changes.