About startup, mmap, mlock and --force-preread

As mentioned in the article “Indexes load at startup”, the indexes (attribute and word list files) are no longer physically loaded into RAM, but memory-mapped instead. This makes them come up much faster on startup, but it also has some side effects I want to explain.

First of all, because we use mapping, the indexes are not necessarily locked in physical RAM permanently, and you are not required to have enough RAM for all of them to fit. A reasonable amount of RAM may already give you quick search queries in many cases, since cached index pages are served significantly faster.

The second consequence is that a memory mapping occupies just a region of the process’s virtual address space. Since any modern system gives you 64 bits for addresses, we can load and serve an index of virtually any size, regardless of the actual free RAM. Note, however, that this applies only to the index data. The Manticore search daemon also needs physical RAM for its usual work: internal hashes, buffers, arrays, etc.

If you look at the memory stats of the process, you’ll see a number in the RSS (or RES) column: that is the actually occupied RAM (mostly the heap), not the loaded indexes (unless you set mlock=1). The mapped indexes are mostly reflected in the ‘VSZ’ column. Also, if you load a huge index (about the size of the whole RAM) and then run the ‘free’ command, you’ll see that it is practically not reflected in the ‘used’ column, but mostly in ‘buff/cache’, and therefore also in ‘available’.
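For example, with standard Linux tools (the process name searchd is how the daemon usually appears; adjust if yours differs):

```bash
# RSS = actually resident memory (the heap, plus any mlocked pages);
# VSZ = the whole mapped virtual address space, including the index maps
ps -o pid,rss,vsz,comm -C searchd

# System-wide view: pages of mapped index files are accounted under
# 'buff/cache', not 'used', so they also count towards 'available'
free -h
```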

So by default, loaded indexes are NOT locked in memory, just cached. If the OS needs to allocate more RAM for other processes, it will do so, sacrificing the cached data. So ‘loading’ an index provides no guarantee that it actually stays in RAM and will respond predictably fast.

What does this mean in practice?

  1. First, there are no guarantees by default. ‘Loading’ an index at start means memory-mapping it and then stepping page by page through that map, which just pulls in the page you step on. The OS does not guarantee that, as ‘loading’ moves on to the next page, the previous one stays persistently in RAM. It may stay: if, say, you have 128GB of free RAM and the loaded index is only 30GB, it likely will. But if your index is 120GB and you have only 16GB of RAM, ‘loading’ will succeed in exactly the same way, yet since the index cannot fit into RAM, it will not be fully cached and its response time will increase.
  2. Second, there is no guarantee that the loaded index will keep the same response time permanently. Imagine, again, that you load a 30GB index on a system with 50GB of free RAM, and everything seems fast. But then you also start another RAM-greedy process, and it occupies 40GB. That means that of the 30GB of your index only ~10GB will still be cached, and access to the rest now requires reading from disk. (One way to check how much of an index is actually resident is shown after this list.)
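If you want to verify rather than guess, one option (not part of Manticore; it assumes the third-party vmtouch utility is installed, and the index path is just a placeholder) is to ask the page cache directly:

```bash
# Shows, per file, how many pages are currently resident in the page cache
vmtouch -v /var/data/myindex.*
```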

So neither the lazy preread nor even the --force-preread option guarantees that the whole index is cached and will permanently respond predictably fast. There is no guarantee, only probability: the more RAM you have, the higher the probability that the whole index is cached and will respond maximally fast. All this mmap “massage” is just about probability.

But I need a guarantee, not a probability! Is that possible?

Yes! The one (and only) way to reliably lock a whole index in RAM is the mlock option. It has to be set in the index config (not among the command line options). It requires you to have the privileges to do so (see the system ‘man mlock’ for details). How does it work? The daemon mmaps the index files and then calls ‘mlock’ on them. At that moment the OS checks whether it has enough RAM to hold all the requested mappings, and if so, it immediately loads them. That can be a relatively long operation (take the speed of your storage and estimate how long it needs to read the required amount of data sequentially).
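In the config it could look like this (a minimal sketch; the index name and path are placeholders, and the other index settings are omitted):

```
index myindex
{
    # ... source and other usual index settings ...
    path  = /var/data/myindex
    mlock = 1
}
```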

Thus we can achieve the goal: an index fully locked in RAM which responds predictably fast. That is good.

But it is also necessary to mention a few things with regard to mlock.

  1. First, as mentioned, you need privileges to use it; see the sketch after this list for checking and raising the limit. This partly comes from the way it works and from its potential impact on the whole system. It’s not a big deal in most cases, though, unless you’re on shared hosting with very limited permissions.
  2. Second, caching a map (mlocking) is a blocking operation we can’t manage. Internally we just call mlock(); it does its magic and returns a few seconds or minutes later, once everything is done. There is no way to interrupt it and no way to throttle its I/O; you just wait. So the mlocking process may affect other I/O operations on the machine.
  3. Third, when the system looks for RAM to satisfy mlock, chances are it will invoke the OOM killer to free RAM for you, which can kill another process. Be aware!
  4. Even if you use mlock, you may still want to use --force-preread in many cases. The dilemma is:
    • without --force-preread, searchd will start serving connections sooner, but the indexes will stay colder until they are fully preread in the background. This may be bad for the incoming queries.
    • with --force-preread, you will have to wait (perhaps a few minutes), but after that you will be able to provide very good performance right away.
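As for the privileges, on Linux the relevant knob is the locked-memory limit (a sketch; the user and service names are placeholders):

```bash
# Current per-process limit on locked memory (RLIMIT_MEMLOCK)
ulimit -l

# One way to raise it permanently, via /etc/security/limits.conf:
#   manticore  soft  memlock  unlimited
#   manticore  hard  memlock  unlimited

# If searchd runs under systemd, a unit override achieves the same:
#   [Service]
#   LimitMEMLOCK=infinity
```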

Here’s what it can look like with mlock but without --force-preread:

[Graph: no_force_preread]

And the same indexes on the same hardware with --force-preread:

[Graph: force_preread]

As you can see, in this case it makes sense to wait 6 minutes at start: otherwise the average response time stays several times higher for tens of minutes, and the iowait is extremely high too, since the queries cause random disk reads. Of course, there may be other cases: your load balancing may handle situations like this in a smarter way, you may simply not have enough RAM to fit the whole index, or your queries may be lighter. Just consider both approaches and choose the one that fits you best.

What else may be important?

  1. Play with OS parameters like ‘swappiness’, or disable swapping altogether if you can afford it. This can help increase the probability of a fast response (without mlocking). Note that on modern Linux kernels you have such a wonderful thing as control groups (aka cgroups): you can put the daemon into a dedicated cgroup and tune system settings (like the mentioned swappiness) for it alone, without touching the global configuration; see the sketch after this list.
  2. Modern SSDs are quite fast even for random access, so using them may blur the difference between merely ‘mapped’ and actually ‘cached’ (or ‘mlocked’) data.
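For example (a sketch; the sysctl value, cgroup name and PID are placeholders, and the cgroup paths assume the v1 memory controller):

```bash
# Lower the kernel's eagerness to swap, system-wide (0..100, default is 60)
sudo sysctl vm.swappiness=10

# Or tune it for the daemon alone via a dedicated cgroup (cgroup v1):
sudo mkdir /sys/fs/cgroup/memory/manticore
echo 0 | sudo tee /sys/fs/cgroup/memory/manticore/memory.swappiness
echo "$SEARCHD_PID" | sudo tee /sys/fs/cgroup/memory/manticore/cgroup.procs
```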
