Sphinx 3 vs Manticore: performance benchmark

Published: Mar 18, 2018

[UPDATE] An updated benchmark is available here.

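The long-awaited Sphinx 3 was recently released and then updated in 3.0.2. It brings document storage, A-indexes and pre-indexed snippets and, unfortunately, is no longer open source (at least for now, in March 2018).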

These are all very nice features, but do you care how much they affect Sphinx 3 performance, and how that performance compares with Manticore's? So do we!

To find out, we ran a benchmark that measures:

indexing time
the maximum throughput Sphinx 3 and Manticore Search 2.6.2 can deliver
the minimum latency each of them can deliver

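The test is based on the following:

luceneutil, to generate the data for the indexes and the query sets
lucene2manticore, to convert the data from Lucene format to Manticore Search / Sphinx format
stress-tester, to run the benchmark
Server: 8xIntel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz, 64G RAM, HDD
OS: Ubuntu 16.04.3 LTS, kernel 4.8.0-45-generic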

The results are as follows:

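[Figures: ms_vs_s3_throughput, ms_vs_s3_latency]

In all test scenarios, Sphinx 3 takes noticeably longer to index and performs significantly worse in both throughput and latency. We tend to believe this may be caused by a compilation issue (again, Sphinx 3 is not open source, so we can't recompile it) or by some general performance leak that could be debugged and fixed if the source code were available. It would be a great pity if the new features themselves degraded performance this much. Either way, we want to warn all Manticore and Sphinx users that migrating to Sphinx 3 may bring a performance drop.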

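If you get different results when migrating to Sphinx 3 or when comparing Manticore with Sphinx 3, please let us know. It would be great to learn in which cases performance does not degrade.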

If you need our help, please contact us.

How you can reproduce the benchmark

Note that downloading and preparing the data may take several hours.

Install the supplementary tools mentioned above and prepare the configuration and stopword files:

mkdir data
mkdir q

git clone http://github.com/mikemccand/luceneutil.git
git clone http://github.com/manticoresoftware/lucene2manticore
git clone http://github.com/Ivinco/stress-tester

cp lucene2manticore/*.conf ./

Install the Manticore Search and Sphinx 3 binaries.

Get and prepare the source data:

cd luceneutil
python src/python/setup.py -download
cd ../data/
xzcat enwiki-20120502-lines-1k.txt.lzma > lucene.tsv

Convert the data from the Lucene TSV-like format to a proper TSV format that can be used as a data source for Manticore Search and Sphinx:

cd ..
python lucene2manticore/lucene2tsv.py data/lucene.tsv --maxlen 2097152 > data/lc.tsv
head -n 100000 data/lc.tsv > data/lc100k.tsv
head -n 300000 data/lc.tsv > data/lc300k.tsv
head -n 1000000 data/lc.tsv > data/lc1m.tsv
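
The three head commands above can also be written as one loop, which makes it easier to add another subset size later. The following is only a sketch: make_subsets is a hypothetical helper name, and the LABEL=LINES argument format is an assumption made for this example, not something lucene2manticore provides.

```shell
# Cut fixed-size prefixes of a TSV file, mirroring the head commands above.
# Usage: make_subsets SRC OUT_DIR LABEL=LINES...
# (make_subsets is a hypothetical helper, not part of any tool used here.)
make_subsets() {
    src=$1
    out_dir=$2
    shift 2
    for spec in "$@"; do
        label=${spec%%=*}   # text before '=' (for example 100k)
        lines=${spec#*=}    # text after '='  (for example 100000)
        head -n "$lines" "$src" > "$out_dir/lc$label.tsv"
        # Report the real size: it is smaller than requested if SRC is short.
        printf '%s\t%s\n' "lc$label.tsv" \
            "$(wc -l < "$out_dir/lc$label.tsv" | tr -d ' ')"
    done
}
```

For the subsets used in this benchmark the call would be: make_subsets data/lc.tsv data 100k=100000 300k=300000 1m=1000000. The printed line counts give a quick check that the source file was long enough for every subset.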

Prepare the queries:

python lucene2manticore/lucene2query.py --types simple data/wikimedium500.tasks > q/q-wiki500-simple.txt
python lucene2manticore/lucene2query.py --types ext2 data/wikimedium500.tasks > q/q-wiki500-ext2.txt
python lucene2manticore/lucene2query.py --types simple luceneutil/tasks/wikimedium.10M.datefacets.nostopwords.tasks > q/q-wiki10m-simple.txt
python lucene2manticore/lucene2query.py --types ext2 luceneutil/tasks/wikimedium.10M.datefacets.nostopwords.tasks > q/q-wiki10m-ext2.txt
python lucene2manticore/lucene2query.py --types simple luceneutil/tasks/wikimedium.1M.nostopwords.tasks > q/q-wiki1m-simple.txt
python lucene2manticore/lucene2query.py --types ext2 luceneutil/tasks/wikimedium.1M.nostopwords.tasks > q/q-wiki1m-ext2.txt
cat q/q-wiki*-simple.txt > q/q-simple.txt
cat q/q-wiki*-ext2.txt > q/q-ext2.txt
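
Since the merged query files drive every test below, it may be worth confirming that nothing was lost when concatenating them. The sketch below compares line counts only; count_lines and check_merged are hypothetical helper names introduced for this example.

```shell
# Print the total number of lines across the given files.
count_lines() {
    cat "$@" | wc -l | tr -d ' '
}

# Succeed only if MERGED has exactly as many lines as all PARTS together.
# Usage: check_merged MERGED PART...
check_merged() {
    merged=$1
    shift
    [ "$(count_lines "$merged")" -eq "$(count_lines "$@")" ]
}
```

Example: check_merged q/q-simple.txt q/q-wiki*-simple.txt && echo "q-simple.txt is complete". A mismatch usually means the merged file is stale or was truncated while being written.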

Prepare the stopwords:

indexer -c lucene2manticore/sphinx3.conf i2_1m_no_stopwords --buildstops stopwords1k.txt 1000
head -100 stopwords1k.txt > stopwords.txt

Index the data and note the time it takes:

/path/to/manticore/indexer -c lucene2manticore/manticore.conf --all
/path/to/sphinx3/indexer -c lucene2manticore/sphinx3.conf --all
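
To record the indexing time in a form that is easy to compare, each run can be wrapped in a small timer. This is a minimal sketch with whole-second resolution: time_step is a hypothetical helper, and the /path/to/... indexer locations in the usage note are placeholders following the convention used for searchd in this article.

```shell
# Run a command and print "LABEL<TAB>elapsed seconds" on stdout.
# The command's own output goes to stderr so stdout holds only the timing.
# Returns the command's exit status.
# (time_step is a hypothetical helper, not part of indexer or stress-tester.)
time_step() {
    label=$1
    shift
    start=$(date +%s)
    "$@" >&2
    status=$?
    end=$(date +%s)
    printf '%s\t%s\n' "$label" "$((end - start))"
    return "$status"
}
```

Example: time_step manticore /path/to/manticore/indexer -c lucene2manticore/manticore.conf --all >> index-times.tsv, then the same with the sphinx3 label, binary and config. The file name index-times.tsv is arbitrary.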

Start the search daemons:

/path/to/manticore/searchd -c lucene2manticore/manticore.conf
/path/to/sphinx3/searchd -c lucene2manticore/sphinx3.conf
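
Before sending any queries it helps to confirm that both daemons are actually listening. The sketch below polls a TCP port once per second; wait_for_port is a hypothetical helper, it relies on bash's /dev/tcp redirection (so it should be run with bash), and the ports 8306 and 7406 are the ones used by the test commands in this article.

```shell
# Wait until HOST:PORT accepts TCP connections, or give up after TIMEOUT
# seconds (default 30). Returns 0 once the port is open, 1 on timeout.
# (wait_for_port is a hypothetical helper; /dev/tcp is a bash feature.)
wait_for_port() {
    host=$1
    port=$2
    timeout=${3:-30}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # The subshell opens a connection and closes it again on exit.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Example: for p in 8306 7406; do wait_for_port 127.0.0.1 "$p" 60 || echo "port $p is not up" >&2; done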

Warm up the servers

It is worth warming up the search daemons before testing, for example:

cd stress-tester
for q in simple ext2; do for p in 8306 7406; do ./test.php --plugin=plain.php --data=../q/q-$q.txt -b=100 -c=8 --port=$p --index=i2_100k_stopwords_100 --maxmatches=100 --csv; done; done;

Throughput test cases

We now know how long indexing takes (see the indexing step above). Let's see how much throughput Sphinx 3 and Manticore Search can deliver.

Simple queries against the 100K-document index, with the top 100 stopwords:

for port in 7406 8306; do for c in 1 4 6 8 12; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-simple.txt -b=$batchSize -c=$c --port=$port --index=i2_100k_stopwords_100 --maxmatches=1000 --csv; done; done; done

Simple queries against the 100K-document index, with the top 1000 stopwords:

for port in 7406 8306; do for c in 1 4 6 8 12; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-simple.txt -b=$batchSize -c=$c --port=$port --index=i2_100k_stopwords_1k --maxmatches=1000 --csv; done; done; done

Complex queries against the 100K-document index, with the top 100 stopwords:

for port in 7406 8306; do for c in 1 4 6 8 12; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-ext2.txt -b=$batchSize -c=$c --port=$port --index=i2_100k_stopwords_100 --maxmatches=1000 --csv; done; done; done

Complex queries against the 100K-document index, with the top 1000 stopwords:

for port in 7406 8306; do for c in 1 4 6 8 12; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-ext2.txt -b=$batchSize -c=$c --port=$port --index=i2_100k_stopwords_1k --maxmatches=1000 --csv; done; done; done

Simple queries against the 100K-document index, with the top 100 stopwords and morphology enabled:

for port in 7406 8306; do for c in 1 8; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-simple.txt -b=$batchSize -c=$c --port=$port --index=i2_100k_stopwords_100_morphology --maxmatches=1000 --csv; done; done; done

Simple queries against the 1M-document index, with the top 100 stopwords:

for port in 7406 8306; do for c in 1 8; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-simple.txt -b=$batchSize -c=$c --port=$port --index=i2_1m_stopwords_100 --maxmatches=1000 --csv; done; done; done

Complex queries against the 1M-document index, with the top 100 stopwords:

for port in 7406 8306; do for c in 1 8; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-ext2.txt -b=$batchSize -c=$c --port=$port --index=i2_1m_stopwords_100 --maxmatches=1000 --csv; done; done; done

Simple queries against the 1M-document index, with the top 100 stopwords and morphology enabled:

for port in 7406 8306; do for c in 1 8; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-simple.txt -b=$batchSize -c=$c --port=$port --index=i2_1m_stopwords_100_morphology --maxmatches=1000 --csv; done; done; done

Simple queries against the 1M-document index, with the top 100 stopwords, filtering by an attribute to skip half of the documents:

for port in 7406 8306; do for c in 1 8; do for batchSize in 1 100; do ./test.php --plugin=plain.php --data=../q/q-simple.txt -b=$batchSize -c=$c --port=$port --index=i2_1m_stopwords_100 --maxmatches=1000 --filter='ts<1199141654' --csv; done; done; done
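
All of the test cases above share the same loop over port, concurrency and batch size, so they can be driven by one function. This is only a sketch under stated assumptions: run_matrix and the RUNNER variable are hypothetical names introduced here, while the test.php options are copied from the commands above. Setting RUNNER=echo prints the argument lists instead of running anything, which is a cheap way to review the matrix first.

```shell
# Run test.php for every combination of port, concurrency and batch size.
# Usage: run_matrix QUERY_FILE INDEX "CONCURRENCIES" [EXTRA_ARGS...]
# RUNNER (hypothetical) selects the program to call; default is ./test.php.
run_matrix() {
    queries=$1
    index=$2
    concurrencies=$3
    shift 3
    for port in 7406 8306; do
        for c in $concurrencies; do   # unquoted on purpose: split on spaces
            for batch in 1 100; do
                "${RUNNER:-./test.php}" --plugin=plain.php --data="$queries" \
                    -b="$batch" -c="$c" --port="$port" --index="$index" \
                    --maxmatches=1000 "$@" --csv
            done
        done
    done
}
```

For example, run from the stress-tester directory, the last case above becomes: run_matrix ../q/q-simple.txt i2_1m_stopwords_100 "1 8" --filter='ts<1199141654'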