Tatoeba

About Tatoeba

Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members.

Tatoeba provides a tool for you to see examples of how words are used in the context of a sentence. You specify the words that interest you, and it returns sentences containing these words along with their translations in the languages you choose. The name Tatoeba (Japanese for "for example") captures this concept.

The project was founded by Trang Ho in 2006 and hosted on SourceForge under the codename multilangdict.

The challenge

We had known about Manticore since November 2017, but it took us a while to actually migrate. We were using Sphinx, but lately it had been crashing quite often, leaving our homepage completely broken (#1767).

Why Manticore?

A long time ago (2010) we were using Lucene and decided to switch to Sphinx due to memory restrictions. Before switching to Manticore we had a quick look at other solutions, like Elasticsearch, but rewriting all the search-related code would have been a big effort. While Elasticsearch has a lot of fancy features, our data is pretty "flat" (sentences with metadata) and Manticore just fit in.

Outcome

From #1767: “we now only have some quick performance drops, instead of a continuous failure. In addition, it looks like the search daemon does not block any more when this happens, so the page will just be slow or failing for a few visitors.”
Search speed (and our overall website speed) also seems to have improved.

We serve between 220K and 280K searches per month, or roughly 7.5K to 10K per day.

Trang Ho & Gilles Bedel

Tatoeba

Training

Personal and team training to maximize performance.

Custom development

Need some custom or individual features?

Fill in the form and don't forget to describe what you need.

Free config review

There are often optimizations that can be made to a Sphinx / Manticore setup by changing some simple directives in the configuration or making quick changes to an index definition.

Some common mistakes and issues can include:

  • using a main+delta scheme without kill-lists, even though the delta includes updated records that also exist in the main index
  • using wildcard searches with a very short prefix/infix length, which can hammer performance in some cases
  • unintentionally disabling seamless rotation and getting stalls on index rotations
  • adding texts as string attributes even if they are not used for any kind of operation (filtering, grouping, sorting) and do not have to be present in results
  • using deprecated settings
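
As a rough illustration of the checks listed above, here is a minimal Python sketch that scans a configuration file for some of these patterns. It is not the review process itself, only a heuristic: it assumes the classic Sphinx-style directives `sql_query_killlist`, `min_prefix_len`, `min_infix_len`, `seamless_rotate` and `sql_attr_string`, treats an inherited block such as `source delta : main` as a hint of a main+delta scheme, and uses an arbitrary threshold of 3 characters for "very short". The function name `review_config` and the sample configuration (table and index names included) are hypothetical.

```python
import re

# Hypothetical sample configuration, used only to exercise the checks below.
SAMPLE = """
source main
{
    sql_query = SELECT id, text FROM sentences
    sql_attr_string = text
}
source delta : main
{
    sql_query = SELECT id, text FROM sentences WHERE updated = 1
}
index main_idx
{
    source = main
    min_infix_len = 1
}
searchd
{
    seamless_rotate = 0  # rotation is not seamless
}
"""

def review_config(text, min_affix_len=3):
    """Return heuristic warnings for a Sphinx/Manticore config given as text."""
    warnings = []
    # Strip '#' comments and surrounding whitespace from every line.
    lines = [ln.split("#", 1)[0].strip() for ln in text.splitlines()]

    # Collect "directive = value" pairs; a directive may appear in several blocks.
    settings = {}
    for ln in lines:
        m = re.match(r"(\w+)\s*=\s*(.*)", ln)
        if m:
            settings.setdefault(m.group(1), []).append(m.group(2).strip())

    # Assumption: an inherited block ("source delta : main") means main+delta.
    inherits = any(re.match(r"(source|index)\s+\w+\s*:\s*\w+", ln) for ln in lines)
    if inherits and "sql_query_killlist" not in settings:
        warnings.append("main+delta layout without sql_query_killlist")

    # A length of 0 means the feature is off, so only 1..min_affix_len-1 is flagged.
    for key in ("min_prefix_len", "min_infix_len"):
        for value in settings.get(key, []):
            if value.isdigit() and 0 < int(value) < min_affix_len:
                warnings.append(f"{key} = {value} is very short")

    if "0" in settings.get("seamless_rotate", []):
        warnings.append("seamless_rotate is disabled")

    # String attributes cannot be judged automatically; list them for review.
    for value in settings.get("sql_attr_string", []):
        warnings.append(f"string attribute '{value}': check that it is really needed")

    return warnings

if __name__ == "__main__":
    for warning in review_config(SAMPLE):
        print(warning)
```

A script like this can only flag candidates. Whether a string attribute is needed or a short infix length is acceptable depends on the data and the queries, which is why a manual review is still useful.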

A quick look at the configuration can reveal actual or potential issues, which is why we want to offer this review as a gift to our growing community!

Before uploading your configuration file, we recommend removing any database credentials.

We also suggest that you give as many details as possible about your setup: how big your data is, what your typical queries look like, and what issues you are experiencing.

Contact us