Let's Meet PubChem - the largest free chemical informatics website in the world
The PubChem group works within the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine (NLM), a branch of the United States National Institutes of Health (NIH). PubChem is the largest free chemical informatics website in the world and has detailed information from 741 Data Sources for over 103M chemical compounds, 254M Substances, 269M Bioactivities, 31M Literature, 3M Patents, 1M BioAssays and more.
Manticore is used for full-text search across all collections (chemical compound, chemical substance, biological assay, patent, PubMed, protein, gene, taxonomy, disease, literature, pathway, pathway reaction, …), about 10TB of data in total. The team first tried Solr, but it did not scale with their data growth. Sphinx/Manticore turned out to be the lightweight but powerful search engine that suits their needs exactly.
Siqian He, US National Institutes of Health: “We cannot achieve such success without the Sphinx/Manticore search engine! Thank you all for making such a powerful search engine!”
Solutions and results
- Using the C++ Sphinx client library to run search queries.
- Deployment of search autocomplete functionality - a feature where the application predicts the rest of a word the user has not finished typing. The user can type just the first 2 or 3 letters of a compound name and see a list of suggestions.
- Deployment of faceting - this lets users filter search results by different properties of the matched items.
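
To make the last two items concrete, here is a minimal in-memory sketch of prefix autocomplete and facet counting. It is plain Python for illustration only: it is not PubChem's code and does not use the Sphinx/Manticore API. The sample records, the field names (`name`, `type`) and the helper functions are all hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical sample records; real PubChem collections are far larger
# and are served by the search engine, not by an in-memory list.
RECORDS = [
    {"name": "aspirin", "type": "compound"},
    {"name": "aspartame", "type": "compound"},
    {"name": "aspartic acid", "type": "compound"},
    {"name": "atropine", "type": "compound"},
    {"name": "aspirin assay", "type": "bioassay"},
]


def build_name_index(records):
    """Return a sorted list of lowercase names for prefix lookups."""
    return sorted(r["name"].lower() for r in records)


def autocomplete(sorted_names, prefix, limit=5):
    """Return up to `limit` names that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search finds the first name that is >= prefix.
    start = bisect_left(sorted_names, prefix)
    suggestions = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order: no later name can match
        suggestions.append(name)
        if len(suggestions) == limit:
            break
    return suggestions


def facet_counts(records, field):
    """Count how many records fall under each value of `field`."""
    return Counter(r[field] for r in records)


def filter_by(records, field, value):
    """Keep only the records whose `field` equals `value`."""
    return [r for r in records if r[field] == value]


if __name__ == "__main__":
    index = build_name_index(RECORDS)
    print(autocomplete(index, "asp"))
    print(facet_counts(RECORDS, "type"))
    print(filter_by(RECORDS, "type", "bioassay"))
```

In this sketch, typing "asp" yields suggestions from a sorted name list, and the facet counts tell the user how many hits each `type` value would keep before a filter is applied. A production search engine computes both over its indexes rather than over a Python list.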