# Vector Search On GitHub

This article presents a prototype that enhances GitHub's search functionality using semantic search technology with Manticore Search's Vector Search. It highlights the limitations of traditional keyword searches and introduces vector search as a more effective method that understands context and meaning, rather than just matching keywords. The piece outlines the setup and benefits of using vector search to navigate GitHub repositories more efficiently, showcasing examples of improved search accuracy and relevance.

## Introduction

GitHub's search function can sometimes struggle, especially when you try to search by asking a direct question. This approach often leads to unrelated results, which can be frustrating. This issue is more noticeable when searching through issues or pull requests, where the details really matter.

Let's look at an example:

![GitHub Search vs Manticore Semantic Search on GitHub](./github-semantic-search/github-vs-manticore-semantic-search.png)

GitHub's search has some limits, but the world of search technology is improving fast. Semantic search, which understands the context and meaning behind words, not just the words themselves, is becoming more popular. While GitHub hasn't added this feature yet, it could really help make searches better and more relevant.

With this in mind, we've built a [project](https://github.manticoresearch.com) that uses **semantic search** to help developers find things more easily in their repositories. We use [Manticore Search](https://manticoresearch.com), which supports Vector Search, to offer a customizable **semantic search** option that fits different needs. The project shows how useful this technology can be in practice.

FYI, [Manticore Search](https://manticoresearch.com) is a powerful open-source search engine that has stood the test of time, with roots stretching back to 2001. Originally known as Sphinx, it served as a full-text search solution for MySQL and PostgreSQL databases. The project took on a life of its own when, in 2017, it was forked and reborn as **Manticore Search**, continuing to evolve as an independent, fully open-source search engine.

## What is Semantic Search?

**Semantic search** is a search technique that goes beyond just matching keywords. It tries to understand the meaning and context of words to improve search accuracy, considering both the user's intent and the contextual meaning of terms.

### Benefits

- **Context Understanding:** It interprets the context of queries to deliver precise results.
- **Improved Accuracy:** It reduces irrelevant results by understanding user intent.
- **Enhanced User Experience:** It saves time and effort by quickly providing relevant information.

## The Problem with Traditional Search on GitHub

When you search using simple keywords on GitHub, you often don't get what you're really looking for. Say you type in "bug fix" to find some help. The search might show you pages that mention "bug fix" exactly, but it could miss related topics like "error resolution" or "problem solving."

This type of searching can return many results that aren't helpful. Because keyword matching misses the subtleties of natural language, developers waste time sifting through unrelated results, which slows down their projects and cuts into their productivity.

Keyword searches still have their place, though. For quick, specific searches where you know exactly what you need, like finding a particular error code, they can be super quick and straight to the point.

Here we searched for ["integration bugfix" on GitHub](https://github.com/search?q=repo%3Amanticoresoftware%2Fmanticoresearch+integration+bugfix&type=issues) and found nothing in the repository:

![Integration Bug Fix on GitHub](./github-semantic-search/integration-bugfix-github.png)


The same search for [integration bugfix](https://github.manticoresearch.com/manticoresoftware/manticoresearch?query=integration%20bugfix;search=semantic-search;filters[index]=issues) on Manticore Semantic Search returns far more relevant results:

![Integration Bug Fix at Manticore Semantic Search](./github-semantic-search/integration-bugfix-manticore.png)

We've made a prototype of what semantic search on GitHub could look like. Check out our [GitHub Issue search demo](https://github.manticoresearch.com/manticoresoftware/manticoresearch?;search=semantic-search) powered by Manticore Vector Search. It allows you to search through GitHub issues, PRs, and comments in a way that understands context. This is especially useful when you can't remember the exact words of an issue but know its context. You can also add your own repository [here](https://github.manticoresearch.com/) and run the project locally by following the instructions [on GitHub](https://github.com/manticoresoftware/manticore-github-issue-search#demo-github-search-with-manticore-search).

Let's look at how this approach can improve the relevance and accuracy of your search results.

## Success Story: Adding Vector Search to Manticore GitHub Demo

When we integrated Vector Search into our [GitHub issue search demo](https://github.manticoresearch.com/), which showcases the capabilities of Manticore Search, the results were impressive. Traditional keyword searches are highly effective for queries where specific terms are known and accuracy in matching these terms is critical. However, the addition of semantic search complements this by allowing us to pinpoint exactly what users are looking for with greater precision, especially in contexts where the intent or meaning of the query matters as much as the specific words used.

Using pre-trained models from [Hugging Face](https://huggingface.co/), we turned text into high-dimensional vectors. These vectors capture the meaning behind the words, so texts with similar meaning end up close to each other and can be matched even when they share no keywords.
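
As a rough illustration of the text-to-vector step, here is a minimal Python sketch. It assumes you run Hugging Face's text-embeddings-inference server yourself and that it listens on `http://localhost:8080` (an assumption, adjust to your setup); the helper names `build_embed_request` and `embed` are ours and are not part of the demo project. The server exposes a `POST /embed` endpoint that takes JSON with an `inputs` field and returns one vector per input.

```python
import json
import urllib.request

# Assumption: a text-embeddings-inference server is running locally on port 8080.
TEI_URL = "http://localhost:8080"


def build_embed_request(text, base_url=TEI_URL):
    """Build the POST /embed request for a single piece of text."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url=TEI_URL):
    """Send the request and return the embedding as a list of floats."""
    with urllib.request.urlopen(build_embed_request(text, base_url)) as response:
        # The server answers with one vector per input; we sent a single input.
        return json.load(response)[0]
```

With a server running, `embed("memory leak")` would return a list of floats whose length depends on the model you loaded.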

Here are a few examples of how semantic search can improve search quality in the [Manticore Search repository](https://github.manticoresearch.com/manticoresoftware/manticoresearch):

### Example: Finding open bugs more easily

![Memory Leak Example](./github-semantic-search/memory-leak-sample.png)

Imagine you're a developer looking for issues related to a specific bug. A traditional search for "memory leak" might miss issues titled **"limit the memory usage"** or **"index out of memory"**. With Vector Search, the engine knows these terms are similar. This means you get all the relevant results without guessing all the possible keywords.

### Example: Checking if a feature request exists before opening a new one

![User Authentication Example](./github-semantic-search/user-authentication-sample.png)

Think about users searching for feature requests related to "user authentication." Keyword searches might only show issues with the exact phrase, but semantic search understands related terms like **"login system"**, **"Access denied"**, and **"Session-level user variables"**. This way, no valuable feedback is missed.

### Example: Easier collaboration

![API Rate Limits Example](./github-semantic-search/api-rate-limits-sample.png)

Contributors working on different parts of a project can really benefit from semantic search. For example, a search for **"API rate limits"** brings up relevant discussions about **"throttling"**, the **"250 results"** limit, and **"rate limiting"**. This helps team members connect related issues even if they use different terms.

### Example: Security audits

![SQL Injection Example](./github-semantic-search/sql-injection-sample.png)

Security audits need thoroughness, often requiring searches for different security vulnerabilities. A search for **"SQL injection"** with traditional keyword methods might miss issues under **"database infiltration"** or **"SQL vulnerability"**. Semantic search makes sure all related security concerns are found, helping with more complete security audits.

## How to Get Started with Semantic Search on GitHub Using Manticore Search

To implement semantic search in our GitHub demo project, we followed these steps:

- **Setting Up Manticore**: We integrated Manticore Search with our project by [installing Manticore Search](https://manticoresearch.com/install/) along with the [Columnar library](https://github.com/manticoresoftware/columnar/), which implements the vector search functionality.

- **Creating the Database Structure**: We set up a real-time table in Manticore Search to store GitHub issues and their semantic representations. These representations, also known as embeddings, are stored as arrays of numbers (also known as vectors). The table includes fields for the issue text, a unique identifier, and a vector that captures the semantic meaning of the text.

    Here's an example of the schema we used:

    ```sql
    CREATE TABLE issues (
      id BIGINT,
      body TEXT,
      vector FLOAT_VECTOR knn_type='hnsw' knn_dims='4' hnsw_similarity='l2'
    );
    ```

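    In this setup, `body` holds the text of the issue, `id` is a unique identifier, and `vector` is a text embedding for the `body`. The `knn_dims` value must match the number of dimensions your embedding model produces; `4` is used here only to keep the example short. `hnsw_similarity='l2'` selects Euclidean (L2) distance for comparing vectors.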

    You might be asking: **What are text embeddings?**

    **Text embeddings** are a way to turn words or phrases into numbers that show their meaning and how they relate to each other. Think of it as a method to convert text into a format that computers can understand. These number representations help machines analyze text better, making it easier to compare different texts and find similarities.

    In our example, the vector is a series of numbers that capture the main idea of the text in the body field. This allows us to do things like finding similar issues or grouping related topics, even if they use different words to describe the same idea.

    We used an AI model from [Sentence Transformers](https://huggingface.co/sentence-transformers). If you want an easy way to get started, we suggest checking out the Hugging Face [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) service. It lets you run your own API and create embeddings tailored to your needs.

- **Insert Data**: Fill your table with vector data.

    Here is what an insert statement looks like with the schema above. Each row supplies the `id`, the `body` text, and the embedding vector, in that order:

    ```sql
    INSERT INTO issues VALUES (
      1,
      'Hello World',
      (0.653448, 0.192478, 0.017971, 0.339821)
    ), (
      2,
      'This is a bug',
      (-0.148894, 0.748278, 0.091892, -0.095406)
    );
    ```

- **Query Data**: Retrieve contextually relevant information using vector-based queries.

    To fetch documents using vector queries, we follow these steps:

    1. Get the search query.
    2. Generate a **text embedding** for the query.
    3. Use the resulting vector in a query like this:


    ```sql
    SELECT id, body
    FROM issues
    WHERE knn ( vector, 10, (0.286569, -0.031816, 0.066684, 0.032926) );
    ```

    Note that `10` is the parameter K, which represents the number of nearest neighbors (closest vectors) to retain in the result set. By default, the results are sorted by vector distance, with the closest ones appearing first.
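
To make the query step more concrete, here is a small Python sketch of what a k-nearest-neighbors lookup does conceptually. It uses exact brute-force L2 distance over the two sample rows, whereas Manticore uses an HNSW index, which is approximate and far faster on large tables, so treat this as an illustration rather than a description of the engine's internals. The helper names (`l2_distance`, `brute_force_knn`, `knn_sql`) are hypothetical.

```python
import math

# The two rows from the INSERT example above: (id, body, vector).
ROWS = [
    (1, "Hello World", (0.653448, 0.192478, 0.017971, 0.339821)),
    (2, "This is a bug", (-0.148894, 0.748278, 0.091892, -0.095406)),
]


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query_vector, rows, k):
    """Return the ids of the k rows closest to query_vector, nearest first."""
    ranked = sorted(rows, key=lambda row: l2_distance(query_vector, row[2]))
    return [row[0] for row in ranked[:k]]


def knn_sql(query_vector, k=10):
    """Format the SELECT statement shown above for a given query vector."""
    values = ", ".join(f"{x:.6f}" for x in query_vector)
    return f"SELECT id, body FROM issues WHERE knn ( vector, {k}, ({values}) );"


# The query vector from the SELECT example above.
query = (0.286569, -0.031816, 0.066684, 0.032926)
nearest_ids = brute_force_knn(query, ROWS, k=10)
statement = knn_sql(query)
```

With only two rows in the table, both are returned even though K is 10, and row `1` comes first because its vector is closer to the query vector. In a real application the query vector would come from the same embedding model that produced the stored vectors.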

That's a wrap! In just a few steps, we've built semantic search using text embeddings and the Vector Search feature of Manticore Search. It's as simple as that! 🙌

## Keyword Search vs Semantic Search

When it comes to enhancing code search on GitHub, **semantic search** offers some distinct advantages:

- **Better search results**: Semantic search understands the meaning behind your queries, allowing you to find relevant code, even if it doesn't match the exact keywords you use.
- **Context-aware code exploration**: It can navigate through large codebases more intelligently, helping you understand how different pieces of code relate to each other.
- **Efficient troubleshooting**: By understanding the context, semantic search can quickly surface relevant issues, solutions, and code snippets that help resolve bugs faster.
- **Easier discovery of relevant implementations and ideas**: It can identify similar implementations or suggest alternative approaches based on the code's intent, not just its wording.

However, it's important to consider the limitations of **semantic search** compared to traditional **keyword search**:

- **Computational complexity**: Running semantic searches can require significant processing power and may take longer than keyword searches, especially in large repositories.
- **Potential for misinterpretation**: The AI might not always get the context or intent of a query right, leading to less relevant results.
- **Lack of precise control**: Developers might find it harder to locate exact phrases or specific code snippets when the search engine is interpreting meaning rather than matching keywords.
- **Dependency on training data**: The quality of semantic search results is closely tied to the AI model's training data, which means it may not always align perfectly with the latest code patterns or terminology.

Given these considerations, the future of GitHub search likely lies in a hybrid approach. By blending the strengths of both semantic and keyword search, GitHub can offer a more powerful tool for developers:

- **User interface options**: Allowing users to toggle between semantic and keyword search depending on their immediate needs.
- **Hybrid search algorithms**: Combining the deep understanding of semantic search with the precision of keyword matching to provide the most relevant results.
- **Contextual switching**: Automatically choosing the best search method based on the type of query and user behavior, ensuring the best possible outcomes.

Incorporating both methods into GitHub's search capabilities will help developers find the right code, faster and more efficiently, balancing the nuanced understanding of semantic search with the reliability and speed of keyword search.
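
One common way to combine the two result lists in a hybrid setup is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique, not a description of how GitHub or our demo ranks results; the issue ids are hypothetical, and `k=60` is just the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    Each id earns 1 / (k + rank) for every list it appears in,
    so ids ranked highly by both search modes rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical issue ids, best match first, from each search mode.
keyword_hits = [101, 205, 330]
semantic_hits = [205, 412, 101]

merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Issues `205` and `101` appear in both lists, so they end up ahead of the issues found by only one search mode. Because RRF works on ranks rather than raw scores, it does not require keyword relevance scores and vector distances to be on the same scale.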

## The Future of GitHub Search: Smarter with Semantic Search

GitHub's traditional keyword-based search could evolve toward a smarter, more intuitive approach: semantic search. This technology has the potential to change how developers interact with repositories, boosting productivity and making the development process smoother, especially when searching through pull requests, issues, and comments.

**Semantic search** for pull requests, issues, and comments offers several key advantages:

1. **Context-aware results**: Unlike traditional search that relies on exact keyword matches, semantic search understands the context and intent behind your query. This means you're more likely to find the relevant pull requests, issues, and comments, even if they don't use the exact words you searched for.
2. **Natural language processing**: You can search using everyday language, without needing to remember specific syntax or keywords. This makes it easier to find what you need.
3. **Improved relevance ranking**: Semantic search can prioritize results based on how closely they match the meaning of your query, saving you time when navigating through numerous pull requests, issues, or comments.
4. **Understanding of synonyms and related concepts**: The search engine can recognize related terms and concepts, widening the scope of relevant results without the need for multiple searches.
5. **Enhanced collaboration**: By making it easier to find related discussions and contributions, semantic search can improve team collaboration and knowledge sharing within projects.
6. **Historical context**: Semantic search has the potential to understand the evolution of discussions in issues and pull requests, offering more comprehensive results that include relevant historical context.
7. **Cross-repository insights**: Advanced semantic search could potentially provide insights across multiple repositories, helping developers discover related discussions or solutions in other projects.

The implementation of semantic search is becoming more achievable thanks to advanced databases like **Manticore Search**. With its built-in **vector search** capabilities, Manticore Search is paving the way for platforms like GitHub to adopt this cutting-edge technology.

While GitHub hasn't fully integrated semantic search yet, developers can experience its power through our [demo project](https://github.com/manticoresoftware/manticore-github-issue-search). This open-source initiative showcases the potential of semantic search in a GitHub-like environment.
