The Sourcebot MCP server enables precise regular expression code search across repos hosted on [GitHub](https://docs.sourcebot.dev/docs/connections/github), [GitLab](https://docs.sourcebot.dev/docs/connections/gitlab), [BitBucket](https://docs.sourcebot.dev/docs/connections/bitbucket-cloud) and [more](#supported-code-hosts). This unlocks the capability for LLM agents to fetch code context for repositories that aren't checked out locally. Some use cases where precise search on a wider code context can help:
- Enriching responses to user requests:
- _"What repositories are using internal library X?"_
- _"Provide usage examples of the CodeMirror component"_
- _"Where is the `useCodeMirrorTheme` hook defined?"_
- _"Find all usages of `deprecatedApi` across all repos"_
- Improving reasoning ability for existing horizontal agents like AI code review, docs generation, etc.
- _"Find the definitions for all functions in this diff"_
- _"Document what systems depend on this class"_
- Building custom LLM horizontal agents like like compliance auditing agents, migration agents, etc.
- _"Find all instances of hardcoded credentials"_
- _"Identify repositories that depend on this depreacted api"_
## Getting Started
1. Install Node.JS >= v18.0.0.
2. (optional) Spin up a Sourcebot instance by following [this guide](https://docs.sourcebot.dev/self-hosting/overview). The host url of your instance (e.g., `http://localhost:3000`) is passed to the MCP server via the `SOURCEBOT_HOST` url.
If a host is not provided, then the server will fallback to using the demo instance hosted at https://demo.sourcebot.dev. You can see the list of repositories indexed [here](https://demo.sourcebot.dev/~/repos). Add additional repositories by [opening a PR](https://github.com/sourcebot-dev/sourcebot/blob/main/demo-site-config.json).
| Don't see your code host? Open a [GitHub discussion](https://github.com/sourcebot-dev/sourcebot/discussions/categories/ideas).
## Future Work
### Semantic Search
Currently, Sourcebot only supports regex-based code search (powered by [zoekt](https://github.com/sourcegraph/zoekt) under the hood). It is great for scenarios when the agent is searching for is something that is super precise and well-represented in the source code (e.g., a specific function name, a error string, etc.). It is not-so-great for _fuzzy_ searches where the objective is to find some loosely defined _category_ or _concept_ in the code (e.g., find code that verifies JWT tokens). The LLM can approximate this by crafting regex searches that attempt to capture a concept (e.g., it might try a query like `"jwt|token|(verify|validate).*(jwt|token)"`), but often yields sub-optimal search results that aren't related. Tools like Cursor solve this with [embedding models](https://docs.cursor.com/context/codebase-indexing) to capture the semantic meaning of code, allowing for LLMs to search using natural language. We would like to extend Sourcebot to support semantic search and expose this capability over MCP as a tool (e.g., `semantic_search_code` tool). [GitHub Discussion](https://github.com/sourcebot-dev/sourcebot/discussions/297)
### Code Navigation
Another idea is to allow LLMs to traverse abstract syntax trees (ASTs) of a codebase to enable reliable code navigation. This could be packaged as tools like `goto_definition`, `find_all_references`, etc., which could be useful for LLMs to get additional code context. [GitHub Discussion](https://github.com/sourcebot-dev/sourcebot/discussions/296)
### Got an idea?
Open up a [GitHub discussion](https://github.com/sourcebot-dev/sourcebot/discussions/categories/feature-requests)!