docs and minor tweaks

bkellam 2025-09-19 21:29:05 -07:00
parent 0e527f4e08
commit f90db801cf
18 changed files with 196 additions and 39 deletions

View file

@ -46,6 +46,7 @@
"docs/features/code-navigation",
"docs/features/analytics",
"docs/features/mcp-server",
"docs/features/permission-syncing",
{
"group": "Agents",
"tag": "experimental",

View file

@ -33,17 +33,19 @@ Sourcebot syncs the config file on startup, and automatically whenever a change
The following settings can be provided in your config file to modify Sourcebot's behavior.
| Setting | Type | Default | Minimum | Description / Notes |
|-------------------------------------------|---------|------------|---------|----------------------------------------------------------------------------------------|
| `maxFileSize` | number | 2 MB | 1 | Maximum size (bytes) of a file to index. Files exceeding this are skipped. |
| `maxTrigramCount` | number | 20000 | 1 | Maximum trigrams per document. Larger files are skipped. |
| `reindexIntervalMs` | number | 1 hour | 1 | Interval at which all repositories are reindexed. |
| `resyncConnectionIntervalMs` | number | 24 hours | 1 | Interval for checking connections that need resyncing. |
| `resyncConnectionPollingIntervalMs` | number | 1 second | 1 | DB polling rate for connections that need resyncing. |
| `reindexRepoPollingIntervalMs` | number | 1 second | 1 | DB polling rate for repos that should be reindexed. |
| `maxConnectionSyncJobConcurrency` | number | 8 | 1 | Concurrent connection-sync jobs. |
| `maxRepoIndexingJobConcurrency` | number | 8 | 1 | Concurrent repo-indexing jobs. |
| `maxRepoGarbageCollectionJobConcurrency` | number | 8 | 1 | Concurrent repo-garbage-collection jobs. |
| `repoGarbageCollectionGracePeriodMs` | number | 10 seconds | 1 | Grace period to avoid deleting shards while loading. |
| `repoIndexTimeoutMs` | number | 2 hours | 1 | Timeout for a single repo-indexing run. |
| `enablePublicAccess` **(deprecated)** | boolean | false | — | Use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead. |
| Setting | Type | Default | Minimum | Description / Notes |
|-------------------------------------------------|---------|------------|---------|----------------------------------------------------------------------------------------|
| `maxFileSize` | number | 2 MB | 1 | Maximum size (bytes) of a file to index. Files exceeding this are skipped. |
| `maxTrigramCount` | number | 20000 | 1 | Maximum trigrams per document. Larger files are skipped. |
| `reindexIntervalMs` | number | 1 hour | 1 | Interval at which all repositories are reindexed. |
| `resyncConnectionIntervalMs` | number | 24 hours | 1 | Interval for checking connections that need resyncing. |
| `resyncConnectionPollingIntervalMs` | number | 1 second | 1 | DB polling rate for connections that need resyncing. |
| `reindexRepoPollingIntervalMs` | number | 1 second | 1 | DB polling rate for repos that should be reindexed. |
| `maxConnectionSyncJobConcurrency` | number | 8 | 1 | Concurrent connection-sync jobs. |
| `maxRepoIndexingJobConcurrency` | number | 8 | 1 | Concurrent repo-indexing jobs. |
| `maxRepoGarbageCollectionJobConcurrency` | number | 8 | 1 | Concurrent repo-garbage-collection jobs. |
| `repoGarbageCollectionGracePeriodMs` | number | 10 seconds | 1 | Grace period to avoid deleting shards while loading. |
| `repoIndexTimeoutMs` | number | 2 hours | 1 | Timeout for a single repo-indexing run. |
| `enablePublicAccess` **(deprecated)** | boolean | false | — | Use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead. |
| `experiment_repoDrivenPermissionSyncIntervalMs` | number | 24 hours | 1 | Interval at which the repo permission syncer should run. |
| `experiment_userDrivenPermissionSyncIntervalMs` | number | 24 hours | 1 | Interval at which the user permission syncer should run. |

View file

@ -59,6 +59,7 @@ The following environment variables allow you to configure your Sourcebot deploy
| `AUTH_EE_OKTA_ISSUER` | `-` | <p>The issuer URL for Okta SSO authentication.</p> |
| `AUTH_EE_GCP_IAP_ENABLED` | `false` | <p>When enabled, allows Sourcebot to automatically register/login from a successful GCP IAP redirect</p> |
| `AUTH_EE_GCP_IAP_AUDIENCE` | - | <p>The GCP IAP audience to use when verifying JWT tokens. Must be set to enable GCP IAP JIT provisioning</p> |
| `EXPERIMENT_EE_PERMISSION_SYNC_ENABLED` | `false` | <p>Enables [permission syncing](/docs/features/permission-syncing).</p> |
### Review Agent Environment Variables

View file

@ -196,4 +196,8 @@ To connect to a GitHub host other than `github.com`, provide the `url` property
<GitHubSchema />
</Accordion>
</Accordion>
## See also
- [Syncing GitHub access permissions to Sourcebot](/docs/features/permission-syncing#github)

View file

@ -3,9 +3,9 @@ title: "Agents Overview"
sidebarTitle: "Overview"
---
<Warning>
Agents are currently an experimental feature. Have an idea for an agent that we haven't built? Submit a [feature request](https://github.com/sourcebot-dev/sourcebot/issues/new?template=feature_request.md) on our GitHub.
</Warning>
import ExperimentalFeatureWarning from '/snippets/experimental-feature-warning.mdx'
<ExperimentalFeatureWarning />
Agents are automations that leverage the code indexed on Sourcebot to perform a specific task. Once you've set up Sourcebot, check out the
guides below to configure additional agents.

View file

@ -0,0 +1,72 @@
---
title: "Permission syncing"
sidebarTitle: "Permission syncing"
tag: "experimental"
---
import LicenseKeyRequired from '/snippets/license-key-required.mdx'
import ExperimentalFeatureWarning from '/snippets/experimental-feature-warning.mdx'
<LicenseKeyRequired />
<ExperimentalFeatureWarning />
# Overview
Permission syncing allows you to sync Access Control Lists (ACLs) from a code host to Sourcebot. When configured, users signed into Sourcebot (via the code host's OAuth provider) will only be able to access repositories that they have access to on the code host. Practically, this means:
- Code Search results will only include repositories that the user has access to.
- Code navigation results will only include repositories that the user has access to.
- Ask Sourcebot (and the underlying LLM) will only have access to repositories that the user has access to.
- File browsing is scoped to the repositories that the user has access to.
Permission syncing can be enabled by setting the `EXPERIMENT_EE_PERMISSION_SYNC_ENABLED` environment variable to `true`.
```bash
docker run \
-e EXPERIMENT_EE_PERMISSION_SYNC_ENABLED=true \
/* additional args */ \
ghcr.io/sourcebot-dev/sourcebot:latest
```
## Platform support
We are actively working on supporting more code hosts. If you'd like to see a specific code host supported, please [reach out](https://www.sourcebot.dev/contact).
| Platform | Permission syncing |
|:----------|------------------------------|
| [GitHub (GHEC & GHES)](/docs/features/permission-syncing#github) | ✅ |
| GitLab | 🛑 |
| Bitbucket Cloud | 🛑 |
| Bitbucket Data Center | 🛑 |
| Gitea | 🛑 |
| Gerrit | 🛑 |
| Generic git host | 🛑 |
# Getting started
## GitHub
Prerequisite: [Add GitHub as an OAuth provider](/docs/configuration/auth/providers#github).
Permission syncing works with **GitHub.com**, **GitHub Enterprise Cloud**, and **GitHub Enterprise Server**. For organization-owned repositories, users that have **read-only** access (or above) via the following methods will have their access synced to Sourcebot:
- Outside collaborators
- Organization members that are direct collaborators
- Organization members with access through team memberships
- Organization members with access through default organization permissions
- Organization owners
**Notes:**
- A GitHub OAuth provider must be configured to (1) correlate a Sourcebot user with a GitHub user, and (2) list the repositories that the user has access to for [User driven syncing](/docs/features/permission-syncing#how-it-works).
- OAuth tokens must have the `repo` scope in order to use the [List repositories for the authenticated user API](https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repositories-for-the-authenticated-user) during [User driven syncing](/docs/features/permission-syncing#how-it-works). Sourcebot **will only** use this token for **reads**.
# How it works
Permission syncing works by periodically syncing ACLs from the code host(s) to Sourcebot to build an internal mapping between Users and Repositories. This mapping is hydrated in two directions:
- **User driven** : fetches the list of all repositories that a given user has access to.
- **Repo driven** : fetches the list of all users that have access to a given repository.
User driven and repo driven syncing each run every 24 hours by default. These intervals can be configured using the following settings in the [config file](/docs/configuration/config-file):
| Setting | Type | Default | Minimum |
|-------------------------------------------------|---------|------------|---------|
| `experiment_repoDrivenPermissionSyncIntervalMs` | number | 24 hours | 1 |
| `experiment_userDrivenPermissionSyncIntervalMs` | number | 24 hours | 1 |
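
As a rough illustration of how such an interval could gate a sync, the sketch below checks whether a user or repo is due for another pass. `isSyncDue` and `DAY_MS` are hypothetical names, and comparing against a last-synced timestamp is an assumption about the scheduler rather than Sourcebot's actual query.

```typescript
// Hypothetical helper, not part of the Sourcebot codebase.
// 24 hours in milliseconds, matching the documented default interval.
const DAY_MS = 1000 * 60 * 60 * 24;

// Returns true when no sync has happened yet, or when the last sync
// is older than the configured interval.
const isSyncDue = (
    lastSyncedAt: Date | null,
    intervalMs: number = DAY_MS,
    now: Date = new Date(),
): boolean => {
    if (lastSyncedAt === null) {
        return true;
    }
    // Anything synced before this point in time is considered stale.
    const thresholdDate = new Date(now.getTime() - intervalMs);
    return lastSyncedAt.getTime() < thresholdDate.getTime();
};
```

With the default interval, a record synced 12 hours ago would not be due, while lowering the interval to one hour (`1000 * 60 * 60`) would make the same record eligible again.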

View file

@ -0,0 +1,4 @@
<Warning>
This is an experimental feature. Certain functionality may be buggy or incomplete, and breaking changes may ship in non-major releases. Have feedback? Submit an [issue](https://github.com/sourcebot-dev/sourcebot/issues) on GitHub.
</Warning>

View file

@ -69,6 +69,16 @@
"deprecated": true,
"description": "This setting is deprecated. Please use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead.",
"default": false
},
"experiment_repoDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the repo permission syncer should run. Defaults to 24 hours.",
"minimum": 1
},
"experiment_userDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the user permission syncer should run. Defaults to 24 hours.",
"minimum": 1
}
},
"additionalProperties": false
@ -195,6 +205,16 @@
"deprecated": true,
"description": "This setting is deprecated. Please use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead.",
"default": false
},
"experiment_repoDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the repo permission syncer should run. Defaults to 24 hours.",
"minimum": 1
},
"experiment_userDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the user permission syncer should run. Defaults to 24 hours.",
"minimum": 1
}
},
"additionalProperties": false

View file

@ -15,7 +15,9 @@ export const DEFAULT_SETTINGS: Settings = {
maxRepoGarbageCollectionJobConcurrency: 8,
repoGarbageCollectionGracePeriodMs: 10 * 1000, // 10 seconds
repoIndexTimeoutMs: 1000 * 60 * 60 * 2, // 2 hours
enablePublicAccess: false // deprecated, use FORCE_ENABLE_ANONYMOUS_ACCESS instead
enablePublicAccess: false, // deprecated, use FORCE_ENABLE_ANONYMOUS_ACCESS instead
experiment_repoDrivenPermissionSyncIntervalMs: 1000 * 60 * 60 * 24, // 24 hours
experiment_userDrivenPermissionSyncIntervalMs: 1000 * 60 * 60 * 24, // 24 hours
}
export const PERMISSION_SYNC_SUPPORTED_CODE_HOST_TYPES = [

View file

@ -9,7 +9,7 @@ import { Job, Queue, Worker } from 'bullmq';
import { Redis } from 'ioredis';
import { env } from "../env.js";
import { createOctokitFromConfig, getUserIdsWithReadAccessToRepo } from "../github.js";
import { RepoWithConnections } from "../types.js";
import { RepoWithConnections, Settings } from "../types.js";
import { PERMISSION_SYNC_SUPPORTED_CODE_HOST_TYPES } from "../constants.js";
import { hasEntitlement } from "@sourcebot/shared";
@ -28,6 +28,7 @@ export class RepoPermissionSyncer {
constructor(
private db: PrismaClient,
private settings: Settings,
redis: Redis,
) {
this.queue = new Queue<RepoPermissionSyncJob>(QUEUE_NAME, {
@ -50,7 +51,7 @@ export class RepoPermissionSyncer {
return setInterval(async () => {
// @todo: make this configurable
const thresholdDate = new Date(Date.now() - 1000 * 60 * 60 * 24);
const thresholdDate = new Date(Date.now() - this.settings.experiment_repoDrivenPermissionSyncIntervalMs);
const repos = await this.db.repo.findMany({
// Repos need their permissions to be synced against the code host when...
@ -166,8 +167,14 @@ export class RepoPermissionSyncer {
const config = connection.config as unknown as GithubConnectionConfig;
const { octokit } = await createOctokitFromConfig(config, repo.orgId, this.db);
// @nocheckin - need to handle when repo displayName is not set.
const [owner, repoName] = repo.displayName!.split('/');
// @note: this is a bit of a hack since the displayName _might_ not be set..
// however, this property was introduced many versions ago and _should_ be set
// on each connection sync. Let's throw an error just in case.
if (!repo.displayName) {
throw new Error(`Repo ${id} does not have a displayName`);
}
const [owner, repoName] = repo.displayName.split('/');
const githubUserIds = await getUserIdsWithReadAccessToRepo(owner, repoName, octokit);
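
The guard above assumes a GitHub `displayName` of the form `owner/repo`. A minimal sketch of a stricter parse is shown below; `parseOwnerAndRepo` is a hypothetical helper that is not in this commit, and rejecting names with extra or missing segments is an added assumption.

```typescript
// Hypothetical helper, not part of the Sourcebot codebase.
// Splits an "owner/repo" display name and fails loudly on anything else.
const parseOwnerAndRepo = (
    displayName: string | null | undefined,
): { owner: string; repoName: string } => {
    if (!displayName) {
        throw new Error("displayName is not set");
    }
    const slash = displayName.indexOf("/");
    const hasExactlyOneSlash = slash !== -1 && displayName.indexOf("/", slash + 1) === -1;
    // Require a non-empty segment on both sides of the single slash.
    if (!hasExactlyOneSlash || slash === 0 || slash === displayName.length - 1) {
        throw new Error(`Expected "owner/repo", got "${displayName}"`);
    }
    return {
        owner: displayName.slice(0, slash),
        repoName: displayName.slice(slash + 1),
    };
};
```

Compared with destructuring the result of `split('/')`, this avoids silently passing an `undefined` repo name to the GitHub API when the display name has an unexpected shape.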

View file

@ -6,8 +6,9 @@ import { Job, Queue, Worker } from "bullmq";
import { Redis } from "ioredis";
import { PERMISSION_SYNC_SUPPORTED_CODE_HOST_TYPES } from "../constants.js";
import { env } from "../env.js";
import { getReposThatAuthenticatedUserHasReadAccessTo } from "../github.js";
import { createOctokitFromOAuthToken, getReposForAuthenticatedUser } from "../github.js";
import { hasEntitlement } from "@sourcebot/shared";
import { Settings } from "../types.js";
const logger = createLogger('user-permission-syncer');
@ -24,6 +25,7 @@ export class UserPermissionSyncer {
constructor(
private db: PrismaClient,
private settings: Settings,
redis: Redis,
) {
this.queue = new Queue<UserPermissionSyncJob>(QUEUE_NAME, {
@ -45,7 +47,7 @@ export class UserPermissionSyncer {
logger.debug('Starting scheduler');
return setInterval(async () => {
const thresholdDate = new Date(Date.now() - 1000 * 60 * 60 * 24);
const thresholdDate = new Date(Date.now() - this.settings.experiment_userDrivenPermissionSyncIntervalMs);
const users = await this.db.user.findMany({
where: {
@ -152,15 +154,11 @@ export class UserPermissionSyncer {
for (const account of user.accounts) {
const repoIds = await (async () => {
if (account.provider === 'github') {
// @todo: we will need to provide some mechanism for the user to provide a custom
// URL here. This will correspond to the host URL they are using for their GitHub
// instance.
const octokit = new Octokit({
auth: account.access_token,
// baseUrl: /* todo */
});
const repoIds = await getReposThatAuthenticatedUserHasReadAccessTo(octokit);
const octokit = await createOctokitFromOAuthToken(account.access_token);
// @note: we only care about the private repos since we don't need to build a mapping
// for public repos.
// @see: packages/web/src/prisma.ts
const repoIds = await getReposForAuthenticatedUser(/* visibility = */ 'private', octokit);
const repos = await this.db.repo.findMany({
where: {

View file

@ -54,6 +54,7 @@ export const env = createEnv({
GITLAB_CLIENT_QUERY_TIMEOUT_SECONDS: numberSchema.default(60 * 10),
EXPERIMENT_EE_PERMISSION_SYNC_ENABLED: booleanSchema.default('false'),
AUTH_EE_GITHUB_BASE_URL: z.string().optional(),
},
runtimeEnv: process.env,
emptyStringAsUndefined: true,

View file

@ -129,11 +129,10 @@ export const getUserIdsWithReadAccessToRepo = async (owner: string, repo: string
return collaborators.map(collaborator => collaborator.id.toString());
}
export const getReposThatAuthenticatedUserHasReadAccessTo = async (octokit: Octokit) => {
export const getReposForAuthenticatedUser = async (visibility: 'all' | 'private' | 'public' = 'all', octokit: Octokit) => {
const fetchFn = () => octokit.paginate(octokit.repos.listForAuthenticatedUser, {
per_page: 100,
// @todo: do we need to set a visibility to private only?
// visibility: 'private'
visibility,
});
const repos = await fetchWithRetry(fetchFn, `authenticated user`, logger);
@ -164,6 +163,14 @@ export const createOctokitFromConfig = async (config: GithubConnectionConfig, or
};
}
export const createOctokitFromOAuthToken = async (token: string | null): Promise<Octokit> => {
const apiUrl = env.AUTH_EE_GITHUB_BASE_URL ? `${env.AUTH_EE_GITHUB_BASE_URL}/api/v3` : "https://api.github.com";
return new Octokit({
auth: token,
baseUrl: apiUrl,
});
}
export const shouldExcludeRepo = ({
repo,
include,
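
The base URL selection in `createOctokitFromOAuthToken` can be isolated into a pure function for testing. The sketch below uses a hypothetical `resolveGitHubApiUrl` helper. It mirrors the logic in the hunk above (GitHub Enterprise Server serves its REST API under `/api/v3`), and the trailing-slash trimming is an addition that is not in this commit.

```typescript
// Hypothetical helper, not part of the Sourcebot codebase.
// Maps an optional GitHub host URL to the REST API base URL.
const resolveGitHubApiUrl = (baseUrl?: string): string => {
    // No custom host configured: fall back to github.com's public API.
    if (!baseUrl) {
        return "https://api.github.com";
    }
    // Assumption: trim trailing slashes so the result never contains "//api".
    const trimmed = baseUrl.replace(/\/+$/, "");
    return `${trimmed}/api/v3`;
};
```

For example, `resolveGitHubApiUrl("https://github.example.com/")` yields `https://github.example.com/api/v3`, where `github.example.com` is a placeholder host.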

View file

@ -69,8 +69,8 @@ const settings = await getSettings(env.CONFIG_PATH);
const connectionManager = new ConnectionManager(prisma, settings, redis);
const repoManager = new RepoManager(prisma, settings, redis, promClient, context);
const repoPermissionSyncer = new RepoPermissionSyncer(prisma, redis);
const userPermissionSyncer = new UserPermissionSyncer(prisma, redis);
const repoPermissionSyncer = new RepoPermissionSyncer(prisma, settings, redis);
const userPermissionSyncer = new UserPermissionSyncer(prisma, settings, redis);
await repoManager.validateIndexedReposHaveShards();

View file

@ -68,6 +68,16 @@ const schema = {
"deprecated": true,
"description": "This setting is deprecated. Please use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead.",
"default": false
},
"experiment_repoDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the repo permission syncer should run. Defaults to 24 hours.",
"minimum": 1
},
"experiment_userDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the user permission syncer should run. Defaults to 24 hours.",
"minimum": 1
}
},
"additionalProperties": false
@ -194,6 +204,16 @@ const schema = {
"deprecated": true,
"description": "This setting is deprecated. Please use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead.",
"default": false
},
"experiment_repoDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the repo permission syncer should run. Defaults to 24 hours.",
"minimum": 1
},
"experiment_userDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the user permission syncer should run. Defaults to 24 hours.",
"minimum": 1
}
},
"additionalProperties": false

View file

@ -102,6 +102,14 @@ export interface Settings {
* This setting is deprecated. Please use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead.
*/
enablePublicAccess?: boolean;
/**
* The interval (in milliseconds) at which the repo permission syncer should run. Defaults to 24 hours.
*/
experiment_repoDrivenPermissionSyncIntervalMs?: number;
/**
* The interval (in milliseconds) at which the user permission syncer should run. Defaults to 24 hours.
*/
experiment_userDrivenPermissionSyncIntervalMs?: number;
}
/**
* Search context

View file

@ -67,6 +67,16 @@
"deprecated": true,
"description": "This setting is deprecated. Please use the `FORCE_ENABLE_ANONYMOUS_ACCESS` environment variable instead.",
"default": false
},
"experiment_repoDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the repo permission syncer should run. Defaults to 24 hours.",
"minimum": 1
},
"experiment_userDrivenPermissionSyncIntervalMs": {
"type": "number",
"description": "The interval (in milliseconds) at which the user permission syncer should run. Defaults to 24 hours.",
"minimum": 1
}
},
"additionalProperties": false