# OpenWebUI Horizontal Scaling with Terraform
This Terraform configuration implements horizontal scaling for OpenWebUI using a blue-green deployment strategy with zero downtime.
## Architecture Overview
```
Microsoft Entra App Proxy → Internal ALB → ECS Service (3 instances) → Aurora PostgreSQL
                                                ↓              ↓
                                          Redis Cluster   EFS Storage
```
## Key Components
- **Internal ALB**: Routes traffic to multiple ECS instances with sticky sessions (sketched below)
- **ElastiCache Redis**: Session management and WebSocket coordination
- **ECS Service**: `webui3-scaled` with 3 Fargate instances (2 vCPU, 8GB each)
- **Auto Scaling**: CPU-based scaling from 2-10 instances
- **Monitoring**: CloudWatch dashboards and alarms
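The sticky-session routing maps onto an ALB target group roughly as follows. This is a minimal sketch, not the repo's actual resource: the name, the OpenWebUI container port 8080, and `var.vpc_id` are illustrative assumptions.
```hcl
# Sketch: ALB target group with cookie-based stickiness, so each browser
# session keeps hitting the same Fargate task. Names are illustrative.
resource "aws_lb_target_group" "webui" {
  name        = "webui3-scaled"
  port        = 8080  # assumed OpenWebUI container port
  protocol    = "HTTP"
  target_type = "ip"  # required for Fargate (awsvpc) tasks
  vpc_id      = var.vpc_id # hypothetical variable

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400 # 1 day
  }

  health_check {
    path = "/health"
  }
}
```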
## Directory Structure
```
iac/
├── modules/                     # Reusable Terraform modules
│   └── grafana-otel/            # Standalone Grafana OTEL monitoring module
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── versions.tf
│       └── README.md
├── grafana-standalone/          # Example standalone Grafana deployment
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── terraform.tfvars.example
│   └── README.md
├── *.tf                         # Main OpenWebUI infrastructure
└── README.md                    # This file
```
### Grafana OTEL Monitoring Module
The `modules/grafana-otel/` directory contains a standalone, reusable Terraform module for deploying the Grafana OTEL LGTM (Loki, Grafana, Tempo, Mimir) monitoring stack. The module can be deployed independently of the main OpenWebUI infrastructure.
**Features:**
- Complete OTEL observability stack (Grafana + Prometheus + Tempo + Loki)
- ECS Fargate deployment with autoscaling
- Service discovery integration
- Configurable security and network access
- CloudWatch logging integration
**Usage:**
- **Standalone**: Use `grafana-standalone/` for independent deployment with S3 remote state
- **Integrated**: Reference the module from other Terraform configurations (see the sketch below)
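A minimal sketch of the integrated usage. The input names here are illustrative assumptions, so check `modules/grafana-otel/variables.tf` for the module's actual variables:
```hcl
# Hypothetical module reference — the inputs below are assumptions,
# not the module's confirmed variable names.
module "grafana_otel" {
  source = "./modules/grafana-otel"

  vpc_id          = var.vpc_id              # assumed input
  private_subnets = var.private_subnet_ids  # assumed input
  ecs_cluster_arn = aws_ecs_cluster.main.arn # assumed reference
}
```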
**State Management:**
- **Main Infrastructure**: `production/gravity-ai-chat/terraform.tfstate`
- **Grafana Monitoring**: `production/grafana-monitoring/terraform.tfstate`
Both deployments use the same S3 bucket but separate state files for independent management.
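For reference, the standalone Grafana deployment's backend block would differ only in its `key`. A sketch, assuming it shares the same bucket, region, profile, and lock table as the main configuration:
```hcl
backend "s3" {
  bucket         = "gg-ai-terraform-states"
  key            = "production/grafana-monitoring/terraform.tfstate"
  region         = "us-east-1"
  profile        = "908027381725_AdministratorAccess"
  dynamodb_table = "terraform-state-locks"
  encrypt        = true
}
```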
See `modules/grafana-otel/README.md` for detailed documentation.
## Prerequisites
1. Terraform >= 1.0
2. AWS CLI configured with profile `908027381725_AdministratorAccess`
3. Existing OpenWebUI infrastructure running
## Remote Backend Configuration
This Terraform configuration uses an S3 remote backend for state management with DynamoDB for state locking, providing team collaboration and state consistency.
### Backend Configuration Details
```hcl
backend "s3" {
  bucket         = "gg-ai-terraform-states"
  key            = "production/gravity-ai-chat/terraform.tfstate"
  region         = "us-east-1"
  profile        = "908027381725_AdministratorAccess"
  dynamodb_table = "terraform-state-locks"
  encrypt        = true
}
```
### Key Features
- **S3 Bucket**: `gg-ai-terraform-states` - Centralized state storage
- **State Path**: `production/gravity-ai-chat/terraform.tfstate` - Environment-specific state file
- **Encryption**: Server-side encryption enabled for state file security
- **State Locking**: DynamoDB table `terraform-state-locks` prevents concurrent modifications
- **Profile-based Access**: Uses AWS profile `908027381725_AdministratorAccess` for authentication
### Backend Initialization
The backend is configured automatically when you run `terraform init`. Before initializing, ensure the following (a bootstrap sketch follows this list):
1. **S3 bucket exists**: The `gg-ai-terraform-states` bucket must be created beforehand
2. **DynamoDB table exists**: The `terraform-state-locks` table must exist with:
   - Primary key: `LockID` (String)
   - Billing mode: On-demand or provisioned
3. **AWS credentials**: Your profile `908027381725_AdministratorAccess` must have the appropriate permissions:
   - S3: `GetObject`, `PutObject`, `DeleteObject` on the state bucket
   - DynamoDB: `GetItem`, `PutItem`, `DeleteItem` on the locks table
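If either resource is missing, it can be bootstrapped once from a separate Terraform configuration with local state. A minimal sketch; everything beyond the names above is an assumption:
```hcl
# Bootstrap config for the remote backend — apply once with local state.
resource "aws_s3_bucket" "terraform_states" {
  bucket = "gg-ai-terraform-states"
}

# Server-side encryption on the state bucket (algorithm is an assumption)
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_states" {
  bucket = aws_s3_bucket.terraform_states.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Lock table: Terraform only requires a `LockID` string hash key
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```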
### Troubleshooting Backend Issues
**Error: "NoSuchBucket"**
```bash
# Verify bucket exists
aws s3 ls s3://gg-ai-terraform-states --profile "908027381725_AdministratorAccess"
```
**Error: "ResourceNotFoundException" (DynamoDB)**
```bash
# Verify DynamoDB table exists
aws dynamodb describe-table \
  --table-name terraform-state-locks \
  --profile "908027381725_AdministratorAccess" \
  --region us-east-1
```
**Error: "AccessDenied"**
```bash
# Verify AWS profile configuration
aws sts get-caller-identity --profile "908027381725_AdministratorAccess"
```
## Deployment Steps
### Phase 1: Infrastructure Deployment
```bash
# 1. Initialize Terraform (from this directory)
cd iac
terraform init

# 2. Review the plan
terraform plan

# 3. Deploy infrastructure
terraform apply
```
### Phase 2: Testing and Validation
```bash
# 1. Get ALB DNS name
ALB_DNS=$(terraform output -raw alb_dns_name)

# 2. Test health endpoint
curl -H "Host: ai-glondon.msappproxy.net" http://$ALB_DNS/health

# 3. Monitor service health
aws ecs describe-services \
  --cluster webUIcluster2 \
  --services webui3-scaled \
  --profile "908027381725_AdministratorAccess" \
  --region us-east-1
```
### Phase 3: Traffic Switch
1. **Update the Entra App Proxy backend URL**:
   - FROM: the current service discovery endpoint
   - TO: `http://<alb-dns-name>` (from `terraform output`)
2. **Monitor the dashboard**:
   ```bash
   terraform output -raw dashboard_url
   ```
3. **After validation, scale down the old service**:
   ```bash
   aws ecs update-service \
     --cluster webUIcluster2 \
     --service webui3 \
     --desired-count 0 \
     --profile "908027381725_AdministratorAccess" \
     --region us-east-1
   ```
## Configuration Details
### Resource Allocation
| Component | Specification |
|-----------|---------------|
| ECS Instances | 3x (2 vCPU, 8GB RAM) |
| Redis | cache.t3.micro, encrypted |
| ALB | Internal, sticky sessions |
| Auto Scaling | 2-10 instances, 70% CPU target |
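The auto scaling row corresponds to an Application Auto Scaling target-tracking policy. A minimal sketch using the bounds and target from the table; resource names are illustrative, not necessarily those used in this configuration:
```hcl
# Register the ECS service as a scalable target (2–10 tasks)
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/webUIcluster2/webui3-scaled"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

# Track average CPU at 70%: scale out above, scale in below
resource "aws_appautoscaling_policy" "cpu" {
  name               = "webui3-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 70
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```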
### Environment Variables Added
```bash
WEBSOCKET_MANAGER=redis
WEBSOCKET_REDIS_URL=<redis-endpoint>
ENABLE_WEBSOCKET_SUPPORT=true
WEBUI_SECRET_KEY=<shared-secret>
UVICORN_WORKERS=2
THREAD_POOL_SIZE=20
```
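These variables are injected through the ECS task definition's container definitions. A minimal sketch, assuming the standard OpenWebUI image and a hypothetical `var.redis_url`; the real configuration may differ:
```hcl
# Illustrative fragment carrying the variables above; the jsonencode
# structure follows the standard aws_ecs_task_definition schema.
resource "aws_ecs_task_definition" "webui_scaled" {
  family                   = "webui3-scaled"
  cpu                      = "2048" # 2 vCPU
  memory                   = "8192" # 8 GB
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"

  container_definitions = jsonencode([{
    name  = "open-webui"
    image = "ghcr.io/open-webui/open-webui:main"
    environment = [
      { name = "WEBSOCKET_MANAGER", value = "redis" },
      { name = "WEBSOCKET_REDIS_URL", value = var.redis_url }, # hypothetical variable
      { name = "ENABLE_WEBSOCKET_SUPPORT", value = "true" },
      { name = "UVICORN_WORKERS", value = "2" },
      { name = "THREAD_POOL_SIZE", value = "20" },
    ]
    # WEBUI_SECRET_KEY is a shared credential — deliver it via the
    # `secrets` block from AWS Secrets Manager, not plain `environment`.
  }])
}
```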
## Data Loss Prevention
**Zero Data Loss Guarantee**:
- Database: Aurora PostgreSQL unchanged
- Files: EFS shared across all instances
- Sessions: Redis-backed session storage
- Configuration: Database-backed syncing
## Cost Impact
| Resource | Monthly Cost |
|----------|--------------|
| ECS Fargate (3 smaller tasks) | ~$400 |
| ElastiCache Redis | ~$17 |
| Internal ALB | ~$25 |
| **Net increase over current setup** | **~$100/month** |
## Monitoring
### CloudWatch Dashboard
Access via: `terraform output dashboard_url`
**Metrics Tracked**:
- ECS CPU/Memory utilization
- ALB response times and error rates
- Redis performance metrics
- Auto scaling events
### Alarms
- **High CPU** (>85%): Scale out trigger (sketched below)
- **High Memory** (>90%): Resource pressure alert
- **ALB Response Time** (>2s): Performance degradation
- **Redis CPU** (>75%): Cache performance
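As an illustration, the high-CPU alarm above could be declared as follows; the threshold comes from the list, while the alarm name, period, and evaluation count are assumptions:
```hcl
# Sketch of the >85% CPU alarm on the scaled ECS service.
resource "aws_cloudwatch_metric_alarm" "ecs_high_cpu" {
  alarm_name          = "webui3-scaled-high-cpu" # illustrative name
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60 # assumed
  evaluation_periods  = 3  # assumed
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    ClusterName = "webUIcluster2"
    ServiceName = "webui3-scaled"
  }
}
```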
## Cleanup
```bash
# Destroy infrastructure (keeps data)
terraform destroy
# Note: Aurora, EFS, and original service remain unchanged
```
## Security
- **Network**: Private subnets, security group isolation (sketched below)
- **Encryption**: Redis and EFS encryption enabled
- **Secrets**: AWS Secrets Manager for credentials
- **Access**: Internal ALB, no public internet exposure
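A minimal sketch of the security-group isolation described above, with illustrative names, an assumed container port 8080, and a hypothetical `var.vpc_id`; the ECS tasks only accept traffic from the internal ALB:
```hcl
# One group for the internal ALB, one for the ECS tasks.
resource "aws_security_group" "alb" {
  name   = "webui-internal-alb"
  vpc_id = var.vpc_id # hypothetical variable
}

resource "aws_security_group" "ecs" {
  name   = "webui-ecs-tasks"
  vpc_id = var.vpc_id
}

# Tasks admit traffic only when it originates from the ALB's group.
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs.id
  source_security_group_id = aws_security_group.alb.id
}
```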
## Blue-Green Deployment Strategy
![diagram](./blue-green-deployment-strategy.png)
---
**Important**: This deployment maintains the existing production service while adding horizontal scaling capability. Always test thoroughly before switching traffic.