# OpenWebUI Horizontal Scaling with Terraform
This Terraform configuration implements horizontal scaling for OpenWebUI using a blue-green deployment strategy with zero downtime.
## Architecture Overview
```
Microsoft Entra App Proxy → Internal ALB → ECS Service (3 instances) → Aurora PostgreSQL
                                                ↓              ↓
                                          Redis Cluster   EFS Storage
```
## Key Components
- **Internal ALB**: Routes traffic to multiple ECS instances with sticky sessions (sketched below)
- **ElastiCache Redis**: Session management and WebSocket coordination
- **ECS Service**: `webui3-scaled` with 3 Fargate instances (2 vCPU, 8GB each)
- **Auto Scaling**: CPU-based scaling from 2-10 instances
- **Monitoring**: CloudWatch dashboards and alarms
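The sticky-session routing maps onto an ALB target group roughly as follows. This is a minimal sketch, not the repo's actual resource: the name, the OpenWebUI container port 8080, and `var.vpc_id` are illustrative assumptions.
```hcl
# Sketch: ALB target group with cookie-based stickiness, so each browser
# session keeps hitting the same Fargate task. Names are illustrative.
resource "aws_lb_target_group" "webui" {
  name        = "webui3-scaled"
  port        = 8080  # assumed OpenWebUI container port
  protocol    = "HTTP"
  target_type = "ip"  # required for Fargate (awsvpc) tasks
  vpc_id      = var.vpc_id # hypothetical variable

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400 # 1 day
  }

  health_check {
    path = "/health"
  }
}
```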
## Directory Structure
```
iac/
├── modules/                     # Reusable Terraform modules
│   └── grafana-otel/            # Standalone Grafana OTEL monitoring module
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── versions.tf
│       └── README.md
├── grafana-standalone/          # Example standalone Grafana deployment
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── terraform.tfvars.example
│   └── README.md
├── *.tf                         # Main OpenWebUI infrastructure
└── README.md                    # This file
```
### Grafana OTEL Monitoring Module
The `modules/grafana-otel/` directory contains a standalone, reusable Terraform module for deploying the Grafana OTEL LGTM (Loki, Grafana, Tempo, Mimir) monitoring stack. The module can be deployed independently of the main OpenWebUI infrastructure.
**Features:**
- Complete OTEL observability stack (Grafana + Prometheus + Tempo + Loki)
- ECS Fargate deployment with autoscaling
- Service discovery integration
- Configurable security and network access
- CloudWatch logging integration
**Usage:**
- **Standalone**: Use `grafana-standalone/` for independent deployment with S3 remote state
- **Integrated**: Reference the module from other Terraform configurations (see the sketch below)
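A minimal sketch of the integrated usage. The input names here are illustrative assumptions, so check `modules/grafana-otel/variables.tf` for the module's actual variables:
```hcl
# Hypothetical module reference — the inputs below are assumptions,
# not the module's confirmed variable names.
module "grafana_otel" {
  source = "./modules/grafana-otel"

  vpc_id          = var.vpc_id              # assumed input
  private_subnets = var.private_subnet_ids  # assumed input
  ecs_cluster_arn = aws_ecs_cluster.main.arn # assumed reference
}
```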
**State Management:**
- **Main Infrastructure**: `production/gravity-ai-chat/terraform.tfstate`
- **Grafana Monitoring**: `production/grafana-monitoring/terraform.tfstate`
Both deployments use the same S3 bucket but separate state files for independent management.
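For reference, the standalone Grafana deployment's backend block would differ only in its `key`. A sketch, assuming it shares the same bucket, region, profile, and lock table as the main configuration:
```hcl
backend "s3" {
  bucket         = "gg-ai-terraform-states"
  key            = "production/grafana-monitoring/terraform.tfstate"
  region         = "us-east-1"
  profile        = "908027381725_AdministratorAccess"
  dynamodb_table = "terraform-state-locks"
  encrypt        = true
}
```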
See `modules/grafana-otel/README.md` for detailed documentation.
## Prerequisites
1. Terraform >= 1.0
2. AWS CLI configured with profile `908027381725_AdministratorAccess`
3. Existing OpenWebUI infrastructure running
## Remote Backend Configuration
This Terraform configuration uses an S3 remote backend for state management with DynamoDB for state locking, providing team collaboration and state consistency.
### Backend Configuration Details
```hcl
backend "s3" {
  bucket         = "gg-ai-terraform-states"
  key            = "production/gravity-ai-chat/terraform.tfstate"
  region         = "us-east-1"
  profile        = "908027381725_AdministratorAccess"
  dynamodb_table = "terraform-state-locks"
  encrypt        = true
}
```
### Key Features
- **S3 Bucket**: `gg-ai-terraform-states` - Centralized state storage
- **State Path**: `production/gravity-ai-chat/terraform.tfstate` - Environment-specific state file
- **Encryption**: Server-side encryption enabled for state file security
- **State Locking**: DynamoDB table `terraform-state-locks` prevents concurrent modifications
- **Profile-based Access**: Uses AWS profile `908027381725_AdministratorAccess` for authentication
### Backend Initialization
The backend is configured automatically when you run `terraform init`. Before initializing, ensure the following (a bootstrap sketch follows this list):
1. **S3 bucket exists**: The `gg-ai-terraform-states` bucket must be created beforehand
2. **DynamoDB table exists**: The `terraform-state-locks` table must exist with:
   - Primary key: `LockID` (String)
   - Billing mode: On-demand or provisioned
3. **AWS credentials**: Your profile `908027381725_AdministratorAccess` must have the appropriate permissions:
   - S3: `GetObject`, `PutObject`, `DeleteObject` on the state bucket
   - DynamoDB: `GetItem`, `PutItem`, `DeleteItem` on the locks table
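If either resource is missing, it can be bootstrapped once from a separate Terraform configuration with local state. A minimal sketch; everything beyond the names above is an assumption:
```hcl
# Bootstrap config for the remote backend — apply once with local state.
resource "aws_s3_bucket" "terraform_states" {
  bucket = "gg-ai-terraform-states"
}

# Server-side encryption on the state bucket (algorithm is an assumption)
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_states" {
  bucket = aws_s3_bucket.terraform_states.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Lock table: Terraform only requires a `LockID` string hash key
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```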
### Troubleshooting Backend Issues
**Error: "NoSuchBucket"**
```bash
# Verify bucket exists
aws s3 ls s3://gg-ai-terraform-states --profile "908027381725_AdministratorAccess"
```
**Error: "ResourceNotFoundException" (DynamoDB)**
```bash
# Verify DynamoDB table exists
aws dynamodb describe-table \
  --table-name terraform-state-locks \
  --profile "908027381725_AdministratorAccess" \
  --region us-east-1
```
**Error: "AccessDenied"**
```bash
# Verify AWS profile configuration
aws sts get-caller-identity --profile "908027381725_AdministratorAccess"
```
## Deployment Steps
### Phase 1: Infrastructure Deployment
```bash
# 1. Initialize Terraform (from this directory)
cd iac
terraform init

# 2. Review the plan
terraform plan

# 3. Deploy infrastructure
terraform apply
```
### Phase 2: Testing and Validation
```bash
# 1. Get ALB DNS name
ALB_DNS=$(terraform output -raw alb_dns_name)

# 2. Test health endpoint
curl -H "Host: ai-glondon.msappproxy.net" http://$ALB_DNS/health

# 3. Monitor service health
aws ecs describe-services \
  --cluster webUIcluster2 \
  --services webui3-scaled \
  --profile "908027381725_AdministratorAccess" \
  --region us-east-1
```
### Phase 3: Traffic Switch
1. **Update the Entra App Proxy backend URL**:
   - FROM: the current service discovery endpoint
   - TO: `http://<alb-dns-name>` (from `terraform output`)
2. **Monitor the dashboard**:
   ```bash
   terraform output -raw dashboard_url
   ```
3. **After validation, scale down the old service**:
   ```bash
   aws ecs update-service \
     --cluster webUIcluster2 \
     --service webui3 \
     --desired-count 0 \
     --profile "908027381725_AdministratorAccess" \
     --region us-east-1
   ```
## Configuration Details
### Resource Allocation
| Component | Specification |
|-----------|---------------|
| ECS Instances | 3x (2 vCPU, 8GB RAM) |
| Redis | cache.t3.micro, encrypted |
| ALB | Internal, sticky sessions |
| Auto Scaling | 2-10 instances, 70% CPU target |
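The auto scaling row corresponds to an Application Auto Scaling target-tracking policy. A minimal sketch using the bounds and target from the table; resource names are illustrative, not necessarily those used in this configuration:
```hcl
# Register the ECS service as a scalable target (2–10 tasks)
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/webUIcluster2/webui3-scaled"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

# Track average CPU at 70%: scale out above, scale in below
resource "aws_appautoscaling_policy" "cpu" {
  name               = "webui3-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 70
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```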
### Environment Variables Added
```bash
WEBSOCKET_MANAGER=redis
WEBSOCKET_REDIS_URL=<redis-endpoint>
ENABLE_WEBSOCKET_SUPPORT=true
WEBUI_SECRET_KEY=<shared-secret>
UVICORN_WORKERS=2
THREAD_POOL_SIZE=20
```
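These variables are injected through the ECS task definition's container definitions. A minimal sketch, assuming the standard OpenWebUI image and a hypothetical `var.redis_url`; the real configuration may differ:
```hcl
# Illustrative fragment carrying the variables above; the jsonencode
# structure follows the standard aws_ecs_task_definition schema.
resource "aws_ecs_task_definition" "webui_scaled" {
  family                   = "webui3-scaled"
  cpu                      = "2048" # 2 vCPU
  memory                   = "8192" # 8 GB
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"

  container_definitions = jsonencode([{
    name  = "open-webui"
    image = "ghcr.io/open-webui/open-webui:main"
    environment = [
      { name = "WEBSOCKET_MANAGER", value = "redis" },
      { name = "WEBSOCKET_REDIS_URL", value = var.redis_url }, # hypothetical variable
      { name = "ENABLE_WEBSOCKET_SUPPORT", value = "true" },
      { name = "UVICORN_WORKERS", value = "2" },
      { name = "THREAD_POOL_SIZE", value = "20" },
    ]
    # WEBUI_SECRET_KEY is a shared credential — deliver it via the
    # `secrets` block from AWS Secrets Manager, not plain `environment`.
  }])
}
```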
## Data Loss Prevention
**Zero Data Loss Guarantee**:
- Database: Aurora PostgreSQL unchanged
- Files: EFS shared across all instances
- Sessions: Redis-backed session storage
- Configuration: Database-backed syncing
## Cost Impact
| Resource | Monthly Cost |
|----------|--------------|
| ECS Fargate (3 smaller tasks) | ~$400 |
| ElastiCache Redis | ~$17 |
| Internal ALB | ~$25 |
| **Net increase over current setup** | **~$100/month** |
## Monitoring
### CloudWatch Dashboard
Access via: `terraform output dashboard_url`
**Metrics Tracked**:
- ECS CPU/Memory utilization
- ALB response times and error rates
- Redis performance metrics
- Auto scaling events
### Alarms
- **High CPU** (>85%): Scale out trigger (sketched below)
- **High Memory** (>90%): Resource pressure alert
- **ALB Response Time** (>2s): Performance degradation
- **Redis CPU** (>75%): Cache performance
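As an illustration, the high-CPU alarm above could be declared as follows; the threshold comes from the list, while the alarm name, period, and evaluation count are assumptions:
```hcl
# Sketch of the >85% CPU alarm on the scaled ECS service.
resource "aws_cloudwatch_metric_alarm" "ecs_high_cpu" {
  alarm_name          = "webui3-scaled-high-cpu" # illustrative name
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60 # assumed
  evaluation_periods  = 3  # assumed
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    ClusterName = "webUIcluster2"
    ServiceName = "webui3-scaled"
  }
}
```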
## Cleanup
```bash
# Destroy infrastructure (keeps data)
terraform destroy
# Note: Aurora, EFS, and original service remain unchanged
```
## Security
- **Network**: Private subnets, security group isolation (sketched below)
- **Encryption**: Redis and EFS encryption enabled
- **Secrets**: AWS Secrets Manager for credentials
- **Access**: Internal ALB, no public internet exposure
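A minimal sketch of the security-group isolation described above, with illustrative names, an assumed container port 8080, and a hypothetical `var.vpc_id`; the ECS tasks only accept traffic from the internal ALB:
```hcl
# One group for the internal ALB, one for the ECS tasks.
resource "aws_security_group" "alb" {
  name   = "webui-internal-alb"
  vpc_id = var.vpc_id # hypothetical variable
}

resource "aws_security_group" "ecs" {
  name   = "webui-ecs-tasks"
  vpc_id = var.vpc_id
}

# Tasks admit traffic only when it originates from the ALB's group.
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs.id
  source_security_group_id = aws_security_group.alb.id
}
```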
## Blue-Green Deployment Strategy
![diagram](./blue-green-deployment-strategy.png)
---
**Important**: This deployment maintains the existing production service while adding horizontal scaling capability. Always test thoroughly before switching traffic.