# Grafana OTEL Standalone Deployment

This directory contains a complete example for deploying the Grafana OTEL monitoring stack as a standalone service, independent from the main OpenWebUI infrastructure.

**Location**: This deployment example is located in `iac/grafana-standalone/` and uses the module from `iac/modules/grafana-otel/`.

## Quick Start

### 1. Prerequisites

- AWS CLI configured with appropriate permissions
- Terraform >= 1.0 installed
- Existing ECS cluster
- VPC with private subnets
- Access to S3 bucket `gg-ai-terraform-states` for state storage
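
The tool-related prerequisites above can be checked before running Terraform. The following is a minimal sketch, not part of this repository: the helper names `require_cmds` and `version_ge` are hypothetical, and `version_ge` assumes a `sort` that supports `-V` (GNU coreutils).

```bash
#!/usr/bin/env bash
# Preflight sketch: tool availability and a minimum-version comparison.

# Return 0 when every named command is on PATH; report the missing ones otherwise.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Return 0 when version $1 is greater than or equal to version $2.
# Uses version sort: the smaller of the two versions sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (commented out so that sourcing this file has no side effects):
#   require_cmds aws terraform || exit 1
#   tf_version="$(terraform version -json | sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p')"
#   version_ge "$tf_version" "1.0.0" || echo "Terraform >= 1.0 is required" >&2
```

The ECS cluster, VPC, and S3 bucket prerequisites depend on your AWS account and are not covered by this sketch.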

### 2. Configuration

1. Copy the example variables file:
   ```bash
   cp terraform.tfvars.example terraform.tfvars
   ```

2. Edit `terraform.tfvars` with your environment values:
   ```hcl
   # Required: Update these values for your environment
   vpc_id             = "vpc-your-vpc-id"
   private_subnet_ids = ["subnet-12345", "subnet-67890"]
   cluster_name       = "your-ecs-cluster"

   # Optional: Customize as needed
   grafana_admin_password = "your-secure-password"
   allowed_cidr_blocks    = ["your-vpn-cidr/24"]
   ```

### 3. Deploy

```bash
# Initialize Terraform with remote backend
terraform init

# Review the plan
terraform plan

# Deploy the infrastructure
terraform apply
```

**Note**: If you encounter AWS credential errors during `terraform init`, check that your AWS CLI session is still active:
```bash
# Verify that the profile's credentials are valid
# (this call does not refresh them; log in again if it fails)
aws sts get-caller-identity --profile 908027381725_AdministratorAccess
```

## Remote State Backend

This deployment stores its Terraform state in an S3 remote backend with the following configuration:

```hcl
terraform {
  backend "s3" {
    bucket         = "gg-ai-terraform-states"
    key            = "production/grafana-monitoring/terraform.tfstate"
    region         = "us-east-1"
    profile        = "908027381725_AdministratorAccess"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
```

### Key Benefits:

- **Team Collaboration**: Multiple team members can work with the same state
- **State Locking**: The DynamoDB table prevents concurrent modifications
- **Encryption**: State file is encrypted at rest
- **Separate State**: Independent from main OpenWebUI infrastructure state
- **Versioning**: S3 bucket versioning, when enabled on the bucket, provides state history and recovery

### State Path Structure:

- **Main Infrastructure**: `production/gravity-ai-chat/terraform.tfstate`
- **Grafana Monitoring**: `production/grafana-monitoring/terraform.tfstate`

This separation allows independent deployment and management of the monitoring stack.

### 4. Access Grafana

After deployment, Terraform will output the access information:

```bash
# Get the Grafana URL and credentials
terraform output grafana_dashboard_url
terraform output -json grafana_admin_credentials

# Get setup instructions
terraform output -raw setup_instructions
```

## Configuration Options

### Basic Configuration

For a simple deployment with default settings:

```hcl
# terraform.tfvars
vpc_id             = "vpc-12345678"
private_subnet_ids = ["subnet-12345", "subnet-67890"]
cluster_name       = "my-cluster"
```

### Production Configuration

For a production deployment with custom settings:

```hcl
# terraform.tfvars
environment = "production"
name_prefix = "prod-grafana"

# Increased resources
cpu           = 2048
memory        = 4096
desired_count = 2

# Autoscaling enabled
enable_autoscaling = true
max_capacity       = 3
min_capacity       = 2

# Longer log retention
log_retention_days = 30

# Custom Grafana credentials
grafana_admin_user     = "monitoring-admin"
grafana_admin_password = "very-secure-password-123"

# Network access from specific CIDRs
allowed_cidr_blocks = [
  "192.168.1.0/24", # Office network
  "10.100.0.0/16",  # VPN network
]

# Applications that will send telemetry
otlp_sources_security_group_ids = [
  "sg-app1-security-group",
  "sg-app2-security-group",
]
```

### Integration with Existing Service Discovery

If you have an existing service discovery namespace:

```hcl
# Use existing namespace
service_discovery_namespace_id = "ns-existing-12345"
service_name                   = "monitoring"
```

## Integration with Applications

After deploying Grafana, configure your applications to send telemetry data:

### 1. Add Application Security Groups

Update your `terraform.tfvars`:

```hcl
otlp_sources_security_group_ids = [
  "sg-your-app-security-group",
]
```

Then run `terraform apply` to update the security group rules.

### 2. Configure Application Environment Variables

In your application deployment (ECS task definition, Kubernetes deployment, etc.):

```bash
# OpenTelemetry configuration
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-monitor.grafana-monitoring:4317
OTEL_EXPORTER_OTLP_INSECURE=true
OTEL_SERVICE_NAME=my-application
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=production
```

### 3. Verify Integration

```bash
# Check service discovery
nslookup otel-monitor.grafana-monitoring

# Test OTLP endpoint connectivity
# (4317 is the OTLP gRPC port, so expect a TCP connection but not a normal HTTP response)
curl -v http://otel-monitor.grafana-monitoring:4317

# Check that the Grafana UI responds
curl http://otel-monitor.grafana-monitoring:3000
```
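
Before running the network checks, it can help to confirm that the application environment actually carries the OpenTelemetry settings from step 2. The sketch below is illustrative only: `check_otel_env` is a hypothetical helper, and its endpoint check is a simple shape test (scheme plus port), not a full URL validation.

```bash
# Report each missing or malformed OTEL setting; return non-zero if any were found.
check_otel_env() {
  local status=0
  if [ -z "${OTEL_SERVICE_NAME:-}" ]; then
    echo "OTEL_SERVICE_NAME is not set" >&2
    status=1
  fi
  case "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" in
    "")
      echo "OTEL_EXPORTER_OTLP_ENDPOINT is not set" >&2
      status=1
      ;;
    http://*:[0-9]*|https://*:[0-9]*)
      # Scheme and port are both present.
      ;;
    *)
      echo "OTEL_EXPORTER_OTLP_ENDPOINT should look like http://host:port" >&2
      status=1
      ;;
  esac
  return "$status"
}
```

Run `check_otel_env` inside the application container or task shell; it prints nothing and returns 0 when both settings look plausible.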

## Monitoring and Maintenance

### Viewing Logs

```bash
# View Grafana container logs
aws logs tail /ecs/grafana-otel --follow

# Check ECS service events
aws ecs describe-services --cluster your-cluster --services grafana-otel
```

### Scaling

```bash
# Manual scaling (if autoscaling is disabled)
aws ecs update-service --cluster your-cluster --service grafana-otel --desired-count 2

# Update autoscaling settings via Terraform
# Edit terraform.tfvars and run terraform apply
```

### Updates

```bash
# Update to latest Grafana OTEL image
terraform apply -var="container_image=grafana/otel-lgtm:latest"

# Update configuration
# Edit terraform.tfvars and run terraform apply
```

## Troubleshooting

### Common Issues

1. **Service not starting**
   - Check CloudWatch logs for container errors
   - Verify ECS cluster has capacity
   - Check security group rules

2. **Cannot access Grafana UI**
   - Verify `allowed_cidr_blocks` includes your IP
   - Check VPC connectivity (VPN, bastion host)
   - Confirm service discovery is working

3. **No telemetry data**
   - Verify `otlp_sources_security_group_ids` includes your application's security group
   - Check application OTLP endpoint configuration
   - Confirm network connectivity between services
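
For the `allowed_cidr_blocks` check in issue 2, the membership test can be done by hand. The following is a small sketch for IPv4 only, assuming you already know the source IP your client presents inside the VPC. The helpers `ip_to_int` and `ip_in_cidr` are hypothetical names, and the arithmetic assumes a shell with 64-bit integers such as bash.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Return 0 when IPv4 address $1 falls inside CIDR block $2 (for example 10.100.0.0/16).
ip_in_cidr() {
  local ip_int net_int prefix mask
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  # Build the netmask; a /0 prefix matches every address.
  mask=$(( prefix == 0 ? 0 : (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}

# Example with the CIDRs from the production configuration:
ip_in_cidr 10.100.5.7 10.100.0.0/16 && echo "10.100.5.7 is allowed by the VPN range"
ip_in_cidr 192.168.2.10 192.168.1.0/24 || echo "192.168.2.10 is outside the office range"
```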

### Useful Commands

```bash
# Check service status
terraform show | grep -A 10 "aws_ecs_service"

# Verify service discovery
aws servicediscovery list-services

# Check security groups
aws ec2 describe-security-groups --group-ids $(terraform output -raw security_group_id)

# View all outputs
terraform output
```

## Cleanup

To remove all resources:

```bash
terraform destroy
```

## State Management Commands

### Working with Remote State

```bash
# Initialize with remote backend (first time setup)
terraform init

# Migrate from local to remote state (if you have existing local state)
terraform init -migrate-state

# View remote state
terraform show

# List resources in state
terraform state list

# Pull remote state to local (for inspection)
terraform state pull > current-state.json

# Check that the lock table exists and is active
aws dynamodb describe-table --table-name terraform-state-locks --profile 908027381725_AdministratorAccess

# List the items in the lock table to see whether a lock is currently held
aws dynamodb scan --table-name terraform-state-locks --profile 908027381725_AdministratorAccess
```

### State Recovery and Backup

```bash
# Download current state from S3
aws s3 cp s3://gg-ai-terraform-states/production/grafana-monitoring/terraform.tfstate ./backup-state.tfstate --profile 908027381725_AdministratorAccess

# List state versions (if bucket versioning is enabled)
aws s3api list-object-versions --bucket gg-ai-terraform-states --prefix production/grafana-monitoring/terraform.tfstate --profile 908027381725_AdministratorAccess

# Force unlock state (if locked and lock is stale)
terraform force-unlock LOCK_ID
```

## Security Considerations

- Store sensitive variables (passwords) in environment variables or use AWS Secrets Manager
- Restrict `allowed_cidr_blocks` to minimum required networks
- Use strong passwords for Grafana admin account
- Regularly update the Grafana OTEL container image
- Monitor CloudWatch logs for security events
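
One way to combine the first and third points is to generate the admin password in the shell and pass it through Terraform's `TF_VAR_<name>` environment variable convention instead of writing it to `terraform.tfvars`. This is a sketch under assumptions: `generate_password` is a hypothetical helper, and the variable name `grafana_admin_password` is taken from the examples in this README.

```bash
# Generate a random alphanumeric string of the requested length from /dev/urandom.
generate_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$1"
}

# Terraform maps the environment variable TF_VAR_<name> to the input variable <name>.
export TF_VAR_grafana_admin_password="$(generate_password 24)"
```

Values in `terraform.tfvars` take precedence over `TF_VAR_` environment variables, so remove the `grafana_admin_password` line from that file when using this approach. Depending on how the module uses the value, it may still be recorded in the Terraform state.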

## Cost Estimation

Default configuration (1 task, 1 vCPU, 2 GB RAM):
- ECS Fargate: ~$35-50/month
- CloudWatch Logs: ~$1-5/month (depending on log volume)
- Service Discovery: ~$0.50/month

Total estimated cost: ~$40-60/month
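
The Fargate line can be reproduced with simple arithmetic. The sketch below is only an illustration: the hourly rates are assumptions (us-east-1 Linux/x86 on-demand pricing at the time of writing) and 730 hours approximates one month. Check current AWS pricing before relying on the numbers. `fargate_monthly_cost` is a hypothetical helper.

```bash
# Assumed on-demand rates; adjust for your region and the current price list.
VCPU_HOUR_USD=0.04048
GB_HOUR_USD=0.004445
HOURS_PER_MONTH=730

# Monthly Fargate compute cost in USD.
# $1 = vCPUs per task, $2 = memory in GB per task, $3 = number of tasks
fargate_monthly_cost() {
  awk -v vcpu="$1" -v gb="$2" -v tasks="$3" \
      -v vr="$VCPU_HOUR_USD" -v gr="$GB_HOUR_USD" -v h="$HOURS_PER_MONTH" \
      'BEGIN { printf "%.2f\n", tasks * h * (vcpu * vr + gb * gr) }'
}

fargate_monthly_cost 1 2 1   # default configuration: prints 36.04
fargate_monthly_cost 2 4 2   # production example (2 tasks, 2 vCPU, 4 GB each): prints 144.16
```

Under these assumptions the default configuration lands at the low end of the ~$35-50 range, and the production example from this README costs roughly four times as much in compute.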

## Support

For issues or questions:
1. Check the module documentation: `../modules/grafana-otel/README.md`
2. Review Terraform and AWS documentation
3. Check CloudWatch logs for detailed error messages