# Grafana OTEL Standalone Deployment
This directory contains a complete example for deploying the Grafana OTEL monitoring stack as a standalone service, independent from the main OpenWebUI infrastructure.
**Location**: This deployment example is located in `iac/grafana-standalone/` and uses the module from `iac/modules/grafana-otel/`.
## Quick Start
### 1. Prerequisites
- AWS CLI configured with appropriate permissions
- Terraform >= 1.0 installed
- Existing ECS cluster
- VPC with private subnets
- Access to S3 bucket `gg-ai-terraform-states` for state storage
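
The prerequisites above can be checked from a shell before running Terraform. The following is a minimal sketch, not part of this repository: the `version_ge` and `preflight` helper names are hypothetical, and it assumes a `sort` that supports `-V` (version sort) and that AWS credentials are already selected (for example through `AWS_PROFILE`).

```bash
# Hypothetical preflight helpers; not part of this repository.

# version_ge A B: succeeds when version A >= version B (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# preflight: report whether the tools and bucket from the list above are usable.
preflight() {
  local tf_version
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found"; return 1
  fi
  if ! command -v terraform >/dev/null 2>&1; then
    echo "terraform not found"; return 1
  fi
  # The first line of `terraform version` looks like "Terraform v1.5.7".
  tf_version="$(terraform version | head -n1 | sed 's/^Terraform v//')"
  if ! version_ge "$tf_version" "1.0.0"; then
    echo "Terraform $tf_version is older than 1.0"; return 1
  fi
  # head-bucket succeeds only if the bucket exists and you may access it.
  if ! aws s3api head-bucket --bucket gg-ai-terraform-states; then
    echo "cannot access state bucket"; return 1
  fi
  echo "preflight OK"
}
```

After pasting the functions into a shell, run `preflight`. It does not check the ECS cluster or the VPC, which depend on your environment.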
### 2. Configuration
1. Copy the example variables file:
```bash
cp terraform.tfvars.example terraform.tfvars
```
2. Edit `terraform.tfvars` with your environment values:
```hcl
# Required: Update these values for your environment
vpc_id = "vpc-your-vpc-id"
private_subnet_ids = ["subnet-12345", "subnet-67890"]
cluster_name = "your-ecs-cluster"
# Optional: Customize as needed
grafana_admin_password = "your-secure-password"
allowed_cidr_blocks = ["your-vpn-cidr/24"]
```
### 3. Deploy
```bash
# Initialize Terraform with remote backend
terraform init
# Review the plan
terraform plan
# Deploy the infrastructure
terraform apply
```
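**Note**: If `terraform init` fails with AWS credential errors, confirm that your AWS CLI session is still valid:
```bash
# Verify that the profile's credentials are valid (this call does not refresh them)
aws sts get-caller-identity --profile 908027381725_AdministratorAccess
# If the call fails and the profile uses IAM Identity Center (SSO), sign in again:
# aws sso login --profile 908027381725_AdministratorAccess
```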
## Remote State Backend
This deployment uses an S3 remote backend for state management with the following configuration:
```hcl
backend "s3" {
bucket = "gg-ai-terraform-states"
key = "production/grafana-monitoring/terraform.tfstate"
region = "us-east-1"
profile = "908027381725_AdministratorAccess"
dynamodb_table = "terraform-state-locks"
encrypt = true
}
```
### Key Benefits:
- **Team Collaboration**: Multiple team members can work with the same state
- **State Locking**: DynamoDB table prevents concurrent modifications
- **Encryption**: State file is encrypted at rest
- **Separate State**: Independent from main OpenWebUI infrastructure state
- **Versioning**: With S3 bucket versioning enabled, earlier state versions can be listed and restored
### State Path Structure:
- **Main Infrastructure**: `production/gravity-ai-chat/terraform.tfstate`
- **Grafana Monitoring**: `production/grafana-monitoring/terraform.tfstate`

This separation allows independent deployment and management of the monitoring stack.
### 4. Access Grafana
After deployment, Terraform will output the access information:
```bash
# Get the Grafana URL and credentials
terraform output grafana_dashboard_url
terraform output -json grafana_admin_credentials
# Get setup instructions
terraform output -raw setup_instructions
```
## Configuration Options
### Basic Configuration
For a simple deployment with default settings:
```hcl
# terraform.tfvars
vpc_id = "vpc-12345678"
private_subnet_ids = ["subnet-12345", "subnet-67890"]
cluster_name = "my-cluster"
```
### Production Configuration
For a production deployment with custom settings:
```hcl
# terraform.tfvars
environment = "production"
name_prefix = "prod-grafana"
# Increased resources
cpu = 2048
memory = 4096
desired_count = 2
# Autoscaling enabled
enable_autoscaling = true
max_capacity = 3
min_capacity = 2
# Longer log retention
log_retention_days = 30
# Custom Grafana credentials
grafana_admin_user = "monitoring-admin"
grafana_admin_password = "very-secure-password-123"
# Network access from specific CIDRs
allowed_cidr_blocks = [
"192.168.1.0/24", # Office network
"10.100.0.0/16", # VPN network
]
# Applications that will send telemetry
otlp_sources_security_group_ids = [
"sg-app1-security-group",
"sg-app2-security-group",
]
```
### Integration with Existing Service Discovery
If you have an existing service discovery namespace:
```hcl
# Use existing namespace
service_discovery_namespace_id = "ns-existing-12345"
service_name = "monitoring"
```
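
To find the ID of an existing namespace, you can query AWS Cloud Map by name. This is a minimal sketch: the `namespace_id` helper is hypothetical, and `grafana-monitoring` is only an example name taken from the endpoint used elsewhere in this README.

```bash
# Hypothetical helper: print the ID of a Cloud Map namespace, looked up by name.
namespace_id() {
  aws servicediscovery list-namespaces \
    --query "Namespaces[?Name=='$1'].Id" \
    --output text
}

# Example: namespace_id grafana-monitoring
```

The printed value is what `service_discovery_namespace_id` expects. If nothing is printed, no namespace with that name exists in the current account and region.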
## Integration with Applications
After deploying Grafana, configure your applications to send telemetry data:
### 1. Add Application Security Groups
Update your `terraform.tfvars`:
```hcl
otlp_sources_security_group_ids = [
"sg-your-app-security-group",
]
```
Then run `terraform apply` to update the security group rules.
### 2. Configure Application Environment Variables
In your application deployment (ECS task definition, Kubernetes deployment, etc.):
```bash
# OpenTelemetry configuration
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-monitor.grafana-monitoring:4317
OTEL_EXPORTER_OTLP_INSECURE=true
OTEL_SERVICE_NAME=my-application
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=production
```
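
If you change `service_name` or deploy into a different namespace, the endpoint hostname changes with it. The sketch below assumes the service discovery DNS name has the form `<service_name>.<namespace>` and that the defaults are `otel-monitor` and `grafana-monitoring`, as inferred from the endpoint above. The `otlp_endpoint` helper is hypothetical.

```bash
# Hypothetical helper: build the OTLP endpoint from the service discovery names.
# Assumes the DNS name is <service_name>.<namespace> and OTLP/gRPC on port 4317.
otlp_endpoint() {
  local service="${1:-otel-monitor}"
  local namespace="${2:-grafana-monitoring}"
  local port="${3:-4317}"
  printf 'http://%s.%s:%s\n' "$service" "$namespace" "$port"
}

# With service_name = "monitoring" in terraform.tfvars:
export OTEL_EXPORTER_OTLP_ENDPOINT="$(otlp_endpoint monitoring grafana-monitoring)"
echo "$OTEL_EXPORTER_OTLP_ENDPOINT"
```

Check the result against the Terraform outputs before relying on it, since the module may name things differently.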
### 3. Verify Integration
```bash
# Check service discovery
nslookup otel-monitor.grafana-monitoring
# Test OTLP endpoint connectivity (4317 is gRPC, so check the TCP port instead of sending an HTTP request)
nc -zv otel-monitor.grafana-monitoring 4317
# Check that Grafana responds
curl -i http://otel-monitor.grafana-monitoring:3000/api/health
```
## Monitoring and Maintenance
### Viewing Logs
```bash
# View Grafana container logs
aws logs tail /ecs/grafana-otel --follow
# Check ECS service events
aws ecs describe-services --cluster your-cluster --services grafana-otel
```
### Scaling
```bash
# Manual scaling (if autoscaling is disabled)
aws ecs update-service --cluster your-cluster --service grafana-otel --desired-count 2
# Update autoscaling settings via Terraform
# Edit terraform.tfvars and run terraform apply
```
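
After a manual scaling change it can help to wait until ECS reports the service as stable. A minimal sketch, assuming the cluster and service names used above; the `scale_and_wait` helper is hypothetical.

```bash
# Hypothetical helper: set the desired count, then block until the service settles.
scale_and_wait() {
  local cluster="$1" service="$2" count="$3"
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count "$count" >/dev/null || return 1
  # `aws ecs wait services-stable` polls until the deployment reaches a steady state.
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}

# Example: scale_and_wait your-cluster grafana-otel 2
```

If Terraform manages `desired_count`, the next `terraform apply` may revert a manual change, so treat this as a temporary measure.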
### Updates
```bash
# Update to latest Grafana OTEL image
terraform apply -var="container_image=grafana/otel-lgtm:latest"
# Update configuration
# Edit terraform.tfvars and run terraform apply
```
## Troubleshooting
### Common Issues
1. **Service not starting**
   - Check CloudWatch logs for container errors
   - Verify ECS cluster has capacity
   - Check security group rules
2. **Cannot access Grafana UI**
   - Verify `allowed_cidr_blocks` includes your IP
   - Check VPC connectivity (VPN, bastion host)
   - Confirm service discovery is working
3. **No telemetry data**
   - Verify `otlp_sources_security_group_ids`
   - Check application OTLP endpoint configuration
   - Confirm network connectivity between services
### Useful Commands
```bash
# Check service status
terraform show | grep -A 10 "aws_ecs_service"
# Verify service discovery
aws servicediscovery list-services
# Check security groups
aws ec2 describe-security-groups --group-ids $(terraform output -raw security_group_id)
# View all outputs
terraform output
```
## Cleanup
To remove all resources:
```bash
terraform destroy
```
## State Management Commands
### Working with Remote State
```bash
# Initialize with remote backend (first time setup)
terraform init
# Migrate from local to remote state (if you have existing local state)
terraform init -migrate-state
# View remote state
terraform show
# List resources in state
terraform state list
# Pull remote state to local (for inspection)
terraform state pull > current-state.json
# Confirm the lock table exists (describe-table shows table metadata, not the locks currently held)
aws dynamodb describe-table --table-name terraform-state-locks --profile 908027381725_AdministratorAccess
```
### State Recovery and Backup
```bash
# Download current state from S3
aws s3 cp s3://gg-ai-terraform-states/production/grafana-monitoring/terraform.tfstate ./backup-state.tfstate --profile 908027381725_AdministratorAccess
# List state versions (if bucket versioning is enabled)
aws s3api list-object-versions --bucket gg-ai-terraform-states --prefix production/grafana-monitoring/terraform.tfstate --profile 908027381725_AdministratorAccess
# Force unlock state (if locked and lock is stale)
terraform force-unlock LOCK_ID
```
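
Listing state versions only returns history if versioning is turned on for the bucket. The following sketch checks that first; the `versioning_enabled` helper is hypothetical, and credentials are assumed to come from the environment (for example `AWS_PROFILE`).

```bash
# Hypothetical helper: succeed only when versioning is enabled on the bucket.
versioning_enabled() {
  local status
  status="$(aws s3api get-bucket-versioning --bucket "$1" \
    --query Status --output text)"
  # Status is "Enabled" or "Suspended"; the CLI prints "None" if it was never set.
  [ "$status" = "Enabled" ]
}

# Example:
# versioning_enabled gg-ai-terraform-states && echo "state history available"
```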
## Security Considerations
- Keep sensitive values such as `grafana_admin_password` out of version control: pass them as environment variables or load them from AWS Secrets Manager
- Restrict `allowed_cidr_blocks` to minimum required networks
- Use strong passwords for Grafana admin account
- Regularly update the Grafana OTEL container image
- Monitor CloudWatch logs for security events
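
One way to keep the admin password out of `terraform.tfvars` is to load it from AWS Secrets Manager into a `TF_VAR_` environment variable, which Terraform maps to the input variable of the same name. This is a sketch under assumptions: the `load_grafana_password` helper and the secret name in the example are hypothetical, and the secret is assumed to store the password as a plain string.

```bash
# Hypothetical helper: load the Grafana admin password from AWS Secrets Manager
# into the environment variable Terraform reads for `grafana_admin_password`.
load_grafana_password() {
  local secret_id="$1" value
  value="$(aws secretsmanager get-secret-value --secret-id "$secret_id" \
    --query SecretString --output text)" || return 1
  # Terraform maps TF_VAR_<name> to the input variable <name>.
  export TF_VAR_grafana_admin_password="$value"
}

# Example (the secret name is an assumption):
# load_grafana_password grafana/admin-password && terraform apply
```

Remove `grafana_admin_password` from `terraform.tfvars` when using this approach, because values in that file take precedence over `TF_VAR_` environment variables.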
## Cost Estimation
Default configuration (1 task, 1 vCPU, 2 GB RAM):

- ECS Fargate: ~$35-50/month
- CloudWatch Logs: ~$1-5/month (depending on log volume)
- Service Discovery: ~$0.50/month

Total estimated cost: ~$40-60/month
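
The Fargate figure can be reproduced with simple arithmetic. The sketch below is illustrative only: the hourly rates are assumptions (on-demand Linux/x86 pricing in us-east-1 at the time of writing) and should be checked against the current AWS pricing page, the month is taken as 730 hours, and the `fargate_monthly_cost` helper is hypothetical.

```bash
# Hypothetical helper: estimate monthly Fargate compute cost in USD.
# Rates are assumptions (us-east-1, Linux/x86 on-demand); override via env vars.
fargate_monthly_cost() {
  local vcpu="$1" memory_gb="$2" tasks="${3:-1}"
  awk -v v="$vcpu" -v m="$memory_gb" -v t="$tasks" \
      -v vr="${VCPU_HOUR_USD:-0.04048}" -v mr="${GB_HOUR_USD:-0.004445}" \
      'BEGIN { printf "%.2f\n", t * (v * vr + m * mr) * 730 }'
}

fargate_monthly_cost 1 2 1   # default configuration
fargate_monthly_cost 2 4 2   # production example (cpu = 2048, memory = 4096, 2 tasks)
```

With the assumed rates this prints `36.04` and `144.16`. Data transfer, log volume, and any load balancer are not included.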
## Support
For issues or questions:
1. Check the module documentation: `../modules/grafana-otel/README.md`
2. Review Terraform and AWS documentation
3. Check CloudWatch logs for detailed error messages