Grafana OTEL Standalone Deployment
This directory contains a complete example for deploying the Grafana OTEL monitoring stack as a standalone service, independent from the main OpenWebUI infrastructure.
Location: This deployment example is located in iac/grafana-standalone/ and uses the module from iac/modules/grafana-otel/.
Quick Start
1. Prerequisites
- AWS CLI configured with appropriate permissions
- Terraform >= 1.0 installed
- Existing ECS cluster
- VPC with private subnets
- Access to S3 bucket gg-ai-terraform-states for state storage
2. Configuration
- Copy the example variables file:
cp terraform.tfvars.example terraform.tfvars
- Edit terraform.tfvars with your environment values:
# Required: Update these values for your environment
vpc_id = "vpc-your-vpc-id"
private_subnet_ids = ["subnet-12345", "subnet-67890"]
cluster_name = "your-ecs-cluster"
# Optional: Customize as needed
grafana_admin_password = "your-secure-password"
allowed_cidr_blocks = ["your-vpn-cidr/24"]
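Before running Terraform, it can help to confirm that the three required values are actually set. The following is a minimal sketch, not part of this repository: `check_required_vars` is a hypothetical helper that only looks for an uncommented assignment of each required name and does not validate the values themselves.

```shell
# Hypothetical helper: report any required variable that is not assigned
# in the given tfvars file. The three names come from the "Required" block above.
check_required_vars() {
  local file="$1" missing=0 name
  for name in vpc_id private_subnet_ids cluster_name; do
    # Match an uncommented assignment such as: vpc_id = "..."
    if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file"; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_required_vars terraform.tfvars && terraform plan`, so that the plan only runs when all three names are present.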
3. Deploy
# Initialize Terraform with remote backend
terraform init
# Review the plan
terraform plan
# Deploy the infrastructure
terraform apply
Note: If you encounter AWS credential errors during terraform init, ensure your AWS CLI session is active:
# Verify that the current credentials are valid (re-authenticate if this call fails)
aws sts get-caller-identity --profile 908027381725_AdministratorAccess
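The credential check can be wrapped so that `terraform init` only runs when the session is valid. This is a sketch under the assumption that a failing `aws sts get-caller-identity` call means the session is expired or misconfigured; `ensure_aws_session` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the given profile can call STS.
ensure_aws_session() {
  local profile="$1"
  if aws sts get-caller-identity --profile "$profile" >/dev/null 2>&1; then
    echo "credentials ok"
  else
    echo "credentials invalid or expired" >&2
    return 1
  fi
}
```

For example, `ensure_aws_session 908027381725_AdministratorAccess && terraform init` would stop before initialization if the check fails.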
Remote State Backend
This deployment uses an S3 remote backend for state management with the following configuration:
backend "s3" {
bucket = "gg-ai-terraform-states"
key = "production/grafana-monitoring/terraform.tfstate"
region = "us-east-1"
profile = "908027381725_AdministratorAccess"
dynamodb_table = "terraform-state-locks"
encrypt = true
}
Key Benefits:
- Team Collaboration: Multiple team members can work with the same state
- State Locking: DynamoDB table prevents concurrent modifications
- Encryption: State file is encrypted at rest
- Separate State: Independent from main OpenWebUI infrastructure state
- Versioning: S3 bucket versioning enables state history and recovery
State Path Structure:
- Main Infrastructure: production/gravity-ai-chat/terraform.tfstate
- Grafana Monitoring: production/grafana-monitoring/terraform.tfstate
This separation allows independent deployment and management of the monitoring stack.
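Both state paths follow the same pattern of environment, stack name, and a fixed file name. The sketch below makes that pattern explicit; `state_key` is a hypothetical helper for illustration and is not used by the Terraform configuration itself.

```shell
# Hypothetical helper: build a state key as <environment>/<stack>/terraform.tfstate.
state_key() {
  local environment="$1" stack="$2"
  printf '%s/%s/terraform.tfstate\n' "$environment" "$stack"
}

state_key production grafana-monitoring   # key used by this deployment
state_key production gravity-ai-chat      # key used by the main infrastructure
```

Because the two keys differ, the stacks never share a state file or a lock entry, which is what keeps their deployments independent.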
4. Access Grafana
After deployment, Terraform will output the access information:
# Get the Grafana URL and credentials
terraform output grafana_dashboard_url
terraform output -json grafana_admin_credentials
# Get setup instructions
terraform output -raw setup_instructions
Configuration Options
Basic Configuration
For a simple deployment with default settings:
# terraform.tfvars
vpc_id = "vpc-12345678"
private_subnet_ids = ["subnet-12345", "subnet-67890"]
cluster_name = "my-cluster"
Production Configuration
For a production deployment with custom settings:
# terraform.tfvars
environment = "production"
name_prefix = "prod-grafana"
# Increased resources
cpu = 2048
memory = 4096
desired_count = 2
# Autoscaling enabled
enable_autoscaling = true
max_capacity = 3
min_capacity = 2
# Longer log retention
log_retention_days = 30
# Custom Grafana credentials
grafana_admin_user = "monitoring-admin"
grafana_admin_password = "very-secure-password-123"
# Network access from specific CIDRs
allowed_cidr_blocks = [
"192.168.1.0/24", # Office network
"10.100.0.0/16", # VPN network
]
# Applications that will send telemetry
otlp_sources_security_group_ids = [
"sg-app1-security-group",
"sg-app2-security-group",
]
Integration with Existing Service Discovery
If you have an existing service discovery namespace:
# Use existing namespace
service_discovery_namespace_id = "ns-existing-12345"
service_name = "monitoring"
Integration with Applications
After deploying Grafana, configure your applications to send telemetry data:
1. Add Application Security Groups
Update your terraform.tfvars:
otlp_sources_security_group_ids = [
"sg-your-app-security-group",
]
Then run terraform apply to update the security group rules.
2. Configure Application Environment Variables
In your application deployment (ECS task definition, Kubernetes deployment, etc.):
# OpenTelemetry configuration
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-monitor.grafana-monitoring:4317
OTEL_EXPORTER_OTLP_INSECURE=true
OTEL_SERVICE_NAME=my-application
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=production
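When several applications send telemetry, only the service name and resource attributes change between them. The following sketch generates the settings per service; `otel_env` is a hypothetical helper, and the default endpoint is simply the one from the example above.

```shell
# Hypothetical helper: print the OTLP settings for one service as KEY=VALUE lines.
otel_env() {
  local service="$1" version="$2" environment="$3"
  # Default endpoint taken from the example above (service discovery name, port 4317).
  local endpoint="${4:-http://otel-monitor.grafana-monitoring:4317}"
  printf 'OTEL_EXPORTER_OTLP_ENDPOINT=%s\n' "$endpoint"
  printf 'OTEL_EXPORTER_OTLP_INSECURE=true\n'
  printf 'OTEL_SERVICE_NAME=%s\n' "$service"
  printf 'OTEL_RESOURCE_ATTRIBUTES=service.version=%s,deployment.environment=%s\n' \
    "$version" "$environment"
}

otel_env my-application 1.0.0 production
```

The output can be redirected to an env file or copied into a task definition, depending on how the application is deployed.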
3. Verify Integration
# Check service discovery
nslookup otel-monitor.grafana-monitoring
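# Test OTLP endpoint connectivity (port 4317 serves gRPC, so curl may report a
# protocol error; a refused connection or a timeout points to a network problem)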
curl http://otel-monitor.grafana-monitoring:4317
# Access Grafana dashboard
curl http://otel-monitor.grafana-monitoring:3000
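Since the check on port 4317 is really about reachability rather than an HTTP response, a plain TCP probe may give a clearer answer. This is a minimal sketch that assumes `bash` and the coreutils `timeout` command are available on the host running the check; `check_port` is a hypothetical helper.

```shell
# Hypothetical helper: report whether a TCP connection to host:port can be opened.
check_port() {
  local host="$1" port="$2"
  # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>;
  # timeout bounds the attempt so a filtered port does not hang the check.
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "open: ${host}:${port}"
  else
    echo "unreachable: ${host}:${port}"
    return 1
  fi
}
```

Running `check_port otel-monitor.grafana-monitoring 4317` and `check_port otel-monitor.grafana-monitoring 3000` from an application host covers both the OTLP and the Grafana ports.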
Monitoring and Maintenance
Viewing Logs
# View Grafana container logs
aws logs tail /ecs/grafana-otel --follow
# Check ECS service events
aws ecs describe-services --cluster your-cluster --services grafana-otel
Scaling
# Manual scaling (if autoscaling is disabled)
aws ecs update-service --cluster your-cluster --service grafana-otel --desired-count 2
# Update autoscaling settings via Terraform
# Edit terraform.tfvars and run terraform apply
Updates
# Update to latest Grafana OTEL image
terraform apply -var="container_image=grafana/otel-lgtm:latest"
# Update configuration
# Edit terraform.tfvars and run terraform apply
Troubleshooting
Common Issues
- Service not starting
  - Check CloudWatch logs for container errors
  - Verify ECS cluster has capacity
  - Check security group rules
- Cannot access Grafana UI
  - Verify allowed_cidr_blocks includes your IP
  - Check VPC connectivity (VPN, bastion host)
  - Confirm service discovery is working
- No telemetry data
  - Verify otlp_sources_security_group_ids
  - Check application OTLP endpoint configuration
  - Confirm network connectivity between services
Useful Commands
# Check service status
terraform show | grep -A 10 "aws_ecs_service"
# Verify service discovery
aws servicediscovery list-services
# Check security groups
aws ec2 describe-security-groups --group-ids $(terraform output -raw security_group_id)
# View all outputs
terraform output
Cleanup
To remove all resources:
terraform destroy
State Management Commands
Working with Remote State
# Initialize with remote backend (first time setup)
terraform init
# Migrate from local to remote state (if you have existing local state)
terraform init -migrate-state
# View remote state
terraform show
# List resources in state
terraform state list
# Pull remote state to local (for inspection)
terraform state pull > current-state.json
# Check that the state lock table exists and is active
aws dynamodb describe-table --table-name terraform-state-locks --profile 908027381725_AdministratorAccess
State Recovery and Backup
# Download current state from S3
aws s3 cp s3://gg-ai-terraform-states/production/grafana-monitoring/terraform.tfstate ./backup-state.tfstate --profile 908027381725_AdministratorAccess
# List state versions (if bucket versioning is enabled)
aws s3api list-object-versions --bucket gg-ai-terraform-states --prefix production/grafana-monitoring/terraform.tfstate --profile 908027381725_AdministratorAccess
# Force unlock state (if locked and lock is stale)
terraform force-unlock LOCK_ID
Security Considerations
- Store sensitive variables (passwords) in environment variables or use AWS Secrets Manager
- Restrict allowed_cidr_blocks to minimum required networks
- Use strong passwords for Grafana admin account
- Regularly update the Grafana OTEL container image
- Monitor CloudWatch logs for security events
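One way to keep the admin password out of terraform.tfvars is Terraform's `TF_VAR_<name>` environment variables, which Terraform reads as values for the input variable of the same name. The sketch below assumes the variable is named `grafana_admin_password`, as in the examples above; `set_grafana_password` is a hypothetical helper.

```shell
# Hypothetical helper: read one line from stdin and expose it to Terraform
# as the value of the grafana_admin_password input variable.
set_grafana_password() {
  local password
  IFS= read -r password || true
  export TF_VAR_grafana_admin_password="$password"
}
```

Feed the function through a redirection (for example a here-document or a file) rather than a pipe, because a pipe typically runs the function in a subshell and the exported value would not reach the shell that later runs `terraform apply`.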
Cost Estimation
Default configuration (1 task, 1 vCPU, 2GB RAM):
- ECS Fargate: ~$35-50/month
- CloudWatch Logs: ~$1-5/month (depending on log volume)
- Service Discovery: ~$0.50/month
Total estimated cost: ~$40-60/month
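The Fargate figure can be reproduced with a short calculation. The rates in this sketch are assumptions (on-demand Linux/x86 pricing for us-east-1 at the time of writing, which may change), and `fargate_monthly_cost` is a hypothetical helper; check the current AWS pricing page before relying on the result.

```shell
# Hypothetical helper: estimate monthly Fargate compute cost in USD.
# Arguments: number of tasks, vCPU per task, memory in GB per task.
fargate_monthly_cost() {
  local tasks="$1" vcpu="$2" memory_gb="$3"
  awk -v t="$tasks" -v c="$vcpu" -v m="$memory_gb" 'BEGIN {
    vcpu_rate = 0.04048    # assumed USD per vCPU-hour
    mem_rate  = 0.004445   # assumed USD per GB-hour
    hours     = 730        # average hours in a month
    printf "%.2f\n", t * (c * vcpu_rate + m * mem_rate) * hours
  }'
}

fargate_monthly_cost 1 1 2   # default configuration, prints 36.04
```

Under the same assumed rates, the production example above (2 tasks, 2 vCPU, 4 GB each) comes to `fargate_monthly_cost 2 2 4`, or about four times the default.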
Support
For issues or questions:
- Check the module documentation: ../modules/grafana-otel/README.md
- Review Terraform and AWS documentation
- Check CloudWatch logs for detailed error messages