CloudWatch Logs & Metrics

🎯 Task 12 goal: Set up CloudWatch Logs & Metrics to monitor the whole system: centralized logs, a dashboard, and alarms.

🔍 Task 12 Summary - Complete Monitoring

Task 12 collects logs and metrics from the entire system:

  • 📊 ECS Fargate: Container logs từ NestJS microservices
  • 🌐 API Gateway: Access logs + execution logs
  • 💾 DynamoDB: Performance metrics + CloudTrail integration
  • 🔄 CI/CD Pipeline: CodeBuild + CodePipeline logs
  • 🚨 Alarms: Alerts when a threshold is exceeded
  • 📈 Dashboard: Realtime monitoring view

🔗 Monitoring Architecture

ECS Fargate (NestJS) ─────┐
                          │
API Gateway ──────────────┼──→ CloudWatch Logs
                          │       │
DynamoDB ─────────────────┤       ├──→ CloudWatch Metrics
                          │       │       │
CI/CD Pipeline ───────────┘       │       ├──→ Alarms
                                  │       │
CloudTrail ───────────────────────┘       └──→ Dashboard

→ Centralized logs → Metrics → Alarms + monitoring dashboard


1. ECS CloudWatch Logs

1.1. Enable ECS Container Logging

  1. Update the ECS task definition to enable logging:
{
  "family": "vinashoes-user-service",
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "user-service",
      "image": "ACCOUNT.dkr.ecr.ap-southeast-1.amazonaws.com/vinashoes-user-service:latest",
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/vinashoes-user-service",
          "awslogs-region": "ap-southeast-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
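
The task definition above still has to be registered before a service can use it. A minimal sketch, assuming the JSON is saved locally as task-def.json (a hypothetical file name) and the ACCOUNT placeholders have been replaced:

```shell
# Sketch: register the task definition stored in a local JSON file.
# "task-def.json" is an assumed file name, not one prescribed by this guide.
register_task_def() {
  # $1: path to the task definition JSON (default: task-def.json)
  aws ecs register-task-definition \
    --cli-input-json "file://${1:-task-def.json}" \
    --region ap-southeast-1
}
```

Calling `register_task_def` creates a new revision of the vinashoes-user-service family each time it runs.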

1.2. Create CloudWatch Log Groups for the Services

📁 Log Groups Organization

Create a separate log group for each microservice:

  • User Service: /ecs/vinashoes-user-service
  • Product Service: /ecs/vinashoes-product-service
  • Order Service: /ecs/vinashoes-order-service
  • Cart Service: /ecs/vinashoes-cart-service
  • Payment Service: /ecs/vinashoes-payment-service
  1. CloudWatch Console → Logs → Create log group:
Log Group Settings:
  Log group name: "/ecs/vinashoes-user-service"
  Retention setting: 7 days (to keep costs down)
  
Repeat for all services:
  - /ecs/vinashoes-product-service
  - /ecs/vinashoes-order-service
  - /ecs/vinashoes-cart-service
  - /ecs/vinashoes-payment-service
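
The same log groups can also be created from the command line. A minimal sketch, assuming the five service names listed above and the 7-day retention from the console settings:

```shell
# Sketch: create one log group per service and apply a retention policy.
create_log_groups() {
  # $1: retention in days (default 7, matching the console setting above)
  for svc in user product order cart payment; do
    group="/ecs/vinashoes-${svc}-service"
    aws logs create-log-group --log-group-name "$group" --region ap-southeast-1
    aws logs put-retention-policy --log-group-name "$group" \
      --retention-in-days "${1:-7}" --region ap-southeast-1
  done
}
```

Run it as `create_log_groups 7`. Note that create-log-group returns an error for a group that already exists, so groups created in the console should be skipped or removed first.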

1.3. Verify ECS Logs

  1. Redeploy the ECS service so it picks up the new task definition:
# Update the ECS service with the new task definition.
# Passing only the family name selects the latest ACTIVE revision;
# ":LATEST" is not a valid revision suffix.
aws ecs update-service \
  --cluster vinashoes-cluster \
  --service vinashoes-user-service \
  --task-definition vinashoes-user-service \
  --region ap-southeast-1
  2. Check the logs in CloudWatch:
    • CloudWatch → Logs → Log groups → /ecs/vinashoes-user-service
    • Verify that container logs appear
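
The console check can also be done from a terminal. A minimal sketch, assuming AWS CLI v2 (which provides `aws logs tail`) and the log group naming used in this guide:

```shell
# Sketch: stream recent container logs for one service (requires AWS CLI v2).
tail_service_logs() {
  # $1: short service name, e.g. "user"; $2: look-back window (default 10m)
  aws logs tail "/ecs/vinashoes-${1}-service" \
    --since "${2:-10m}" --follow --region ap-southeast-1
}
```

For example, `tail_service_logs user 30m` prints the last 30 minutes of user-service logs and keeps following new events until interrupted.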

2. API Gateway Logging

2.1. Enable API Gateway CloudWatch Logs

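🔐 API Gateway Logging Requirements

API Gateway needs an IAM role to write logs:

  • CloudWatchLogsRole: Allows API Gateway to push logs
  • Account-level setting: Register the role ARN in the API Gateway account settings (per Region)
  1. Create the CloudWatch Logs role for API Gateway (permissions policy shown below; the role's trust policy must allow apigateway.amazonaws.com to assume it):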
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents",
        "logs:GetLogEvents",
        "logs:FilterLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
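
A minimal command-line sketch of the role setup. It assumes the role name APIGatewayCloudWatchLogsRole used later in this guide and, instead of the inline policy above, attaches the AWS managed policy AmazonAPIGatewayPushToCloudWatchLogs, which grants an equivalent set of logs permissions:

```shell
# Sketch: create the role API Gateway assumes to write logs, then register it
# in the API Gateway account settings for the Region.
create_apigw_logs_role() {
  # $1: AWS account ID
  trust='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"apigateway.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
  aws iam create-role --role-name APIGatewayCloudWatchLogsRole \
    --assume-role-policy-document "$trust"
  # Assumption: use the AWS managed policy rather than the inline JSON policy.
  aws iam attach-role-policy --role-name APIGatewayCloudWatchLogsRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs
  aws apigateway update-account --region ap-southeast-1 \
    --patch-operations "op=replace,path=/cloudwatchRoleArn,value=arn:aws:iam::${1}:role/APIGatewayCloudWatchLogsRole"
}
```

Call it with the account ID, for example `create_apigw_logs_role 123456789012`. A newly created role can take a few seconds to become usable, so the update-account step may need a retry.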

2.2. Configure API Gateway Logging

  1. API Gateway Console → Settings:
CloudWatch log role ARN: "arn:aws:iam::ACCOUNT:role/APIGatewayCloudWatchLogsRole"

Default Route Settings:
  ✅ Detailed CloudWatch Metrics
  ✅ CloudWatch Logs
  Log Level: INFO
  ✅ Log full requests/responses data
  ✅ Data trace
  2. Enable logging for specific stages:
Stage Configuration:
  Stage name: "prod"
  
Logs/Tracing:
  ✅ Enable CloudWatch Logs
  Log Level: INFO
  ✅ Log full requests/responses
  ✅ Enable detailed CloudWatch Metrics
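
The stage settings above map to patch operations on the stage. A minimal sketch, assuming a REST API (the `aws apigateway` commands do not apply to HTTP APIs) and a placeholder API ID supplied by the caller:

```shell
# Sketch: enable INFO-level execution logs, full request/response logging,
# and detailed metrics for every method on a stage.
enable_stage_logging() {
  # $1: REST API ID; $2: stage name (default prod)
  aws apigateway update-stage \
    --rest-api-id "$1" \
    --stage-name "${2:-prod}" \
    --region ap-southeast-1 \
    --patch-operations \
      'op=replace,path=/*/*/logging/loglevel,value=INFO' \
      'op=replace,path=/*/*/logging/dataTrace,value=true' \
      'op=replace,path=/*/*/metrics/enabled,value=true'
}
```

The `/*/*/` prefix applies each setting to all resources and methods of the stage. Full request/response logging (dataTrace) can write sensitive payloads to CloudWatch, so it may be worth leaving off outside of debugging sessions.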

3. DynamoDB CloudWatch Integration

3.1. Enable DynamoDB CloudWatch Metrics

  1. DynamoDB sends metrics to CloudWatch automatically:
Default DynamoDB Metrics:
  - ConsumedReadCapacityUnits
  - ConsumedWriteCapacityUnits  
  - ProvisionedReadCapacityUnits
  - ProvisionedWriteCapacityUnits
  - ThrottledRequests
  - SystemErrors
  - UserErrors

3.2. Enable DynamoDB CloudTrail (Optional)

📝 DynamoDB Audit Logging

CloudTrail captures DynamoDB API calls for auditing:

  • Management events: CreateTable, DeleteTable
  • Data events: GetItem, PutItem, Query, Scan (optional, billed separately)
# CloudTrail config for DynamoDB auditing
CloudTrail Configuration:
  Trail name: "vinashoes-dynamodb-audit"
  
Event Type:
  ✅ Management events
  ⚠️ Data events (optional - incurs cost)
  
Storage Location:
  S3 bucket: "vinashoes-cloudtrail-logs"
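
A minimal sketch of the trail configuration above as CLI calls. It assumes the S3 bucket already exists and carries a bucket policy that lets CloudTrail write to it; data events are not enabled here:

```shell
# Sketch: create the audit trail and start logging.
# A new trail records management events by default.
create_audit_trail() {
  aws cloudtrail create-trail --name vinashoes-dynamodb-audit \
    --s3-bucket-name vinashoes-cloudtrail-logs --region ap-southeast-1
  aws cloudtrail start-logging --name vinashoes-dynamodb-audit \
    --region ap-southeast-1
}
```

Without the start-logging call the trail exists but delivers nothing, which is an easy step to miss when scripting what the console does in one click.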

4. CI/CD Pipeline Logging

4.1. CodeBuild CloudWatch Logs

CodeBuild sends build logs to CloudWatch automatically:

CodeBuild Log Groups (created automatically):
  - /aws/codebuild/vinashoes-backend-build
  
Log Stream Format:
  - [build-id]/[phase]
  
Retention: never expires by default; set 30 days explicitly

4.2. CodePipeline CloudWatch Events

CodePipeline emits execution events to CloudWatch Events (Amazon EventBridge):

Pipeline Events:
  - Pipeline execution started
  - Stage execution started/succeeded/failed
  - Action execution started/succeeded/failed
  
Event Targets:
  - CloudWatch Logs
  - SNS notifications (optional)
  - Lambda functions (optional)
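
As an illustration of routing these events, the sketch below forwards failed pipeline executions to an SNS topic. The rule name vinashoes-pipeline-failed and the target ID are hypothetical, and the topic's access policy is assumed to allow events.amazonaws.com to publish:

```shell
# Sketch: match failed CodePipeline executions and send them to SNS.
notify_on_pipeline_failure() {
  # $1: SNS topic ARN
  aws events put-rule --name vinashoes-pipeline-failed --region ap-southeast-1 \
    --event-pattern '{"source":["aws.codepipeline"],"detail-type":["CodePipeline Pipeline Execution State Change"],"detail":{"state":["FAILED"]}}'
  aws events put-targets --rule vinashoes-pipeline-failed --region ap-southeast-1 \
    --targets "Id=sns-notify,Arn=${1}"
}
```

Stage-level and action-level events use the detail types "CodePipeline Stage Execution State Change" and "CodePipeline Action Execution State Change", so the same pattern can be narrowed or widened as needed.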

5. CloudWatch Alarms

5.1. ECS Alarms

🚨 ECS Critical Alarms

Key monitoring points for ECS:

  • CPU Usage > 80%: Service overload
  • Memory Usage > 90%: Memory leak risk
  • Service Count = 0: Service down
  1. Create the ECS CPU alarm:
Alarm Configuration:
  Alarm name: "ECS-UserService-HighCPU"
  Description: "User service CPU usage > 80%"
  
Metric:
  Namespace: "AWS/ECS"
  MetricName: "CPUUtilization"
  Dimensions:
    ServiceName: "vinashoes-user-service"
    ClusterName: "vinashoes-cluster"
  
Threshold:
  Comparison: "GreaterThanThreshold"
  Threshold: 80
  Evaluation Periods: 2 out of 2
  Period: 300 seconds (5 minutes)
  
Actions:
  Alarm: Send SNS notification
  OK: Send SNS notification
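
The alarm configuration above corresponds roughly to the following CLI call. This is a sketch: the Average statistic is an assumption (the configuration does not name one), and the SNS topic ARN is supplied by the caller:

```shell
# Sketch: CPU alarm for the user service, 2 of 2 five-minute periods above 80%.
create_ecs_cpu_alarm() {
  # $1: SNS topic ARN used for both ALARM and OK notifications
  aws cloudwatch put-metric-alarm \
    --alarm-name ECS-UserService-HighCPU \
    --alarm-description "User service CPU usage > 80%" \
    --namespace AWS/ECS --metric-name CPUUtilization \
    --dimensions Name=ServiceName,Value=vinashoes-user-service Name=ClusterName,Value=vinashoes-cluster \
    --statistic Average --period 300 \
    --evaluation-periods 2 --datapoints-to-alarm 2 \
    --threshold 80 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$1" --ok-actions "$1" \
    --region ap-southeast-1
}
```

A memory alarm would follow the same shape with MemoryUtilization as the metric name and 90 as the threshold.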

5.2. API Gateway Alarms

  1. API Gateway 5XX Errors Alarm:
Alarm Configuration:
  Alarm name: "APIGateway-High5XXErrors"
  Description: "API Gateway 5XX errors > 10 in 5 minutes"
  
Metric:
  Namespace: "AWS/ApiGateway"
  MetricName: "5XXError"
  Dimensions:
    ApiName: "vinashoes-api"
    Stage: "prod"
  
Threshold:
  Comparison: "GreaterThanThreshold"
  Threshold: 10
  Statistic: Sum
  Period: 300 seconds

5.3. DynamoDB Alarms

  1. DynamoDB Throttling Alarm:
Alarm Configuration:
  Alarm name: "DynamoDB-UserThrottling"
  Description: "DynamoDB User table throttling detected"
  
Metric:
  Namespace: "AWS/DynamoDB"
  MetricName: "ThrottledRequests"
  Dimensions:
    TableName: "User"
    Operation: "Query"
  
Threshold:
  Comparison: "GreaterThanThreshold"
  Threshold: 0
  Period: 300 seconds
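
The API Gateway and DynamoDB alarms share one shape: a Sum over a 5-minute period compared against a threshold. A minimal sketch with a hypothetical helper, create_sum_alarm; the single evaluation period and the notBreaching treatment of missing data are assumptions, and no notification actions are attached:

```shell
# Sketch: generic Sum-based alarm over one 5-minute period.
create_sum_alarm() {
  # $1 alarm name, $2 namespace, $3 metric name, $4 threshold; rest: dimensions
  name="$1"; ns="$2"; metric="$3"; threshold="$4"
  shift 4
  aws cloudwatch put-metric-alarm \
    --alarm-name "$name" --namespace "$ns" --metric-name "$metric" \
    --dimensions "$@" \
    --statistic Sum --period 300 --evaluation-periods 1 \
    --threshold "$threshold" --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --region ap-southeast-1
}

# The two alarms from sections 5.2 and 5.3, expressed with the helper.
create_error_alarms() {
  create_sum_alarm APIGateway-High5XXErrors AWS/ApiGateway 5XXError 10 \
    Name=ApiName,Value=vinashoes-api Name=Stage,Value=prod
  create_sum_alarm DynamoDB-UserThrottling AWS/DynamoDB ThrottledRequests 0 \
    Name=TableName,Value=User Name=Operation,Value=Query
}
```

Treating missing data as not breaching keeps these alarms in OK when there is no traffic, since both metrics are only published when matching requests occur.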

6. CloudWatch Dashboard

6.1. Create a Comprehensive Dashboard

📊 Dashboard Organization

Organize the dashboard by service layer:

  • Row 1: Infrastructure (ECS CPU, Memory)
  • Row 2: API Gateway (Requests, Latency, Errors)
  • Row 3: Database (DynamoDB Metrics)
  • Row 4: CI/CD (Build Success Rate, Deploy Frequency)
  1. CloudWatch Console → Dashboards → Create dashboard:
Dashboard Configuration:
  Dashboard name: "VinaShoesProductionMonitoring"
  
Widgets Configuration:
  - ECS Services Health (Line chart)
  - API Gateway Request Rate (Number widget)
  - DynamoDB Consumed Capacity (Stacked area)
  - CI/CD Pipeline Status (Number widget)

6.2. Dashboard Widget Examples

ECS Metrics Widget:

{
  "type": "metric",
  "width": 12,
  "height": 6,
  "properties": {
    "metrics": [
      ["AWS/ECS", "CPUUtilization", "ServiceName", "vinashoes-user-service", "ClusterName", "vinashoes-cluster"],
      ["...", "vinashoes-product-service", ".", "."],
      ["...", "vinashoes-order-service", ".", "."]
    ],
    "period": 300,
    "stat": "Average",
    "region": "ap-southeast-1",
    "title": "ECS Services CPU Utilization"
  }
}

API Gateway Metrics Widget:

{
  "type": "metric",
  "width": 12,
  "height": 6,
  "properties": {
    "metrics": [
      ["AWS/ApiGateway", "Count", "ApiName", "vinashoes-api", "Stage", "prod"],
      [".", "Latency", ".", ".", ".", ".", {"stat": "Average"}],
      [".", "4XXError", ".", ".", ".", "."],
      [".", "5XXError", ".", ".", ".", "."]
    ],
    "period": 300,
    "stat": "Sum",
    "region": "ap-southeast-1",
    "title": "API Gateway Metrics"
  }
}
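
Widgets like these can be published without the console. A minimal sketch, assuming the widget objects are wrapped as {"widgets": [ ... ]} and saved in a local file named dashboard.json (a hypothetical file name):

```shell
# Sketch: create or overwrite the dashboard from a local JSON body.
publish_dashboard() {
  # $1: path to a file containing {"widgets": [ ...widget objects... ]}
  aws cloudwatch put-dashboard \
    --dashboard-name VinaShoesProductionMonitoring \
    --dashboard-body "file://${1:-dashboard.json}" \
    --region ap-southeast-1
}
```

put-dashboard replaces the whole dashboard body on each call, so the file should always contain every widget that is meant to stay.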

7. Task 12 Complete!

📋 Checklist Deliverables

Component | Status | Details
✅ ECS Logs | ACTIVE | Container logs from all microservices
✅ API Gateway Logs | ACTIVE | Access logs + execution logs
✅ DynamoDB Metrics | ACTIVE | Automatic performance metrics
✅ CI/CD Logs | ACTIVE | CodeBuild + CodePipeline logs
✅ CloudWatch Alarms | CONFIGURED | CPU, Memory, 5XX, Throttling alarms
✅ Dashboard | LIVE | Realtime monitoring view

🔍 Monitoring Coverage

🎉 Complete Monitoring Setup!

Log Sources:

  • ECS Fargate: NestJS application logs
  • API Gateway: Request/response logs
  • DynamoDB: Performance metrics
  • CI/CD: Build and deployment logs

Alerting:

  • Performance: CPU > 80%, Memory > 90%
  • Errors: 5XX errors, DynamoDB throttling
  • Availability: Service count = 0

Visibility:

  • Real-time dashboard: Infrastructure + application metrics
  • Historical data: 7-30 days retention
  • Audit trail: CloudTrail integration

🚨 Alarm Thresholds Summary

Critical Alarms:
  ECS CPU Usage: > 80% for 5 minutes
  ECS Memory Usage: > 90% for 5 minutes
  API Gateway 5XX: > 10 errors in 5 minutes
  DynamoDB Throttling: > 0 throttled requests
  
Warning Alarms:
  ECS CPU Usage: > 60% for 10 minutes
  API Gateway Latency: > 2000ms average
  DynamoDB Consumed Capacity: > 80% of provisioned

💡 Monitoring Best Practices

🎯 Production Monitoring Tips

Log Management:

  1. Retention policy: 7 days for dev, 30 days for prod
  2. Log levels: INFO for prod, DEBUG for troubleshooting
  3. Structured logging: JSON format for better parsing

Alarm Strategy:

  1. Escalation: Warning → Critical → Page on-call
  2. Noise reduction: Avoid alarm fatigue by choosing sensible thresholds
  3. Recovery alarms: Alert when systems recover

Dashboard Design:

  1. Executive view: High-level KPIs and health status
  2. Technical view: Detailed metrics for troubleshooting
  3. Mobile-friendly: Dashboard viewable on mobile devices

Cost Optimization:

  1. Log retention: Auto-delete old logs
  2. Metric filters: Only track essential metrics
  3. Log class: Consider the Infrequent Access log class for high-volume logs that are rarely queried

🔧 Troubleshooting Common Issues

Logging Issues

Problem: ECS logs do not appear

# Check which managed policies are attached to the ECS task execution role
# (it needs logs:CreateLogStream and logs:PutLogEvents)
aws iam list-attached-role-policies \
  --role-name ecsTaskExecutionRole
  
# Verify that the log groups exist
aws logs describe-log-groups \
  --log-group-name-prefix "/ecs/vinashoes"

Problem: API Gateway logs missing

  • Check CloudWatchLogsRole ARN in API Gateway settings
  • Verify stage-level logging configuration
  • Check IAM permissions for writing logs

Alarm Issues

Problem: Alarms not triggering

  • Verify that metric names and dimensions are correct
  • Check that alarm threshold values are reasonable
  • Verify SNS topic permissions

Performance Optimization

Log Volume Management:

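# Monitor log ingestion volume (IncomingBytes, summed per day; adjust the dates)
aws cloudwatch get-metric-statistics \
  --namespace "AWS/Logs" \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value="/ecs/vinashoes-user-service" \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-08T00:00:00Z \
  --period 86400 \
  --statistics Sum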

# Set up log retention
aws logs put-retention-policy \
  --log-group-name "/ecs/vinashoes-user-service" \
  --retention-in-days 7

Next Task: Task 13 - Security & Compliance monitoring with AWS Config and CloudTrail 🚀


8. Resource Cleanup

8.1. Delete CloudWatch Log Groups

Delete the log groups for all services:

# Delete ECS log groups
aws logs delete-log-group --log-group-name "/ecs/vinashoes-user-service"
aws logs delete-log-group --log-group-name "/ecs/vinashoes-product-service"
aws logs delete-log-group --log-group-name "/ecs/vinashoes-order-service"
aws logs delete-log-group --log-group-name "/ecs/vinashoes-cart-service"
aws logs delete-log-group --log-group-name "/ecs/vinashoes-payment-service"

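# Delete API Gateway log groups
# Execution logs are named after the REST API ID, not the API name
aws logs delete-log-group --log-group-name "API-Gateway-Execution-Logs_YOUR_API_ID/prod"
# Access logs live in whichever log group was configured on the stage
aws logs delete-log-group --log-group-name "API-Gateway-Access-Logs_vinashoes-api/prod"

# Delete CodeBuild log groups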
aws logs delete-log-group --log-group-name "/aws/codebuild/vinashoes-backend-build"

8.2. Delete CloudWatch Alarms

Delete all monitoring alarms:

# Delete ECS alarms
aws cloudwatch delete-alarms --alarm-names \
  "ECS-UserService-HighCPU" \
  "ECS-UserService-HighMemory" \
  "ECS-ServiceCount-Zero"

# Delete API Gateway alarms
aws cloudwatch delete-alarms --alarm-names \
  "APIGateway-High5XXErrors" \
  "APIGateway-HighLatency"

# Delete DynamoDB alarms
aws cloudwatch delete-alarms --alarm-names \
  "DynamoDB-UserThrottling" \
  "DynamoDB-HighConsumedCapacity"

8.3. Delete the CloudWatch Dashboard

Delete the monitoring dashboard:

aws cloudwatch delete-dashboards --dashboard-names "VinaShoesProductionMonitoring"

8.4. Disable API Gateway Logging

Disable CloudWatch logging for API Gateway:

# Turn off logging and detailed metrics for the prod stage
aws apigateway update-stage \
  --rest-api-id YOUR_API_ID \
  --stage-name prod \
  --patch-operations \
    'op=replace,path=/*/*/logging/loglevel,value=OFF' \
    'op=replace,path=/*/*/metrics/enabled,value=false'

8.5. Delete IAM Roles

Remove CloudWatch permissions from the ECS task roles:

# Detach the CloudWatch policy from the ECS task execution role
aws iam detach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess

# Delete the API Gateway CloudWatch role
# (assumes the AWS managed policy below was attached; an inline policy
# would instead be removed with "aws iam delete-role-policy")
aws iam detach-role-policy \
  --role-name APIGatewayCloudWatchLogsRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs

aws iam delete-role --role-name APIGatewayCloudWatchLogsRole

⚠️ CloudWatch Cleanup Order:

  1. Delete alarms and the dashboard
  2. Disable API Gateway logging (otherwise new requests can recreate the execution log group)
  3. Delete log groups (all logs are lost)
  4. Remove IAM permissions

9. Cost Analysis

9.1. CloudWatch Pricing Overview

CloudWatch pricing structure:

Service Component | Free Tier | Paid Rate | Estimated Cost
Logs Ingestion | 5GB/month | $0.50/GB | $10-50/month
Logs Storage | - | $0.03/GB/month | $3-10/month
Metrics | 10 metrics | $0.30/metric/month | $5-15/month
Alarms | - | $0.10/alarm/month | $3-10/month
Dashboard | 3 dashboards | $3/dashboard/month | $3/month
API Requests | 1M requests | $0.01/1K requests | $1-5/month

9.2. Monthly Cost Breakdown

Estimated cost for an e-commerce platform:

Base CloudWatch Costs:
  Logs Ingestion: $25/month (50GB logs)
  Logs Storage: $5/month (150GB stored)
  Custom Metrics: $10/month (30 metrics)
  Alarms: $5/month (50 alarms)
  Dashboard: $3/month (1 dashboard)
  
Monitoring & Alerting:
  API Requests: $2/month (200K requests)
  Cross-region: $1/month (minimal)
  
Total Monthly Cost: $51/month
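
The arithmetic behind this estimate can be reproduced with a small helper. This is a sketch: estimate_cloudwatch_cost is a hypothetical function, the unit prices are the paid rates from the table in 9.1, and free-tier allowances and the flat cross-region charge are deliberately ignored:

```shell
# Sketch: multiply usage by the unit prices listed in section 9.1.
estimate_cloudwatch_cost() {
  # $1 ingested GB, $2 stored GB, $3 custom metrics, $4 alarms,
  # $5 paid dashboards, $6 API requests in thousands
  awk -v gb_in="$1" -v gb_st="$2" -v m="$3" -v a="$4" -v d="$5" -v r="$6" 'BEGIN {
    total = gb_in*0.50 + gb_st*0.03 + m*0.30 + a*0.10 + d*3 + r*0.01
    printf "%.2f\n", total
  }'
}
```

With the quantities above, `estimate_cloudwatch_cost 50 150 30 50 1 200` prints 48.50. Adding the $1 cross-region charge gives $49.50; the $51 total comes from rounding storage ($4.50 to $5) and metrics ($9 to $10) up per line item.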

9.3. Cost Optimization Strategy

Reducing CloudWatch costs:

Optimization Tactics:
  1. Log Retention:
     - ECS logs: 7-day retention
     - API Gateway: 30 days for prod
     - Archive old logs to S3 Glacier
     
  2. Sampling & Filtering:
     - Enable log sampling for high-volume services
     - Use metric filters instead of storing all logs
     
  3. Alarm Optimization:
     - Combine related alarms
     - Use composite alarms to reduce the alarm count
     
  4. Dashboard Efficiency:
     - Use a single dashboard with multiple widgets
     - Remove unused metrics from the dashboard

9.4. ROI Analysis

Monitoring Benefits vs. Cost:

Benefit Type | Value | Cost Impact
MTTR Reduction | 70% less time to fix issues | $50K+ per outage
Performance Optimization | 30% better response time | $20K+ per second of slowdown
Proactive Monitoring | Detect issues before users are impacted | $100K+ downtime prevention
Operational Efficiency | Automated alerts reduce manual monitoring | 10 hours/week saved
Compliance | Audit trails for security compliance | Priceless

ROI Calculation:

  • Annual Cost: $612 ($51/month × 12)
  • Annual Benefit: $500K+ (outage prevention + efficiency)
  • ROI: about 81,700% (benefit ÷ cost ≈ 817×)

9.5. Cost Monitoring

Tracking CloudWatch spend:

# Check CloudWatch costs (the End date is exclusive)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["AmazonCloudWatch"]
    }
  }'

# Monitor log volume
aws logs describe-log-groups \
  --query 'logGroups[*].{logGroupName:logGroupName,storedBytes:storedBytes}' \
  --output table

# Check metrics usage
aws cloudwatch list-metrics \
  --namespace "AWS/ECS" \
  --query 'Metrics[*].{MetricName:MetricName,Dimensions:Dimensions}'

💡 Cost Management Best Practices

Log Management:

  • Set retention policies based on compliance requirements
  • Use log archiving for long-term storage
  • Implement log rotation strategies

Cost Monitoring:

  • Set billing alerts at a $100/month threshold
  • Monitor log ingestion rates weekly
  • Review unused log groups monthly

Optimization:

  • Use CloudWatch Logs Insights for ad-hoc queries instead of storing everything
  • Implement log sampling for high-volume applications
  • Leverage AWS Organizations for consolidated billing

Scaling Considerations:

  • High-traffic: Log costs scale with request volume
  • Multi-region: Cross-region logs add data transfer costs
  • Multi-account: Centralized monitoring can increase costs

🚀 Production-Ready AWS Microservices Platform with Complete Observability! 🚀