Best Practices Guide

Backup Strategy

The 3-2-1 Rule

Follow the industry-standard 3-2-1 backup rule adapted for infrastructure:

3 copies of your state files
2 different storage media types
1 off-site backup location

Implementation Example

# Primary: S3 with versioning
terraback config set storage.type s3
terraback config set storage.s3.versioning true

# Secondary: Local encrypted backup
terraback backup export --all --encrypted /mnt/backup/

# Tertiary: Git repository
terraback git init --remote https://github.com/company/terraform-states

Backup Frequency Guidelines

Environment	Recommended Frequency	Retention Period	Justification
Production	Every hour	90 days	Critical infrastructure needs frequent snapshots
Staging	Every 6 hours	30 days	Balance between protection and storage
Development	Daily	7 days	Lower change frequency, less critical
CI/CD	On every apply	30 days	Track all automated changes

Automated Backup Configuration

# Production environment
terraback schedule create prod-hourly \
  --interval 1h \
  --retention 90d \
  --tags "production,automated" \
  --workspace production

# Staging environment
terraback schedule create staging-6h \
  --interval 6h \
  --retention 30d \
  --tags "staging,automated" \
  --workspace staging

# Development environment
terraback schedule create dev-daily \
  --interval 24h \
  --retention 7d \
  --tags "development,automated" \
  --workspace development

Pre and Post Operation Backups

# .terraback/config.yml
backup:
  automatic:
    on_apply: true       # Backup before terraform apply
    on_destroy: true     # Backup before terraform destroy
    on_import: true      # Backup before terraform import
    on_state_rm: true    # Backup before state manipulation

  hooks:
    pre_apply: "terraback checkpoint create --name pre-apply-$(date +%Y%m%d-%H%M%S)"
    post_apply: "terraback backup create --name post-apply-$(date +%Y%m%d-%H%M%S)"

Security Best Practices

Encryption Standards

Always Enable Encryption

# Encryption at rest
storage:
  encryption:
    enabled: true
    algorithm: "AES256-GCM"
    key_management: "aws-kms"  # Use cloud KMS when possible

# Encryption in transit
network:
  tls:
    enabled: true
    min_version: "1.2"
    verify_certificates: true

Key Management

# Use cloud provider KMS
terraback config set storage.s3.kms_key_id "arn:aws:kms:us-east-1:123456789012:key/..."

# Rotate encryption keys regularly
terraback security rotate-keys --schedule monthly

# Never store keys in configuration files
export TERRABACK_ENCRYPTION_KEY=$(vault read -field=key secret/terraback)

Access Control

Principle of Least Privilege

# Role-based access control
security:
  rbac:
    - role: "infrastructure-admin"
      permissions: ["all"]
      users: ["alice@company.com"]

    - role: "developer"
      permissions: ["backup:create", "backup:list", "restore:preview"]
      users: ["*@dev.company.com"]

    - role: "auditor"
      permissions: ["read"]
      users: ["audit-team@company.com"]

Multi-Factor Authentication

# Enable MFA requirement
terraback security mfa enable

# Configure TOTP
terraback security mfa setup --type totp

# Enforce for production operations
terraback config set security.production.require_mfa true

Audit Logging

# Comprehensive audit configuration
audit:
  enabled: true
  destinations:
    - type: "file"
      path: "/var/log/terraback/audit.log"
      format: "json"
      rotation: "daily"

    - type: "siem"
      endpoint: "https://siem.company.com"
      api_key: "${SIEM_API_KEY}"

  events:
    - "backup.*"
    - "restore.*"
    - "config.change"
    - "auth.*"
    - "drift.detected"

  compliance:
    standards: ["SOC2", "ISO27001", "HIPAA"]
    retention: "7 years"

Drift Detection

Continuous Monitoring Setup

# Enable drift detection with optimal settings
terraback drift enable \
  --interval 15m \
  --alert-threshold medium \
  --auto-remediate safe

# Configure intelligent drift detection
terraback drift config set detection.smart_mode true
terraback drift config set detection.ignore_tags true
terraback drift config set detection.business_hours "08:00-18:00"

Drift Response Playbook

# Drift response automation
drift:
  response:
    low_severity:
      action: "log"
      notification: "email_summary"

    medium_severity:
      action: "alert"
      notification: "slack"
      require_acknowledgment: true

    high_severity:
      action: "page"
      notification: ["slack", "pagerduty", "email"]
      auto_create_ticket: true

    critical_severity:
      action: "block_changes"
      notification: ["all"]
      escalation: "immediate"

Handling Drift

# 1. Investigate drift
terraback drift show --detailed

# 2. Generate report for review
terraback drift report --format html > drift-report.html

# 3. Decision tree:
if [[ $(terraback drift severity) == "critical" ]]; then
    # Create incident
    terraback incident create --severity critical
    # Restore immediately
    terraback restore last-known-good --emergency
else
    # Review and plan remediation
    terraback drift fix --dry-run
    # Apply fixes after review
    terraback drift fix --approved-by security-team
fi

Performance Optimization

Large State Files

Optimization Strategies

# For state files > 100MB
performance:
  large_state_optimization:
    enabled: true
    chunking:
      enabled: true
      size: "10MB"
    compression:
      algorithm: "zstd"  # Better for large files
      level: 3  # Balance speed/compression
    parallel_processing:
      enabled: true
      workers: 8

State Splitting

# Split large monolithic state
terraback state split \
  --strategy "resource-type" \
  --output-dir ./split-states/

# Or split by module
terraback state split \
  --strategy "module" \
  --preserve-dependencies

Network Optimization

network:
  optimization:
    # Connection pooling
    connection_pool:
      size: 20
      timeout: "30s"
      keepalive: true

    # Bandwidth management
    bandwidth:
      upload_limit: "100Mbps"
      download_limit: "200Mbps"
      burst_allowance: 1.5

    # Regional optimization
    edge_locations:
      enabled: true
      prefer_closest: true

Storage Optimization

# Enable intelligent tiering
terraback storage optimize --enable-tiering

# Configure lifecycle policies
terraback storage lifecycle create \
  --move-to-cool-after 30d \
  --move-to-archive-after 90d \
  --delete-after 365d

# Deduplicate backups
terraback storage dedupe --aggressive

Disaster Recovery

RTO and RPO Targets

Scenario	RTO Target	RPO Target	Strategy
Critical Production	< 5 min	< 1 hour	Hot standby with continuous backup
Standard Production	< 30 min	< 4 hours	Automated recovery with recent backup
Non-Production	< 2 hours	< 24 hours	Manual recovery with daily backup

Recovery Testing

# Schedule regular recovery drills
terraback dr schedule-drill \
  --frequency monthly \
  --environments "staging,dr-test" \
  --notification "dr-team@company.com"

# Automated recovery testing
terraback dr test \
  --scenario "complete-failure" \
  --target-rto "30m" \
  --report-output ./dr-test-results/

Emergency Recovery Procedures

#!/bin/bash
# emergency-recovery.sh

# 1. Assess the situation
terraback status --emergency-mode

# 2. Find last known good state
LAST_GOOD=$(terraback backup list \
  --before-incident \
  --verified \
  --limit 1)

# 3. Create emergency checkpoint
terraback checkpoint create \
  --name "emergency-$(date +%s)" \
  --priority critical

# 4. Perform recovery
terraback restore apply $LAST_GOOD \
  --emergency \
  --skip-validation \
  --notify-stakeholders

# 5. Verify recovery
terraback verify --comprehensive

# 6. Document incident
terraback incident document \
  --template post-mortem \
  --attach-logs

CI/CD Integration

Pipeline Best Practices

GitHub Actions Example

# .github/workflows/terraform-backup.yml
name: Terraform State Management

on:
  push:
    branches: [main, production]
  pull_request:
    types: [opened, synchronized]
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraback
        run: |
          curl -sSL https://terraback.io/install.sh | bash
          terraback version

      - name: Configure Terraback
        env:
          TERRABACK_API_KEY: ${{ secrets.TERRABACK_API_KEY }}
        run: |
          terraback config import .terraback/ci-config.yml

      - name: Pre-change Backup
        if: github.event_name == 'pull_request'
        run: |
          terraback checkpoint create \
            --name "pr-${{ github.event.pull_request.number }}" \
            --tags "ci,pr,automated"

      - name: Check Drift
        run: |
          terraback drift check --fail-on-drift || {
            echo "Drift detected, creating issue"
            terraback github issue create \
              --title "Drift detected in ${{ github.ref }}" \
              --labels "drift,automated"
          }

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan

      - name: Backup After Plan
        if: steps.plan.outcome == 'success'
        run: |
          terraback backup create \
            --name "post-plan-${{ github.sha }}" \
            --attach-plan tfplan

      - name: Apply Changes
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: |
          terraform apply -auto-approve tfplan
          terraback backup create \
            --name "post-apply-${{ github.sha }}" \
            --tags "production,applied"

GitOps Workflow

# terraback-gitops.yml
gitops:
  enabled: true
  repository: "github.com/company/infrastructure"
  branch: "main"

  sync:
    interval: "5m"
    auto_commit: true
    auto_push: true

  state_tracking:
    path: "states/"
    format: "encrypted-json"

  approval:
    required_reviewers: 2
    approved_by: ["sre-team", "security-team"]

  rollback:
    auto_rollback_on_failure: true
    keep_failed_attempts: true

Team Collaboration

Shared Configuration

# team-config.yml
team:
  name: "infrastructure"

  shared_settings:
    backup_prefix: "team-infra-"
    default_tags:
      team: "infrastructure"
      managed_by: "terraback"
      cost_center: "engineering"

  permissions:
    default: "read"
    leads: ["alice@company.com", "bob@company.com"]

  notifications:
    channels:
      slack: "#infrastructure"
      email: "infra-team@company.com"

  workspace_ownership:
    production: ["sre-team"]
    staging: ["dev-team", "sre-team"]
    development: ["all"]

Conflict Prevention

# Enable state locking
terraback config set state.locking.enabled true
terraback config set state.locking.timeout "5m"

# Configure operation queueing
terraback config set operations.queue.enabled true
terraback config set operations.queue.max_wait "10m"

# Set up mutex for critical operations
terraback mutex acquire --resource "production-apply"
terraform apply
terraback mutex release --resource "production-apply"

Monitoring and Alerting

Comprehensive Monitoring Setup

monitoring:
  health_checks:
    - name: "backup_freshness"
      interval: "1h"
      condition: "last_backup < 2h"
      alert_if_fails: 2

    - name: "storage_capacity"
      interval: "6h"
      condition: "usage < 80%"
      alert_if_fails: 1

    - name: "drift_detection"
      interval: "15m"
      condition: "drift_count == 0"
      alert_if_fails: 3

  sla_monitoring:
    backup_success_rate: 99.9
    recovery_success_rate: 99.5
    api_availability: 99.95

  dashboards:
    - type: "grafana"
      url: "https://grafana.company.com"
      dashboards:
        - "terraback-overview"
        - "backup-metrics"
        - "drift-analysis"

Alert Fatigue Prevention

alerting:
  smart_alerts:
    enabled: true

    deduplication:
      window: "5m"
      key: ["resource", "severity"]

    suppression:
      maintenance_window: "Sat 02:00-06:00"
      repeated_alerts: "exponential_backoff"

    correlation:
      group_related: true
      time_window: "10m"

    escalation:
      levels:
        - delay: "0m"
          notify: ["on-call"]
        - delay: "15m"
          notify: ["team-lead"]
        - delay: "30m"
          notify: ["manager"]

Cost Optimization

Storage Cost Management

# Analyze storage costs
terraback cost analyze --breakdown-by age,size,type

# Implement cost-saving policies
terraback cost optimize \
  --enable-compression \
  --enable-deduplication \
  --lifecycle-transitions \
  --remove-redundant

# Set cost alerts
terraback cost alert create \
  --threshold "$100/month" \
  --notify "finance@company.com"

Intelligent Retention

retention:
  intelligent:
    enabled: true

    rules:
      - name: "production_critical"
        pattern: "*production*critical*"
        keep_forever: true

      - name: "releases"
        pattern: "*release*"
        keep_days: 365

      - name: "daily_backups"
        pattern: "*daily*"
        keep_days: 30
        thin_after: 7  # Keep only one per day after 7 days

      - name: "development"
        pattern: "*dev*"
        keep_days: 7

Compliance and Governance

Regulatory Compliance

compliance:
  frameworks:
    - name: "SOC2"
      controls:
        - encryption_at_rest: true
        - encryption_in_transit: true
        - audit_logging: true
        - access_control: true
        - retention_policy: "7 years"

    - name: "HIPAA"
      controls:
        - phi_encryption: "AES256"
        - access_logs: true
        - backup_encryption: true
        - disaster_recovery: true

    - name: "GDPR"
      controls:
        - data_location: ["EU"]
        - right_to_deletion: true
        - data_portability: true
        - breach_notification: "72 hours"

Governance Policies

# Enforce governance policies
terraback governance enable

# Define policies
terraback governance policy create \
  --name "backup-requirements" \
  --rule "all production workspaces must have hourly backups" \
  --enforcement "block"

# Audit compliance
terraback governance audit \
  --report-format "pdf" \
  --output "./compliance-report.pdf"

Troubleshooting Guidelines

Diagnostic Runbook

# Step 1: Check system health
terraback doctor --comprehensive

# Step 2: Verify connectivity
terraback test connectivity --all

# Step 3: Validate configuration
terraback config validate --strict

# Step 4: Check permissions
terraback permissions verify

# Step 5: Review recent operations
terraback logs tail --lines 100 --level error

# Step 6: Generate support bundle
terraback support bundle \
  --include-logs \
  --include-config \
  --output support-$(date +%Y%m%d).tar.gz

Maintenance and Updates

Regular Maintenance Tasks

#!/bin/bash
# maintenance.sh - Run weekly

# Update Terraback
terraback update check
terraback update apply --auto-approve

# Cleanup old backups
terraback backup prune \
  --older-than 90d \
  --keep-tags "release,critical" \
  --dry-run

# Verify backup integrity
terraback verify --all --last 7d

# Optimize storage
terraback storage optimize --defragment

# Update documentation
terraback docs generate --format markdown

# Health report
terraback report generate --type weekly \
  --email "team@company.com"

Version Management

version_policy:
  auto_update:
    enabled: true
    channel: "stable"  # stable, beta, edge
    schedule: "Sunday 02:00"

  compatibility:
    terraform_versions:
      minimum: "0.12.0"
      maximum: "1.6.x"
      warn_on_unsupported: true

  rollback:
    keep_previous: 3
    auto_rollback_on_error: true

Backup Strategy​

The 3-2-1 Rule​

Implementation Example​

Backup Frequency Guidelines​

Automated Backup Configuration​

Pre and Post Operation Backups​

Security Best Practices​

Encryption Standards​

Always Enable Encryption​

Key Management​

Access Control​

Principle of Least Privilege​

Multi-Factor Authentication​

Audit Logging​

Drift Detection​

Continuous Monitoring Setup​

Drift Response Playbook​

Handling Drift​

Performance Optimization​

Large State Files​

Optimization Strategies​

State Splitting​

Network Optimization​

Storage Optimization​

Disaster Recovery​

RTO and RPO Targets​

Recovery Testing​

Emergency Recovery Procedures​

CI/CD Integration​

Pipeline Best Practices​

GitHub Actions Example​

GitOps Workflow​

Team Collaboration​

Shared Configuration​

Conflict Prevention​

Monitoring and Alerting​

Comprehensive Monitoring Setup​

Alert Fatigue Prevention​

Cost Optimization​

Storage Cost Management​

Intelligent Retention​

Compliance and Governance​

Regulatory Compliance​

Governance Policies​

Troubleshooting Guidelines​

Diagnostic Runbook​

Maintenance and Updates​

Regular Maintenance Tasks​

Version Management​

Backup Strategy

The 3-2-1 Rule

Implementation Example

Backup Frequency Guidelines

Automated Backup Configuration

Pre and Post Operation Backups

Security Best Practices

Encryption Standards

Always Enable Encryption

Key Management

Access Control

Principle of Least Privilege

Multi-Factor Authentication

Audit Logging

Drift Detection

Continuous Monitoring Setup

Drift Response Playbook

Handling Drift

Performance Optimization

Large State Files

Optimization Strategies

State Splitting

Network Optimization

Storage Optimization

Disaster Recovery

RTO and RPO Targets

Recovery Testing

Emergency Recovery Procedures

CI/CD Integration

Pipeline Best Practices

GitHub Actions Example

GitOps Workflow

Team Collaboration

Shared Configuration

Conflict Prevention

Monitoring and Alerting

Comprehensive Monitoring Setup

Alert Fatigue Prevention

Cost Optimization

Storage Cost Management

Intelligent Retention

Compliance and Governance

Regulatory Compliance

Governance Policies

Troubleshooting Guidelines

Diagnostic Runbook

Maintenance and Updates

Regular Maintenance Tasks

Version Management