Feature Toggles in Infrastructure as Code

Feature toggles (also called feature flags) let teams change infrastructure behavior without changing code. They are particularly valuable when managing Infrastructure as Code (IaC) with tools like OpenTofu and Terraform, both of which provide language features such as variables, conditional expressions, and count that make toggles straightforward to implement.

Note: This article covers patterns that work with both OpenTofu and Terraform, highlighting differences where they exist.

[Diagram: Infrastructure Evolution]

A Toggling Tale

Here's a common scenario: You're on a platform team at a mid-sized fintech company that has been tasked with creating a comprehensive GitHub landing zone—a standardized, secure foundation for all your organization's repositories. Your users, various product development teams at the company, are particularly excited about one feature: an intelligent Pull Request Reminder system that will automatically notify the right reviewers at the right time, escalate stale PRs, and even integrate with your organization's calendars to find the perfect review windows.

Sarah, your team lead, enthusiastically dubs this the "PR Reminder" feature. It sounds simple enough at first, but as the team digs in, they realize it involves complex logic including timezone calculations, team availability patterns, and integration with multiple external systems. The feature requires Infrastructure as Code configurations that interact with the GitHub provider, AWS Lambda functions for the notification logic, and EventBridge rules for scheduling.

The challenge here isn't just technical—it's organizational. Your team needs to deliver value quickly to the product teams while managing the risk of an experimental feature. This tension between speed and safety is where feature toggles become invaluable.

The Initial Implementation

As an Infrastructure as Code developer on the team, you start with a straightforward approach. You branch off from main and begin defining the PR Reminder infrastructure in the codebase:

# Works in both Terraform and OpenTofu
resource "github_repository" "team_repo" {
  name = "payment-service"
  description = "Payment processing microservice"

  visibility = "private"  # Fintech repos should be private for security

  template {
    owner                = "github"
    repository           = "terraform-template-module"
    include_all_branches = true
  }

  pages {
    source {
      branch = "master"
      path   = "/docs"
    }
  }

}

# PR Reminder feature - still working on this
# Nearly there...
# Note: Define the Lambda function URL resource
# resource "aws_lambda_function_url" "pr_reminder" {
#   function_name      = aws_lambda_function.pr_reminder.function_name
#   authorization_type = "AWS_IAM"
# }

resource "github_repository_webhook" "pr_reminder" {
  repository = github_repository.team_repo.name

  configuration {
    url          = aws_lambda_function_url.pr_reminder.function_url  # Assumes URL resource is defined
    content_type = "json"
    insecure_ssl = false
  }

  events = ["pull_request", "pull_request_review"]
}

# Note: Define the IAM role with appropriate Lambda execution permissions
# resource "aws_iam_role" "pr_reminder" {
#   name = "pr-reminder-lambda-role"
#   assume_role_policy = jsonencode({
#     Version = "2012-10-17"
#     Statement = [{
#       Action = "sts:AssumeRole"
#       Principal = { Service = "lambda.amazonaws.com" }
#       Effect = "Allow"
#     }]
#   })
# }

# Data sources for secure credential retrieval
# data "aws_secretsmanager_secret_version" "github_token" {
#   secret_id = "github-token"
# }
# data "aws_secretsmanager_secret_version" "slack_webhook" {
#   secret_id = "slack-webhook-url" 
# }

resource "aws_lambda_function" "pr_reminder" {
  filename      = "pr_reminder.zip"
  function_name = "pr-reminder-${github_repository.team_repo.name}"
  role          = aws_iam_role.pr_reminder.arn  # Assumes IAM role is defined
  handler       = "index.handler"
  runtime       = "nodejs18.x"

  # Note: Consider adding KMS encryption for environment variables
  # kms_key_arn = data.aws_kms_key.lambda.arn

  environment {
    variables = {
      # Use AWS Secrets Manager for sensitive values
      GITHUB_TOKEN = data.aws_secretsmanager_secret_version.github_token.secret_string
      SLACK_WEBHOOK = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
      # Complex configuration for reminder logic
      REMINDER_INTERVALS = "2h,6h,24h,48h"
      ESCALATION_THRESHOLD = "72h"
    }
  }
}

After a few weeks of development, the new feature is partially working but far from complete. The timezone logic is buggy, the EventBridge rules are not firing consistently, and the integration with the corporate calendar hasn't even been started. Meanwhile, the product teams are urgently asking for even a basic GitHub landing zone to be deployed.

Enter Feature Toggles

Sarah, your team lead, realizes the team needs a way to deliver the stable parts of the landing zone while keeping the experimental PR Reminder feature hidden. She introduces the idea of a feature toggle: a boolean variable that creates a particular resource when true and skips it when false.

This approach solves the immediate problem: the team can deploy their infrastructure code to production with the PR Reminder feature safely hidden behind a toggle. Product teams get their landing zone immediately, while development continues on the complex reminder system. Here's how the implementation looks:

variable "enable_pr_reminder" {
  description = "Enable the experimental PR Reminder feature"
  type        = bool
  default     = false
}

# Note: Define the Lambda function URL resource
# resource "aws_lambda_function_url" "pr_reminder" {
#   count              = var.enable_pr_reminder ? 1 : 0
#   function_name      = aws_lambda_function.pr_reminder[0].function_name
#   authorization_type = "AWS_IAM"
# }

resource "github_repository_webhook" "pr_reminder" {
  count      = var.enable_pr_reminder ? 1 : 0
  repository = github_repository.team_repo.name

  configuration {
    url          = aws_lambda_function_url.pr_reminder[0].function_url  # Assumes URL resource is defined
    content_type = "json"
    insecure_ssl = false
  }

  events = ["pull_request", "pull_request_review"]
}

# Note: Define the IAM role with appropriate Lambda execution permissions
# resource "aws_iam_role" "pr_reminder" {
#   count = var.enable_pr_reminder ? 1 : 0
#   name  = "pr-reminder-lambda-role-${count.index}"
#   assume_role_policy = jsonencode({
#     Version = "2012-10-17"
#     Statement = [{
#       Action = "sts:AssumeRole"
#       Principal = { Service = "lambda.amazonaws.com" }
#       Effect = "Allow"
#     }]
#   })
# }

resource "aws_lambda_function" "pr_reminder" {
  count         = var.enable_pr_reminder ? 1 : 0
  filename      = "pr_reminder.zip"
  function_name = "pr-reminder-${github_repository.team_repo.name}"
  role          = aws_iam_role.pr_reminder[count.index].arn  # Assumes IAM role is defined with same count
  handler       = "index.handler"
  runtime       = "nodejs18.x"

  # KMS encryption for Lambda environment variables containing sensitive data
  kms_key_arn = data.aws_kms_key.lambda.arn

  environment {
    variables = {
      # Use AWS Secrets Manager for sensitive values
      GITHUB_TOKEN = data.aws_secretsmanager_secret_version.github_token.secret_string
      SLACK_WEBHOOK = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
      REMINDER_INTERVALS = "2h,6h,24h,48h"
      ESCALATION_THRESHOLD = "72h"
    }
  }
}

While simple in its approach, combining this boolean variable with the count meta-argument (available in both Terraform and OpenTofu) lets the team release stable landing zone features to production without fear of the fragile PR Reminder feature failing at a critical time. Additionally, developers only need to set a single variable to true to turn the feature back on in their development environment, with no need to duplicate codebases.
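
One practical consequence of count-based toggles is that the toggled resource becomes a list, so anything that reads from it has to tolerate the list being empty. A minimal sketch of how that might look, assuming a hypothetical output name (pr_reminder_function_arn) and a hypothetical dev.tfvars file:

```hcl
# outputs.tf
# one() returns the single element of the list, or null when the list is empty,
# so this output stays valid whether the toggle is on or off.
output "pr_reminder_function_arn" {
  description = "ARN of the PR Reminder Lambda, or null when the feature is disabled"
  value       = one(aws_lambda_function.pr_reminder[*].arn)
}

# dev.tfvars (hypothetical file): turn the feature on for development only
enable_pr_reminder = true
```

Applying with `tofu apply -var-file=dev.tfvars` (or `terraform apply -var-file=dev.tfvars`) would then enable the feature in that environment, while production keeps the default of false.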

In feature toggle terminology, this conditional boolean variable would be referred to as a "release toggle," one of the four types of toggles defined in feature toggle development. But why does this distinction matter? Understanding the different categories of toggles helps you choose the right approach for your specific use case and manage the lifecycle of each toggle appropriately.

Categories of Toggles

[Diagram: Infrastructure Evolution]

Release Toggles

Release Toggles allow teams to separate deployment of infrastructure code from the release of infrastructure features. They're particularly valuable in Infrastructure as Code because infrastructure changes can be high-risk and difficult to roll back quickly.

In our PR Reminder example, the initial enable_pr_reminder variable was a classic Release Toggle:

variable "enable_pr_reminder" {
  description = "Enable the experimental PR Reminder feature"
  type        = bool
  default     = false
}

resource "github_repository_webhook" "pr_reminder" {
  count = var.enable_pr_reminder ? 1 : 0
  # ... configuration ...
}

Release Toggles in infrastructure are typically:

Short-lived in terms of longevity (days to weeks)
Binary in nature (on/off with no gradation or nuance)
Removed after release (cleaned up once the feature is stable)
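
Cleanup deserves a note, because removing the count argument changes the resource address from github_repository_webhook.pr_reminder[0] to github_repository_webhook.pr_reminder. Left alone, the next plan would propose destroying and recreating the webhook. One way to avoid that, sketched here under the assumption that the feature is already enabled everywhere, is a moved block (available in Terraform 1.1+ and in OpenTofu):

```hcl
# After the release: the toggle variable and the count argument are gone
resource "github_repository_webhook" "pr_reminder" {
  repository = github_repository.team_repo.name
  # ... configuration ...
}

# Tells the tool that the existing indexed instance is the same object,
# so the state address is updated instead of a destroy-and-recreate
moved {
  from = github_repository_webhook.pr_reminder[0]
  to   = github_repository_webhook.pr_reminder
}
```

The moved block can be deleted once every environment has applied the change.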

A more common example of a release toggle might be toggling a new automated backup system:

variable "enable_new_backup_system" {
  description = "Enable the new S3-based backup system"
  type        = bool
  default     = false
}

# Note: Define the backup vault resource
# resource "aws_backup_vault" "main" {
#   count = var.enable_new_backup_system ? 1 : 0
#   name  = "main-backup-vault"
# }

resource "aws_backup_plan" "new_system" {
  count = var.enable_new_backup_system ? 1 : 0
  name  = "automated-backup-plan"

  rule {
    rule_name         = "daily_backups"
    target_vault_name = aws_backup_vault.main[0].name  # Assumes vault resource is defined
    schedule          = "cron(0 5 ? * * *)"

    lifecycle {
      delete_after = 30
    }
  }
}

Until the enable_new_backup_system variable is set to true, the aws_backup_plan definition ships with the rest of the Infrastructure as Code, but count evaluates to 0 and no backup plan is created. The feature is released only when the toggle is set to true.

Experiment Toggles

Experiment Toggles facilitate A/B testing of infrastructure configurations. They're used to gather data about different infrastructure approaches and make data-driven decisions about the best configuration.

A database performance experiment illustrates this pattern:

variable "database_performance_experiment" {
  description = "A/B test for database performance settings"
  type        = string
  default     = "control"

  validation {
    condition     = contains(["control", "high_iops", "high_memory"], var.database_performance_experiment)
    error_message = "Must be control, high_iops, or high_memory"
  }
}

resource "aws_db_instance" "application_db" {
  identifier = "app-database"

  # Experiment with different instance classes
  instance_class = {
    control     = "db.t3.medium"
    high_iops   = "db.m5.large"
    high_memory = "db.r5.large"
  }[var.database_performance_experiment]

  # Experiment with storage configurations
  allocated_storage = var.database_performance_experiment == "high_iops" ? 200 : 100
  iops              = var.database_performance_experiment == "high_iops" ? 3000 : null

  tags = {
    Experiment = var.database_performance_experiment
    Purpose    = "performance-testing"
  }
}

Experiment Toggles typically:

Have multiple states (not just on/off)
Include measurement (tagged for metrics collection)
Are time-bounded (removed after statistical significance is reached)

Ops Toggles

Ops Toggles provide operational control over infrastructure behavior, acting as circuit breakers or kill switches for infrastructure features. They allow operations teams to respond quickly to incidents without code changes.

variable "ops_controls" {
  description = "Operational control flags"
  type = object({
    enable_auto_scaling    = bool
    enable_public_access   = bool
    maintenance_mode       = bool
    rate_limit_multiplier  = number
  })
  default = {
    enable_auto_scaling    = true
    enable_public_access   = true
    maintenance_mode       = false
    rate_limit_multiplier  = 1.0
  }
}

resource "aws_autoscaling_group" "web_tier" {
  count = var.ops_controls.enable_auto_scaling ? 1 : 0

  min_size         = var.ops_controls.maintenance_mode ? 1 : 3
  max_size         = var.ops_controls.maintenance_mode ? 2 : 20
  desired_capacity = var.ops_controls.maintenance_mode ? 1 : 6

  # ... other configuration ...
}

resource "aws_security_group_rule" "public_https" {
  count = var.ops_controls.enable_public_access && !var.ops_controls.maintenance_mode ? 1 : 0

  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.web.id
}

resource "aws_api_gateway_usage_plan" "api_limit" {
  name = "standard-limits"

  throttle_settings {
    rate_limit  = 1000 * var.ops_controls.rate_limit_multiplier
    # burst_limit must be a whole number, so round down fractional results
    burst_limit = floor(2000 * var.ops_controls.rate_limit_multiplier)
  }
}

Ops Toggles are characterized by:

Long-lived (may exist for months or permanently)
Quick to change (flipped through a variable update and an apply, with no code change)
Incident-response focused (designed for operational needs)
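
During an incident, an operator would typically override these defaults from a variables file rather than edit the resources. A sketch, assuming a hypothetical file named incident.tfvars; because ops_controls is declared as a plain object, every attribute has to be supplied:

```hcl
# incident.tfvars (hypothetical file): shed load and close public access
ops_controls = {
  enable_auto_scaling   = true
  enable_public_access  = false
  maintenance_mode      = true
  rate_limit_multiplier = 0.5 # Halves the API Gateway rate and burst limits
}
```

Running an apply with `-var-file=incident.tfvars` would put the stack into maintenance mode, and applying again without the file would restore the defaults.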

Permission Toggles

Permission Toggles control access to infrastructure resources based on user attributes, team membership, or other criteria. They enable gradual rollout of infrastructure access and premium features.

variable "access_controls" {
  description = "Permission-based access controls"
  type = object({
    premium_repos_enabled = bool
    admin_features_enabled = bool
    allowed_teams         = list(string)
    beta_users           = list(string)
  })
  default = {
    premium_repos_enabled  = false
    admin_features_enabled = false
    allowed_teams         = ["platform", "security"]
    beta_users           = []
  }
}

resource "github_repository" "premium_features" {
  count = var.access_controls.premium_repos_enabled ? 1 : 0
  name  = "premium-analytics"

  visibility = "private"  # The "private" argument is deprecated in favor of "visibility"
}

resource "github_team_repository" "premium_access" {
  # Conditional kept on one line; toset() wraps both branches so the types match
  for_each = toset(var.access_controls.premium_repos_enabled ? var.access_controls.allowed_teams : [])

  team_id    = data.github_team.teams[each.key].id
  repository = github_repository.premium_features[0].name
  permission = "push"
}

resource "github_repository_collaborator" "beta_access" {
  for_each = toset(var.access_controls.beta_users)

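  repository = github_repository.experimental_features.name  # Assumes this repository is defined elsewhere
  username   = each.key
  # The github_user data source does not expose team membership, so this assumes
  # a local.user_teams map (username => team name) maintained elsewhere
  permission = contains(var.access_controls.allowed_teams,
    lookup(local.user_teams, each.key, "")) ? "admin" : "pull"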
}

Permission Toggles typically:

Are very long-lived (often permanent)
Have complex rules (based on multiple attributes)
Affect access control (who can use what)
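
The team lookup referenced in the example above (data.github_team.teams) is not shown. A minimal sketch of how it might be defined, assuming the entries in allowed_teams are GitHub team slugs:

```hcl
# Looks up each allowed team by slug; nothing is queried when the toggle is off
data "github_team" "teams" {
  for_each = toset(var.access_controls.premium_repos_enabled ? var.access_controls.allowed_teams : [])

  slug = each.key
}
```

Guarding the data source with the same toggle as the resources that consume it keeps the keys of both for_each collections aligned.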

Modern Patterns: GitOps Integration

In 2024, feature toggles in infrastructure have evolved beyond simple conditionals. The integration with GitOps workflows through tools like ArgoCD and FluxCD has created new patterns for progressive infrastructure delivery.

[Diagram: Infrastructure Evolution]

Dynamic Configurations

As development progresses, the team realizes that simply toggling the feature on or off isn't granular enough. They need to test different configurations of the PR Reminder system, so they evolve their toggle into a more sophisticated configuration object:

variable "pr_reminder_config" {
  description = "Configuration for PR Reminder feature"
  type = object({
    enabled              = bool
    mode                = string # "off", "passive", "active", "aggressive"
    reminder_intervals  = list(string)
    escalation_enabled  = bool
    calendar_integration = bool
  })
  default = {
    enabled              = false
    mode                = "off"
    reminder_intervals  = []
    escalation_enabled  = false
    calendar_integration = false
  }
}

locals {
  pr_reminder_enabled = var.pr_reminder_config.enabled && var.pr_reminder_config.mode != "off"

  reminder_intervals = {
    passive    = ["24h", "72h"]
    active     = ["6h", "24h", "48h"]
    aggressive = ["2h", "6h", "12h", "24h"]
  }

  # Parentheses let the conditional span multiple lines
  actual_intervals = (
    local.pr_reminder_enabled ?
    lookup(local.reminder_intervals, var.pr_reminder_config.mode, []) : []
  )
}

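# OpenTofu 1.7+ specific: client-side state encryption for sensitive config
# This feature is not available in Terraform
# Note: the encryption block cannot reference data sources or resources, so the
# KMS key is given directly. The region below is a placeholder assumption.
terraform {
  encryption {
    key_provider "aws_kms" "main" {
      kms_key_id = "alias/terraform-state"
      region     = "us-east-1"  # Placeholder: use the region that holds your key
      key_spec   = "AES_256"
    }

    # The key provider supplies keys to an encryption method
    method "aes_gcm" "main" {
      keys = key_provider.aws_kms.main
    }

    # State encryption references the method, not the key provider
    state {
      method = method.aes_gcm.main
    }
  }
}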

resource "aws_lambda_function" "pr_reminder" {
  count         = local.pr_reminder_enabled ? 1 : 0
  filename      = "pr_reminder.zip"
  function_name = "pr-reminder-${github_repository.team_repo.name}"
  role          = aws_iam_role.pr_reminder[count.index].arn  # Assumes IAM role is defined with same count
  handler       = "index.handler"
  runtime       = "nodejs18.x"

  # KMS encryption for Lambda environment variables containing sensitive data
  kms_key_arn = data.aws_kms_key.lambda.arn

  environment {
    variables = {
      # Use AWS Secrets Manager for sensitive values
      GITHUB_TOKEN          = data.aws_secretsmanager_secret_version.github_token.secret_string
      SLACK_WEBHOOK        = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
      REMINDER_MODE        = var.pr_reminder_config.mode
      REMINDER_INTERVALS   = join(",", local.actual_intervals)
      ESCALATION_ENABLED   = var.pr_reminder_config.escalation_enabled
      CALENDAR_INTEGRATION = var.pr_reminder_config.calendar_integration
    }
  }
}
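
Because mode is a free-form string, a typo such as "agressive" would pass the `!= "off"` check and then fall through to an empty interval list. A validation block can reject unknown modes at plan time. The sketch below repeats the variable declaration with validation added; the error message wording is an assumption:

```hcl
variable "pr_reminder_config" {
  description = "Configuration for PR Reminder feature"
  type = object({
    enabled              = bool
    mode                 = string
    reminder_intervals   = list(string)
    escalation_enabled   = bool
    calendar_integration = bool
  })
  default = {
    enabled              = false
    mode                 = "off"
    reminder_intervals   = []
    escalation_enabled   = false
    calendar_integration = false
  }

  # Fail fast on any mode the locals block does not know how to handle
  validation {
    condition     = contains(["off", "passive", "active", "aggressive"], var.pr_reminder_config.mode)
    error_message = "mode must be one of: off, passive, active, aggressive."
  }
}
```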

The Business Impact: Why This Matters

Before we dive deeper into implementation patterns, let's address the question every executive asks: "What's the real business impact?" Sarah's team tracked their metrics carefully, and the results were compelling.

After implementing feature toggles, they saw:

70% reduction in deployment times (from 4 hours to 1.2 hours)
85% fewer rollback incidents (from 2 per week to 1 per month)
60% faster feature delivery (PR Reminder shipped in 6 weeks instead of projected 15)
Estimated $200,000 annual savings from avoided downtime and faster recovery

But the real transformation wasn't just in the numbers. The development team's stress levels dropped dramatically. Instead of late-night emergency rollbacks, they had controlled, reversible deployments. Product teams got their features faster. And perhaps most importantly, the infrastructure team transformed from being seen as a bottleneck to being viewed as an enabler of business agility.

[Diagram: Infrastructure Evolution]

These aren't isolated results. Across the industry, organizations using feature toggles in infrastructure report similar improvements.

Preparing for Release: From Development to Production

After several weeks of development and testing, the PR Reminder feature is nearly ready. But Sarah's team has learned from past experiences—launching a feature to all users at once is a recipe for disaster. They need a gradual rollout strategy that minimizes risk while maximizing learning.

The team implements a sophisticated toggling strategy that allows them to control exactly who gets the feature and when:

variable "pr_reminder_rollout" {
  description = "Rollout configuration for PR Reminder"
  type = object({
    stage        = string # "disabled", "internal", "pilot", "general"
    repositories = list(string) # Specific repos for pilot
    percentage   = number # For percentage-based rollout
  })
  default = {
    stage        = "disabled"
    repositories = []
    percentage   = 0
  }
}

locals {
  # Internal repos live in their own local: a local value cannot reference itself
  internal_repos = ["devops-tools", "infrastructure", "platform-core"]

  # Define which repositories get the feature at each rollout stage
  pr_reminder_repos = {
    disabled = []
    internal = local.internal_repos
    pilot    = concat(  # Combines internal repos with pilot repos
      local.internal_repos,
      var.pr_reminder_rollout.repositories
    )
    general  = [] # Will be determined by percentage
  }

  # Complex logic to determine if a specific repo should have the feature
  # This uses two strategies:
  # 1. Explicit list checking for internal/pilot stages
  # 2. Hash-based percentage rollout for general stage
  should_enable_pr_reminder = contains(
    lookup(local.pr_reminder_repos, var.pr_reminder_rollout.stage, []),
    github_repository.team_repo.name
  ) || (
    var.pr_reminder_rollout.stage == "general" && 
    # This creates a deterministic random number from the repo name
    # ensuring the same repos always get the feature during percentage rollout
    parseint(substr(md5(github_repository.team_repo.name), 0, 8), 16) % 100 < var.pr_reminder_rollout.percentage
  )
}

resource "github_repository_webhook" "pr_reminder" {
  count      = local.should_enable_pr_reminder ? 1 : 0
  repository = github_repository.team_repo.name

  configuration {
    url          = aws_lambda_function_url.pr_reminder[0].function_url  # Assumes URL resource is defined
    content_type = "json"
    insecure_ssl = false
  }

  events = ["pull_request", "pull_request_review"]
}

The rollout plan is methodical:

Week 1: Internal testing with stage = "internal"—only the platform team's repositories get the feature
Week 2: Pilot phase with stage = "pilot"—friendly teams who volunteered for early access
Week 3-4: Gradual rollout with stage = "general" starting at 10% and increasing daily
Week 5: Full rollout at 100%, with the ability to instantly roll back if issues arise

This approach gives the team multiple opportunities to catch issues before they affect everyone. When they discover that the reminder intervals are too aggressive for some teams, they can adjust the configuration before the broader rollout.
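
In practice, each stage of this plan is just a different value for pr_reminder_rollout. A sketch of what the per-stage variable files might look like; the file names and the pilot repository names are hypothetical, and each assignment would live in its own file:

```hcl
# week1.tfvars: platform team repositories only
pr_reminder_rollout = {
  stage        = "internal"
  repositories = []
  percentage   = 0
}

# week2.tfvars: internal repositories plus volunteer teams
pr_reminder_rollout = {
  stage        = "pilot"
  repositories = ["checkout-service", "ledger-api"] # Hypothetical pilot repos
  percentage   = 0
}

# week3.tfvars: hash-based rollout, starting at 10%
pr_reminder_rollout = {
  stage        = "general"
  repositories = []
  percentage   = 10
}
```

Rolling back is then a matter of re-applying with the previous stage's file, since the hash-based assignment is deterministic and no repository changes buckets between runs.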

Canary Releasing: Testing in Production Safely

Even with careful testing, Sarah's team knows that production always reveals surprises. They discovered this when the PR Reminder feature generated 500 Slack notifications in 10 minutes during a test—the notification logic didn't account for batch PR creation.

To prevent such issues from affecting all users, they implement a canary release strategy. The idea is simple but powerful: run two versions of the infrastructure simultaneously, with a small percentage of users on the new version:

variable "pr_reminder_canary" {
  description = "Canary configuration for PR Reminder"
  type = object({
    enabled     = bool
    version     = string # "stable" or "canary"
    canary_repos = list(string)
  })
  default = {
    enabled      = false
    version      = "stable"
    canary_repos = []
  }
}

# Mock providers for testing (introduced in Terraform 1.7)
# Note: Mock providers are declared in test files run by the 'terraform test' command.
# They are not defined inline in regular configuration files.
# Example test configuration would be in a separate test file:
# tests/pr_reminder_test.tftest.hcl
#
# mock_provider "aws" {
#   alias = "mock"
# }
#
# run "test_pr_reminder" {
#   providers = {
#     aws = aws.mock
#   }
#
#   variables {
#     # Every attribute of the object type must be supplied
#     pr_reminder_config = {
#       enabled              = true
#       mode                 = "active"
#       reminder_intervals   = []
#       escalation_enabled   = false
#       calendar_integration = false
#     }
#   }
# }

resource "aws_lambda_function" "pr_reminder_stable" {
  count         = var.pr_reminder_config.enabled ? 1 : 0
  filename      = "pr_reminder_stable.zip"
  function_name = "pr-reminder-stable-${github_repository.team_repo.name}"
  # ... configuration ...
}

resource "aws_lambda_function" "pr_reminder_canary" {
  count         = var.pr_reminder_canary.enabled ? 1 : 0
  filename      = "pr_reminder_canary.zip"
  function_name = "pr-reminder-canary-${github_repository.team_repo.name}"
  # ... configuration with new features ...
}

resource "github_repository_webhook" "pr_reminder" {
  count      = local.should_enable_pr_reminder ? 1 : 0
  repository = github_repository.team_repo.name

  configuration {
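    # Parentheses let the conditional span multiple lines
    # Assumes a Lambda function URL resource is defined for each version
    url = (
      var.pr_reminder_canary.enabled && contains(var.pr_reminder_canary.canary_repos, github_repository.team_repo.name) ?
      aws_lambda_function_url.pr_reminder_canary[0].function_url :
      aws_lambda_function_url.pr_reminder_stable[0].function_url
    )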
    content_type = "json"
    insecure_ssl = false
  }

  events = ["pull_request", "pull_request_review"]
}
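
When two versions run side by side, it helps to be able to see which one a given repository is routed to. One option, sketched here with hypothetical names (is_canary_repo, pr_reminder_version), is to compute the routing decision once in a local and expose it as an output:

```hcl
locals {
  # True only when the canary is switched on and this repository opted in
  is_canary_repo = (
    var.pr_reminder_canary.enabled &&
    contains(var.pr_reminder_canary.canary_repos, github_repository.team_repo.name)
  )
}

output "pr_reminder_version" {
  description = "Which PR Reminder version this repository's webhook targets"
  value       = local.is_canary_repo ? "canary" : "stable"
}
```

The webhook's url expression could reuse local.is_canary_repo as its condition, which keeps the routing rule in one place.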

A/B Testing: Data-Driven Infrastructure Decisions

One of the most heated debates in Sarah's team was about reminder frequency. The backend team lead insisted that aggressive reminders (every 2 hours) would speed up PR reviews. The frontend team lead argued this would cause notification fatigue. Rather than endless meetings, Sarah proposed a solution: "Let's test it and let the data decide."

They implemented an A/B test across their repositories:

variable "pr_reminder_experiment" {
  description = "A/B test configuration for PR Reminder"
  type = object({
    enabled = bool
    variants = map(object({
      weight             = number
      reminder_intervals = list(string)
      escalation_hours  = number
    }))
  })
  default = {
    enabled = false
    variants = {
      control = {
        weight             = 50
        reminder_intervals = ["6h", "24h", "48h"]
        escalation_hours  = 72
      }
      aggressive = {
        weight             = 25
        reminder_intervals = ["2h", "6h", "12h"]
        escalation_hours  = 24
      }
      relaxed = {
        weight             = 25
        reminder_intervals = ["24h", "72h"]
        escalation_hours  = 120
      }
    }
  }
}

locals {
  # Deterministic assignment to variant based on repository name
  repo_hash = parseint(substr(md5(github_repository.team_repo.name), 0, 8), 16)
  variant_selection = local.repo_hash % 100

  selected_variant = var.pr_reminder_experiment.enabled ? (
    local.variant_selection < 50 ? "control" :
    local.variant_selection < 75 ? "aggressive" : "relaxed"
  ) : "control"

  variant_config = var.pr_reminder_experiment.variants[local.selected_variant]
}

resource "aws_lambda_function" "pr_reminder" {
  count = local.should_enable_pr_reminder ? 1 : 0
  # ... other configuration ...

  environment {
    variables = {
      EXPERIMENT_VARIANT   = local.selected_variant
      REMINDER_INTERVALS   = join(",", local.variant_config.reminder_intervals)
      ESCALATION_THRESHOLD = "${local.variant_config.escalation_hours}h"
      # Include variant in metrics for analysis
      METRICS_TAGS = jsonencode({
        variant = local.selected_variant
        repo    = github_repository.team_repo.name
      })
    }
  }
}
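
The thresholds in the locals block above (50 and 75) are hard-coded, so changing a weight in the variable has no effect on assignment. A possible refinement, sketched here with hypothetical local names and assuming the weights always sum to 100, derives the thresholds from the variants map. It reuses local.variant_selection from the example above and walks the variants in alphabetical order, so individual repositories may land in different variants than under the hard-coded version:

```hcl
locals {
  # Alphabetical order keeps the assignment stable across runs
  variant_names = sort(keys(var.pr_reminder_experiment.variants))

  # Running total of weights; with the default variants this is [25, 75, 100]
  variant_bounds = [
    for i, name in local.variant_names :
    sum([for n in slice(local.variant_names, 0, i + 1) : var.pr_reminder_experiment.variants[n].weight])
  ]

  # First variant whose upper bound exceeds the repository's bucket (0-99).
  # Indexing with [0] fails if the weights sum to less than 100.
  weighted_variant = [
    for i, name in local.variant_names : name
    if local.variant_selection < local.variant_bounds[i]
  ][0]
}
```

A validation block on pr_reminder_experiment that checks the weights sum to 100 would turn the indexing failure into a clear error message.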

After running the experiment for a month, the results were eye-opening:

Backend teams with the "aggressive" variant had 40% faster PR merge times and reported higher satisfaction
Frontend teams with the "relaxed" variant had 15% better review quality scores and lower reviewer burnout
Overall, teams preferred different settings based on their workflow, not a one-size-fits-all approach

This data-driven approach ended the debate and led to a personalized configuration system where each team could choose their preferred reminder style.

Implementation Techniques

So far, we've seen how feature toggles helped Sarah's team navigate the complexity of releasing infrastructure incrementally. But as systems grow, the basic if/then toggling we've used can quickly lead to messy, hard-to-maintain infrastructure code.

The real challenge isn't just adding toggles—it's adding them in a way that remains maintainable as your infrastructure grows from dozens to hundreds of resources. This is where implementation patterns become crucial. Let's explore sophisticated patterns that keep your infrastructure code clean and manageable even as toggle complexity increases.

Toggle Points and Toggle Routers

In traditional software, we separate the toggle point (where the decision is made) from the toggle router (which makes the decision). The same principle applies to infrastructure, and for good reason. When you scatter toggle logic throughout your code, you end up with what I call "toggle spaghetti"—conditional statements everywhere, making it nearly impossible to understand what combinations of toggles are active or how they interact with each other.

The solution is architectural: separate the places where you check toggle states (toggle points) from the place where you decide what those states should be (toggle router). This separation provides several key benefits: it centralizes complex toggle logic in one place, makes testing toggle combinations manageable, and allows you to evolve toggle decision logic without touching every resource that uses it.

Think of the toggle router as your infrastructure's "decision headquarters." It receives raw toggle inputs—boolean flags, environment names, team identifiers—and produces clean, contextual decisions that resources can use without needing to understand the underlying complexity.

Here's how Sarah's team refactored their growing collection of toggles into a cleaner pattern:

# Toggle Router Module - centralizes all toggle logic
module "toggle_router" {
  source = "./modules/toggle-router"

  feature_flags = {
    pr_reminder     = var.enable_pr_reminder
    advanced_monitoring = var.enable_monitoring
    beta_features   = var.enable_beta
  }

  context = {
    environment = var.environment
    region      = var.aws_region
    team        = var.team_name
  }
}

# Toggle Points - resources simply use the decisions
resource "github_repository_webhook" "pr_reminder" {
  count = module.toggle_router.decisions.pr_reminder ? 1 : 0
  # ... configuration ...
}

The beauty of this pattern is that your resources don't need to know about the complex logic determining whether a feature should be enabled. They simply check the decision from the router. Here's what happens inside the router module:

# modules/toggle-router/main.tf
variable "feature_flags" {
  type = map(bool)
}

variable "context" {
  type = map(string)
}

locals {
  # Complex routing logic centralized here
  decisions = {
    pr_reminder = (
      var.feature_flags.pr_reminder && 
      var.context.environment != "production"
    ) || (
      var.feature_flags.pr_reminder && 
      var.context.environment == "production" && 
      contains(["platform", "devops"], var.context.team)
    )

    advanced_monitoring = (
      var.feature_flags.advanced_monitoring &&
      contains(["production", "staging"], var.context.environment)
    )

    beta_features = (
      var.feature_flags.beta_features &&
      var.context.environment == "development"
    )
  }
}

output "decisions" {
  value = local.decisions
}
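
Centralizing the logic also makes it testable in isolation. The sketch below uses the native test framework (`terraform test` or `tofu test`, version 1.6+) against the router module. The file path, the team name "payments", and the region value are hypothetical:

```hcl
# modules/toggle-router/tests/router.tftest.hcl (hypothetical path)

# All flags on, evaluated for a non-platform team in production
variables {
  feature_flags = {
    pr_reminder         = true
    advanced_monitoring = true
    beta_features       = true
  }
  context = {
    environment = "production"
    region      = "us-east-1"
    team        = "payments"
  }
}

run "production_limits_features_by_context" {
  command = plan

  assert {
    condition     = output.decisions.pr_reminder == false
    error_message = "PR Reminder should stay off in production for teams outside platform/devops"
  }

  assert {
    condition     = output.decisions.advanced_monitoring == true
    error_message = "Advanced monitoring should be on in production when its flag is set"
  }

  assert {
    condition     = output.decisions.beta_features == false
    error_message = "Beta features should only be enabled in development"
  }
}
```

Because the module contains no resources, the test runs without credentials, and each additional run block can override variables to cover another combination.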

Inversion of Control

For more complex scenarios, we can use Inversion of Control to inject different infrastructure configurations based on toggle state. This pattern moves beyond simple on/off toggles to completely swapping out entire infrastructure implementations.

The key insight here is that instead of having your main configuration choose between different resource configurations, you let the toggle system choose which module to use entirely. This approach works particularly well when you're evaluating fundamentally different architectural approaches.

For example, Sarah's team needed to test three different repository management strategies: a standard approach for most teams, an experimental approach with advanced features, and a beta approach for early adopters. Rather than toggling individual features, they used module selection:

# Define interface for repository configuration
variable "repository_config_module" {
  description = "Module path for repository configuration"
  type        = string
  default     = "./modules/standard-repo"
}

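# Use dynamic module selection.
# Note: a variable in `source` requires OpenTofu 1.8 or later (early evaluation
# of static values). Terraform only accepts a literal string here.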
module "selected_repo_config" {
  source = var.repository_config_module

  repo_name   = var.repository_name
  team_name   = var.team_name
  compliance  = var.compliance_requirements
}

# In terraform.tfvars for different environments:
# Development: repository_config_module = "./modules/experimental-repo"
# Production:  repository_config_module = "./modules/standard-repo"
# Beta:        repository_config_module = "./modules/beta-repo"

Each module implements the same interface but with different behavior. The standard module provides basic features, while the experimental module includes advanced capabilities like wikis, projects, and sophisticated merge strategies. This separation keeps each approach clean and testable while avoiding the complexity of conditional logic scattered throughout your configuration.
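
Terraform requires a literal string for a module's source, so the variable-driven source shown earlier will not plan there. One possible Terraform-compatible sketch of the same idea declares every candidate module and uses count to instantiate exactly one. The variable name repo_strategy is an assumption introduced for this example; the module paths and inputs are the ones used above.

```hcl
# Hypothetical selector variable, not part of the original configuration.
variable "repo_strategy" {
  description = "Which repository implementation to instantiate"
  type        = string
  default     = "standard"

  validation {
    condition     = contains(["standard", "experimental", "beta"], var.repo_strategy)
    error_message = "repo_strategy must be one of: standard, experimental, beta."
  }
}

# Each module has a literal source; count decides which one exists.
module "standard_repo" {
  source = "./modules/standard-repo"
  count  = var.repo_strategy == "standard" ? 1 : 0

  repo_name  = var.repository_name
  team_name  = var.team_name
  compliance = var.compliance_requirements
}

module "experimental_repo" {
  source = "./modules/experimental-repo"
  count  = var.repo_strategy == "experimental" ? 1 : 0

  repo_name  = var.repository_name
  team_name  = var.team_name
  compliance = var.compliance_requirements
}

module "beta_repo" {
  source = "./modules/beta-repo"
  count  = var.repo_strategy == "beta" ? 1 : 0

  repo_name  = var.repository_name
  team_name  = var.team_name
  compliance = var.compliance_requirements
}
```

The trade-off is verbosity: adding a fourth implementation means adding a fourth block, but every source remains statically known to both tools.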

Strategy Pattern

The Strategy pattern is one of the most elegant approaches to handling complex infrastructure variations. It's particularly valuable when you have multiple related settings that need to change together coherently.

The real power of this pattern emerged when Sarah's team needed to handle different operational scenarios. During normal operations, they wanted conservative scaling. During product launches, they needed balanced scaling. During Black Friday, they required aggressive scaling. Rather than toggling dozens of individual settings and hoping they were compatible, they defined complete strategies:

locals {
  scaling_strategies = {
    conservative = {
      min_size               = 2
      max_size               = 10
      target_cpu_utilization = 70
      scale_up_cooldown      = 300
      scale_down_cooldown    = 900
    }

    balanced = {
      min_size               = 3
      max_size               = 20
      target_cpu_utilization = 60
      scale_up_cooldown      = 180
      scale_down_cooldown    = 600
    }

    aggressive = {
      min_size               = 5
      max_size               = 50
      target_cpu_utilization = 50
      scale_up_cooldown      = 60
      scale_down_cooldown    = 300
    }
  }

  selected_strategy = local.scaling_strategies[var.scaling_strategy]
}

resource "aws_autoscaling_group" "app" {
  min_size         = local.selected_strategy.min_size
  max_size         = local.selected_strategy.max_size
  desired_capacity = local.selected_strategy.min_size

  # ... other configuration ...
}

resource "aws_autoscaling_policy" "cpu" {
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    target_value = local.selected_strategy.target_cpu_utilization
  }
}
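
The configuration above reads var.scaling_strategy without declaring it. A minimal sketch of that declaration follows, assuming "conservative" is the desired default. The allowed values are repeated in the validation rule because older Terraform versions do not let a variable's validation refer to locals.

```hcl
variable "scaling_strategy" {
  description = "Name of the scaling strategy to apply (a key of local.scaling_strategies)"
  type        = string
  default     = "conservative" # assumed default for normal operations

  validation {
    condition = contains(
      ["conservative", "balanced", "aggressive"],
      var.scaling_strategy
    )
    error_message = "scaling_strategy must be one of: conservative, balanced, aggressive."
  }
}
```

With this in place, a typo such as "agressive" is rejected with a clear message instead of failing later on a missing map key.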

Toggle Configuration

As Sarah's team discovered, managing toggle configuration becomes increasingly important as the number of toggles grows. After adding toggles for the PR Reminder, advanced monitoring, beta features, and several other capabilities, they found themselves struggling to keep track of which toggles were active in which environments.

Unlike application feature toggles that can change at runtime, infrastructure toggles often need to be more static due to the nature of infrastructure provisioning. However, this constraint actually forces us to think more carefully about toggle design, leading to more robust and maintainable solutions.

The fundamental challenge with infrastructure toggle configuration is the tension between flexibility and safety. You want toggles to be configurable enough to support different environments and use cases, but stable enough that infrastructure changes are predictable and auditable. The patterns we explore here represent different points on this spectrum, from simple static configuration to sophisticated dynamic systems.

Let's explore three approaches to toggle configuration, each offering different trade-offs between simplicity and flexibility.

Static Configuration

The simplest approach uses Terraform/OpenTofu variables. This method treats toggles as plan-time constants: their values are fixed when you run terraform plan or tofu plan.

While this might seem limiting compared to runtime feature flags in applications, it has several advantages for infrastructure:

Changes are explicit and version-controlled
Toggle states are clearly documented in your tfvars files
All team members can see exactly what configuration is active for each environment
No external dependencies that could fail during deployment

# toggles.tfvars - simple, clear, version-controlled
enable_pr_reminder     = true
enable_beta_features   = false
scaling_strategy       = "balanced"
experiment_variant     = "control"

Static configuration works particularly well for release toggles and long-lived operational toggles where you don't need frequent changes.
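
For the tfvars file above to load, each toggle needs a matching variable declaration. The sketch below is one plausible set of declarations; the defaults and descriptions are assumptions, chosen so that every toggle is off unless a tfvars file turns it on.

```hcl
# variables.tf
variable "enable_pr_reminder" {
  description = "Release toggle for the PR Reminder feature"
  type        = bool
  default     = false
}

variable "enable_beta_features" {
  description = "Enables features still in beta"
  type        = bool
  default     = false
}

variable "scaling_strategy" {
  description = "Named scaling strategy"
  type        = string
  default     = "conservative"
}

variable "experiment_variant" {
  description = "Experiment arm applied to this deployment"
  type        = string
  default     = "control"
}

# Select the toggle file explicitly when planning:
#   terraform plan -var-file=toggles.tfvars
#   tofu plan -var-file=toggles.tfvars
```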

Hierarchical Configuration

For larger organizations, hierarchical configuration allows for overrides at different levels. Sarah's team discovered this need when they started managing infrastructure for multiple teams, each with different requirements but sharing common patterns.

The challenge was clear: the platform team needed certain security toggles always enabled, the frontend team needed CDN features, and the data team needed different backup strategies. Creating separate toggle variables for every combination would have resulted in hundreds of variables.

Instead, they implemented a hierarchical system where more specific contexts override more general ones:

# Global defaults
variable "global_toggles" {
  type = map(bool)
  default = {
    enhanced_monitoring = true
    auto_scaling       = true
    public_access      = false
  }
}

# Environment overrides
variable "environment_toggles" {
  type = map(map(bool))
  default = {
    production = {
      public_access = true
      debug_mode    = false
    }
    staging = {
      debug_mode = true
    }
    development = {
      auto_scaling = false
      debug_mode   = true
    }
  }
}

# Team overrides
variable "team_toggles" {
  type = map(map(bool))
  default = {
    platform = {
      enhanced_monitoring = true
      experimental_features = true
    }
    frontend = {
      cdn_enabled = true
    }
  }
}

locals {
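  # Merge configurations with precedence: later arguments override earlier
  # ones, so team settings beat environment settings, which beat global defaults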
  effective_toggles = merge(
    var.global_toggles,
    lookup(var.environment_toggles, var.environment, {}),
    lookup(var.team_toggles, var.team, {})
  )
}
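
To make the precedence concrete, here is a worked example using the defaults above, assuming var.environment is "development" and var.team is "platform". It also shows a defensive way to read toggles, since not every key exists at every level.

```hcl
# With environment = "development" and team = "platform",
# local.effective_toggles evaluates to:
#
#   auto_scaling          = false  (development overrides the global true)
#   debug_mode            = true   (from development)
#   enhanced_monitoring   = true   (global, restated by platform)
#   experimental_features = true   (from platform)
#   public_access         = false  (global default, never overridden)

locals {
  # cdn_enabled is only defined for the frontend team, so read it with a
  # fallback instead of local.effective_toggles.cdn_enabled, which would
  # fail for every other team.
  cdn_enabled = lookup(local.effective_toggles, "cdn_enabled", false) # false here
  debug_mode  = lookup(local.effective_toggles, "debug_mode", false)  # true here
}
```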

Dynamic Configuration

Sometimes toggles need to change without going through a code change. Sarah's team discovered this during an incident where they needed to quickly disable auto-scaling across all environments. Waiting for a code change, review, and deployment would have taken too long.

Dynamic configuration bridges this gap by reading toggle states from external systems at plan time. The infrastructure code stays the same, and an updated toggle value takes effect on the next plan and apply, with no commit or review cycle in between:

# Read toggle configuration from AWS Systems Manager Parameter Store
data "aws_ssm_parameter" "feature_toggles" {
  name = "/infrastructure/toggles/${var.environment}"
}

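locals {
  # The data source marks `value` as sensitive. Toggle states are not secrets,
  # so unwrap the value to keep toggle-driven changes readable in plan output.
  toggle_config = jsondecode(nonsensitive(data.aws_ssm_parameter.feature_toggles.value))
}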

# Use in resources
resource "aws_lambda_function" "processor" {
  count = local.toggle_config.lambda_processor_enabled ? 1 : 0
  # ... configuration ...

  environment {
    variables = {
      FEATURE_FLAGS = jsonencode(local.toggle_config)
    }
  }
}
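
The parameter itself has to come from somewhere. One possible arrangement, sketched here under the assumption that a separate bootstrap configuration owns it, is to create the parameter once and then let operators change its value out of band. The auto_scaling_enabled key is illustrative; only lambda_processor_enabled appears in the example above.

```hcl
# Hypothetical bootstrap configuration, applied separately from the
# configuration that reads the parameter.
resource "aws_ssm_parameter" "feature_toggles" {
  name = "/infrastructure/toggles/${var.environment}"
  type = "String"

  # Initial toggle states, stored as a single JSON document.
  value = jsonencode({
    lambda_processor_enabled = true
    auto_scaling_enabled     = true # illustrative key
  })

  lifecycle {
    # Operators edit the value during incidents; do not revert their changes
    # the next time this bootstrap configuration is applied.
    ignore_changes = [value]
  }
}

# During an incident, an operator could flip a toggle with the AWS CLI:
#   aws ssm put-parameter --overwrite --type String \
#     --name /infrastructure/toggles/production \
#     --value '{"lambda_processor_enabled": true, "auto_scaling_enabled": false}'
```

The cost of this flexibility is auditability: the value no longer lives in version control, so it is worth recording who changed it and why through some other channel.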

Toggle Configuration Validation

As Sarah's team learned the hard way, it's crucial to validate toggle configurations to prevent invalid states. They once had an outage because someone enabled the PR Reminder feature while leaving the mode set to "off"—the Lambda functions were created but never received the correct configuration.

Terraform and OpenTofu provide built-in validation capabilities that catch these errors during planning, before they can affect your infrastructure:

variable "toggle_config" {
  type = object({
    pr_reminder_enabled = bool
    pr_reminder_mode   = string
    scaling_strategy   = string
    experiment_enabled = bool
    experiment_variant = string
  })

  validation {
    condition = contains(
      ["off", "passive", "active", "aggressive"],
      var.toggle_config.pr_reminder_mode
    )
    error_message = "Invalid PR reminder mode."
  }

  validation {
    condition = !(
      var.toggle_config.pr_reminder_enabled && 
      var.toggle_config.pr_reminder_mode == "off"
    )
    error_message = "PR reminder cannot be enabled with mode 'off'."
  }

  validation {
    condition = !(
      var.toggle_config.experiment_enabled &&
      var.toggle_config.experiment_variant == ""
    )
    error_message = "Experiment variant must be specified when experiment is enabled."
  }
}
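
As a quick illustration of how these rules behave, the tfvars sketch below shows a value that the second validation block would reject, followed by one that passes all three. The specific values are made up for the example.

```hcl
# Rejected at plan time with "PR reminder cannot be enabled with mode 'off'.":
#
# toggle_config = {
#   pr_reminder_enabled = true
#   pr_reminder_mode    = "off"
#   scaling_strategy    = "balanced"
#   experiment_enabled  = false
#   experiment_variant  = ""
# }

# Accepted: the mode is valid, consistent with the enabled flag, and no
# experiment is running, so an empty variant is allowed.
toggle_config = {
  pr_reminder_enabled = true
  pr_reminder_mode    = "passive"
  scaling_strategy    = "balanced"
  experiment_enabled  = false
  experiment_variant  = ""
}
```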

Working with Feature-Flagged Infrastructure

After six months of using feature toggles, Sarah's team had learned valuable lessons about operating infrastructure with toggles. The patterns that worked in theory sometimes broke down in practice, and they had to develop new approaches to testing, monitoring, and maintenance.

Let's explore the practices they developed to manage their feature-flagged infrastructure effectively.

Testing Toggle Combinations

The combinatorial explosion of toggle states can make testing challenging. With just five boolean toggles, you have 32 possible combinations. Sarah's team learned this when their test suite started taking hours to run.

The solution wasn't to test everything—it was to test strategically. They identified three critical scenarios that covered 90% of their use cases:

# test/toggle_combinations.tf
locals {
  test_scenarios = [
    {
      name = "all_disabled"       # Baseline: everything off
      toggles = {
        pr_reminder = false
        monitoring  = false
        auto_scaling = false
      }
    },
    {
      name = "production_standard" # Typical production setup
      toggles = {
        pr_reminder = true
        monitoring  = true
        auto_scaling = true
      }
    },
    {
      name = "minimal_staging"     # Cost-optimized staging
      toggles = {
        pr_reminder = false
        monitoring  = true
        auto_scaling = false
      }
    }
  ]
}

# The 'for_each' construct creates multiple test environments in parallel
module "test_infrastructure" {
  for_each = { for s in local.test_scenarios : s.name => s }
  source   = "../modules/infrastructure"

  toggles = each.value.toggles
  environment = "test-${each.key}"
}
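
To see where the 32 comes from, the full set of combinations can be generated with setproduct. This sketch is only meant to illustrate the growth, not to suggest provisioning every combination; the five toggle names are assumptions.

```hcl
locals {
  # Five illustrative boolean toggles.
  toggle_names = ["pr_reminder", "monitoring", "auto_scaling", "beta_features", "debug_mode"]

  # setproduct receives one [true, false] list per toggle and returns every
  # possible combination; zipmap turns each combination back into a named map.
  all_combinations = [
    for combo in setproduct([for name in local.toggle_names : [true, false]]...) :
    zipmap(local.toggle_names, combo)
  ]
}

output "combination_count" {
  # 2^5 = 32. Each additional boolean toggle doubles this number.
  value = length(local.all_combinations)
}
```

Ten toggles would already yield 1,024 combinations, which is why a handful of curated scenarios like those above is the more practical route.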

Toggle Debt and Lifecycle Management

Feature toggles in infrastructure can accumulate as "toggle debt." Sarah's team discovered this problem six months in, when they found 23 toggles in their codebase—12 of which nobody could remember the purpose of.

Unlike application code where you can just delete old flags, infrastructure toggles often control expensive resources. The team needed a systematic approach to lifecycle management:

# Document toggle lifecycle in code
variable "pr_reminder_toggle" {
  description = <<-EOT
    Controls PR Reminder feature rollout
    Created: 2024-01-15
    Owner: platform-team
    Expected removal: 2024-03-01
    Status: Active rollout in progress
  EOT
  type = bool
  default = false
}

# Automated toggle expiration checking
locals {
  toggle_metadata = {
    pr_reminder = {
      created = "2024-01-15"
      expires = "2024-03-01"
      owner   = "platform-team"
    }
    legacy_monitoring = {
      created = "2023-06-01"
      expires = "2023-09-01"  # Overdue!
      owner   = "sre-team"
    }
  }

  # plantimestamp() (Terraform 1.5+ / OpenTofu 1.6+) is known during planning,
  # unlike timestamp(). timecmp() compares RFC 3339 timestamps and returns 1
  # when the first argument is later than the second.
  expired_toggles = [
    for name, meta in local.toggle_metadata :
    name if timecmp(plantimestamp(), "${meta.expires}T00:00:00Z") > 0
  ]
}

# This precondition causes the Terraform/OpenTofu plan to fail if expired toggles exist,
# forcing the team to either remove the toggle or extend its lifetime with justification
resource "terraform_data" "check_toggle_expiration" {
  lifecycle {
    precondition {
      condition     = length(local.expired_toggles) == 0
      error_message = "Expired toggles found: ${join(", ", local.expired_toggles)}"
    }
  }
}

This approach transformed toggle cleanup from a manual chore to an automated gate. When the PR Reminder toggle hit its expiration date, the team had to make an explicit decision: remove it (because the feature was stable) or extend it with a documented reason.

Monitoring and Observability

Infrastructure toggles need proper monitoring. Sarah's team learned this when a misconfigured toggle increased their AWS bill by $30,000 over a single weekend. The toggle had enabled expensive GPU instances in all regions, but nobody noticed until the billing alert fired.

After that expensive lesson, they built comprehensive monitoring at three levels:

Configuration Level: A CloudWatch dashboard showing current toggle states
Impact Level: Metrics tracking how toggles affect costs and performance
Operational Level: Alerts when toggle-controlled resources misbehave

# Example: Dashboard for at-a-glance toggle visibility
resource "aws_cloudwatch_dashboard" "toggle_monitoring" {
  dashboard_name = "infrastructure-toggles"

  dashboard_body = jsonencode({
    widgets = [
      {
        type = "text"
        properties = {
          markdown = "## Current Toggle States\n\n| Toggle | State | Environment |"
        }
      },
      {
        type = "metric"
        properties = {
          metrics = [["Custom/Toggles", "ToggleUsage"]]
          title  = "Toggle-Controlled Feature Usage"
        }
      }
    ]
  })
}

This comprehensive monitoring caught issues early. When a developer accidentally enabled expensive GPU instances through a toggle, the cost alert fired within hours instead of waiting for the monthly bill.
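
A cost alert of the kind described here could be sketched as a CloudWatch alarm on the account's estimated charges. This is an assumption about how such an alert might be built, not the team's actual setup: the threshold, the alarm name, and the cost_alert_topic_arn variable are hypothetical. Billing metrics must be enabled for the account, and AWS publishes them only in us-east-1, so the provider used for this resource has to target that region.

```hcl
# Hypothetical input: an existing SNS topic that notifies the on-call channel.
variable "cost_alert_topic_arn" {
  description = "SNS topic ARN that receives cost alerts"
  type        = string
}

resource "aws_cloudwatch_metric_alarm" "toggle_cost_guard" {
  alarm_name        = "infrastructure-toggles-cost-guard"
  alarm_description = "Estimated charges exceeded the expected ceiling; check recently enabled toggles."

  namespace   = "AWS/Billing"
  metric_name = "EstimatedCharges"
  dimensions = {
    Currency = "USD"
  }

  statistic           = "Maximum"
  period              = 21600 # billing data is published a few times a day
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 50000 # illustrative ceiling in USD

  alarm_actions = [var.cost_alert_topic_arn]
}
```

EstimatedCharges accumulates over the month, so a fixed threshold catches overruns sooner late in the month than early on. AWS Budgets or Cost Anomaly Detection may be a better fit when faster detection matters.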

Toggle Governance

As more teams started using toggles, Sarah realized they needed governance to prevent chaos. Different teams were using different naming conventions, creating toggles without documentation, and worst of all, creating conflicting toggles that interfered with each other.

The solution was to embed governance rules directly into the infrastructure code:

# Define toggle governance rules
module "toggle_governance" {
  source = "./modules/governance"

  rules = {
    max_toggles_per_module = 5
    max_toggle_age_days    = 90
    required_approvers     = 2

    naming_convention = "^(release|experiment|ops|permission)_[a-z_]+$"

    required_tags = [
      "owner",
      "created_date",
      "expected_removal",
      "category"
    ]
  }

  current_toggles = {
    release_pr_reminder = {
      owner            = "platform-team"
      created_date     = "2024-01-15"
      expected_removal = "2024-03-01"
      category         = "release"
    }

    ops_scaling_override = {
      owner            = "sre-team"
      created_date     = "2024-01-01"
      expected_removal = "permanent"
      category         = "ops"
    }
  }
}

# Governance module validates and reports
output "governance_report" {
  value = module.toggle_governance.validation_report
}
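
The article does not show the inside of the governance module, so the following is only a sketch of what it might contain. It checks the naming convention and required tags and exposes the results as validation_report. Age and approver checks are omitted, and everything beyond the input and output names used above is an assumption.

```hcl
# modules/governance/main.tf (hypothetical sketch)
variable "rules" {
  type = object({
    max_toggles_per_module = number
    max_toggle_age_days    = number
    required_approvers     = number
    naming_convention      = string
    required_tags          = list(string)
  })
}

variable "current_toggles" {
  description = "Toggle name => metadata tags"
  type        = map(map(string))
}

locals {
  # regex() raises an error when nothing matches, so can() converts the
  # result into a true/false answer.
  badly_named = [
    for name in keys(var.current_toggles) :
    name if !can(regex(var.rules.naming_convention, name))
  ]

  # For each toggle, the required tags it does not define.
  missing_tags = {
    for name, tags in var.current_toggles :
    name => setsubtract(var.rules.required_tags, keys(tags))
    if length(setsubtract(var.rules.required_tags, keys(tags))) > 0
  }
}

output "validation_report" {
  value = {
    toggle_count   = length(var.current_toggles)
    over_limit     = length(var.current_toggles) > var.rules.max_toggles_per_module
    badly_named    = local.badly_named
    missing_tags   = local.missing_tags
    fully_compliant = (
      length(local.badly_named) == 0 &&
      length(local.missing_tags) == 0 &&
      length(var.current_toggles) <= var.rules.max_toggles_per_module
    )
  }
}
```

For the two toggles registered above, both names match the convention and both carry all four required tags, so this sketch would report them as compliant.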

Conclusion: From Theory to Practice

Remember Sarah's team and their PR Reminder feature? What started as a complex challenge—delivering stable infrastructure while continuing development on experimental features—became a journey of discovery about how feature toggles transform infrastructure management.

Feature toggles in Infrastructure as Code represent a fundamental shift in how we think about infrastructure management. No longer are we constrained by the binary nature of traditional infrastructure deployments—where resources either exist or they don't, where configurations are either active or they're not. The patterns demonstrated through our PR Reminder story—from simple boolean flags to sophisticated rollout strategies—show how infrastructure can evolve to match the flexibility we've come to expect from application deployments.

This evolution isn't just about technical capability; it's about changing the risk profile of infrastructure changes. Traditional infrastructure deployment is high-stakes: you're committing to a configuration before you know how it will behave in production. Feature toggles transform this into a low-stakes decision: you can deploy infrastructure changes while keeping the option to quickly revert or modify behavior based on real-world feedback.

The journey from our initial simple toggle—count = var.enable_pr_reminder ? 1 : 0—to the sophisticated rollout strategies, monitoring systems, and governance frameworks we explored demonstrates how feature toggles grow with your organizational needs. They start simple and can remain simple if that's all you need. But when your infrastructure becomes critical to business operations, they can evolve to provide the safety, observability, and control mechanisms that enterprise-scale infrastructure requires.

Real-World Success Stories

The impact of feature toggles extends far beyond Sarah's team. Organizations across industries are seeing transformative results:

Healthcare Software Provider:

Deployment time: 4.5 hours → 1.5 hours (about 67% reduction)
Failed deployments: 15% → 3% (80% reduction)
Monthly infrastructure costs: $45,000 → $38,000 (15% savings)

Financial Services Company (Multi-Cloud Migration):

Migration timeline: 18 months → 6 months
Outages during migration: 0
Cost optimization: 30% reduction through multi-cloud arbitrage
Disaster recovery time: 4 hours → 30 minutes

E-Commerce Platform (Black Friday 2023):

Peak traffic handled: 10x normal load
Infrastructure cost during event: +250% (vs +600% previous year)
Response time during peak: 250ms (vs 2000ms previous year)
Revenue: +45% year-over-year

These aren't edge cases—they're becoming the norm for organizations that embrace feature toggles in their infrastructure.

The 2023 fork that created OpenTofu has accelerated innovation in this space. While Terraform focuses on enterprise features like mock providers and native testing, OpenTofu pushes boundaries with state encryption and provider-defined functions. Both tools now offer robust support for feature toggles, though with different approaches.

Key Lessons for Implementing Feature Toggles in Infrastructure

Start Simple: Begin with basic boolean toggles and evolve as needed
Categorize Appropriately: Use the right type of toggle for your use case
Manage Lifecycle: Track creation and removal of toggles to prevent debt
Centralize Decision Logic: Use toggle routers to avoid scattered conditionals
Monitor Everything: Track toggle states and their impact on infrastructure
Plan for Testing: Consider the combinatorial explosion of toggle states
Implement Governance: Establish rules and processes for toggle management
Embrace GitOps: Integrate with modern deployment pipelines
Measure Impact: Track metrics to prove value
Clean Up Regularly: Remove toggles that have served their purpose

The Future: Infrastructure as Gradually Mutable

As we look ahead, the convergence of GitOps, feature toggles, and Infrastructure as Code points to a future where infrastructure is no longer immutable but gradually mutable. Tools like ArgoCD and FluxCD, combined with progressive delivery systems like Flagger, are making it possible to apply the same sophisticated deployment strategies to infrastructure that we use for applications.

The PR Reminder feature that seemed so complex at the start of our tale would be routine in this future—infrastructure that adapts based on real-time metrics, user feedback, and business requirements, all while maintaining the safety and auditability that Infrastructure as Code provides.

Back to Sarah's Team

Six months after implementing feature toggles, Sarah's team has transformed how they deliver infrastructure. The PR Reminder feature that once threatened to derail their landing zone deployment is now smoothly running across 80% of the organization's repositories. Teams can choose their preferred reminder frequency based on A/B test results. The canary release strategy caught three critical bugs before they affected the wider organization. And most importantly, the platform team is no longer seen as a bottleneck—they're enablers of innovation.

The journey wasn't without challenges. They had to clean up toggle debt, implement governance, and build monitoring systems. But the investment paid off: zero infrastructure-related outages in the last quarter, 70% faster feature delivery, and a team that sleeps better at night knowing they can quickly respond to any issue.

Feature toggles transform Infrastructure as Code from static definitions into dynamic, adaptable systems. They enable practices like canary deployments, A/B testing, and gradual rollouts that were once the exclusive domain of application code. However, with this power comes responsibility—every toggle adds complexity that must be managed.

Remember: feature toggles in infrastructure are not just about controlling what gets deployed—they're about enabling safer, more confident infrastructure evolution. Use them wisely, manage them carefully, and remove them promptly when their purpose is served.

The techniques and patterns described in this article work with both Terraform and OpenTofu, though some advanced features are tool-specific as noted. The principles remain the same whether you're managing AWS resources, Kubernetes configurations, or any other infrastructure components.

For code examples and implementation patterns, visit: github.com/example/infrastructure-feature-toggles
