Feature Toggles in Infrastructure as Code
Discover how feature toggles enhance Infrastructure as Code deployments. Explore practical examples with Terraform and OpenTofu for faster deployments.

Feature toggles (also called feature flags) are a powerful technique that lets teams change infrastructure behavior without changing code. They are particularly valuable when managing Infrastructure as Code (IaC) with tools like OpenTofu and Terraform, both of which provide the language features needed to implement toggles in your IaC projects.
Note: This article covers patterns that work with both OpenTofu and Terraform, highlighting differences where they exist.
[Diagram: Infrastructure Evolution]
A Toggling Tale
Here's a common scenario: You're on the platform team at a mid-sized fintech company, and your team has been tasked with creating a comprehensive GitHub landing zone, a standardized and secure foundation for all your organization's repositories. Your users, the product development teams at the company, are particularly excited about one feature: an intelligent Pull Request Reminder system that will automatically notify the right reviewers at the right time, escalate stale PRs, and even integrate with your organization's calendars to find the perfect review windows.
Sarah, your team lead, enthusiastically dubs this the "PR Reminder" feature. It sounds simple enough at first, but as the team digs in, they realize it involves complex logic including timezone calculations, team availability patterns, and integration with multiple external systems. The feature requires Infrastructure as Code configurations that interact with the GitHub provider, AWS Lambda functions for the notification logic, and EventBridge rules for scheduling.
The challenge here isn't just technical—it's organizational. Your team needs to deliver value quickly to the product teams while managing the risk of an experimental feature. This tension between speed and safety is where feature toggles become invaluable.
The Initial Implementation
As an Infrastructure as Code developer on the team, you start with a straightforward approach. You branch off from main and begin defining the PR Reminder infrastructure in the codebase:
# Works in both Terraform and OpenTofu
resource "github_repository" "team_repo" {
name = "payment-service"
description = "Payment processing microservice"
visibility = "private" # Fintech repos should be private for security
template {
owner = "github"
repository = "terraform-template-module"
include_all_branches = true
}
pages {
source {
branch = "master"
path = "/docs"
}
}
}
# PR Reminder feature - still working on this
# Nearly there...
# Note: Define the Lambda function URL resource
# resource "aws_lambda_function_url" "pr_reminder" {
# function_name = aws_lambda_function.pr_reminder.function_name
# authorization_type = "AWS_IAM"
# }
resource "github_repository_webhook" "pr_reminder" {
repository = github_repository.team_repo.name
configuration {
url = aws_lambda_function_url.pr_reminder.function_url # Assumes URL resource is defined
content_type = "json"
insecure_ssl = false
}
events = ["pull_request", "pull_request_review"]
}
# Note: Define the IAM role with appropriate Lambda execution permissions
# resource "aws_iam_role" "pr_reminder" {
# name = "pr-reminder-lambda-role"
# assume_role_policy = jsonencode({
# Version = "2012-10-17"
# Statement = [{
# Action = "sts:AssumeRole"
# Principal = { Service = "lambda.amazonaws.com" }
# Effect = "Allow"
# }]
# })
# }
# Data sources for secure credential retrieval
# data "aws_secretsmanager_secret_version" "github_token" {
# secret_id = "github-token"
# }
# data "aws_secretsmanager_secret_version" "slack_webhook" {
# secret_id = "slack-webhook-url"
# }
resource "aws_lambda_function" "pr_reminder" {
filename = "pr_reminder.zip"
function_name = "pr-reminder-${github_repository.team_repo.name}"
role = aws_iam_role.pr_reminder.arn # Assumes IAM role is defined
handler = "index.handler"
runtime = "nodejs18.x"
# Note: Consider adding KMS encryption for environment variables
# kms_key_arn = data.aws_kms_key.lambda.arn
environment {
variables = {
# Use AWS Secrets Manager for sensitive values
GITHUB_TOKEN = data.aws_secretsmanager_secret_version.github_token.secret_string
SLACK_WEBHOOK = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
# Complex configuration for reminder logic
REMINDER_INTERVALS = "2h,6h,24h,48h"
ESCALATION_THRESHOLD = "72h"
}
}
}
After a few weeks of development, the new feature is partially working but far from complete. The timezone logic is buggy, the EventBridge rules are not firing consistently, and the integration with the corporate calendar hasn't even been started. Meanwhile, the product teams are urgently requesting even a basic GitHub landing zone.
Enter Feature Toggles
Sarah, your team lead, realizes the team needs a way to deliver the stable parts of the landing zone while keeping the experimental PR Reminder feature hidden. She introduces the idea of a feature toggle: a boolean variable that creates a particular resource when true and leaves it out when false.
This approach solves the immediate problem: the team can deploy their infrastructure code to production with the PR Reminder feature safely hidden behind a toggle. Product teams get their landing zone immediately, while development continues on the complex reminder system. Here's how the implementation looks:
variable "enable_pr_reminder" {
description = "Enable the experimental PR Reminder feature"
type = bool
default = false
}
# Note: Define the Lambda function URL resource
# resource "aws_lambda_function_url" "pr_reminder" {
# count = var.enable_pr_reminder ? 1 : 0
# function_name = aws_lambda_function.pr_reminder[0].function_name
# authorization_type = "AWS_IAM"
# }
resource "github_repository_webhook" "pr_reminder" {
count = var.enable_pr_reminder ? 1 : 0
repository = github_repository.team_repo.name
configuration {
url = aws_lambda_function_url.pr_reminder[0].function_url # Assumes URL resource is defined
content_type = "json"
insecure_ssl = false
}
events = ["pull_request", "pull_request_review"]
}
# Note: Define the IAM role with appropriate Lambda execution permissions
# resource "aws_iam_role" "pr_reminder" {
# count = var.enable_pr_reminder ? 1 : 0
# name = "pr-reminder-lambda-role-${count.index}"
# assume_role_policy = jsonencode({
# Version = "2012-10-17"
# Statement = [{
# Action = "sts:AssumeRole"
# Principal = { Service = "lambda.amazonaws.com" }
# Effect = "Allow"
# }]
# })
# }
resource "aws_lambda_function" "pr_reminder" {
count = var.enable_pr_reminder ? 1 : 0
filename = "pr_reminder.zip"
function_name = "pr-reminder-${github_repository.team_repo.name}"
role = aws_iam_role.pr_reminder[count.index].arn # Assumes IAM role is defined with same count
handler = "index.handler"
runtime = "nodejs18.x"
# KMS encryption for Lambda environment variables containing sensitive data
kms_key_arn = data.aws_kms_key.lambda.arn
environment {
variables = {
# Use AWS Secrets Manager for sensitive values
GITHUB_TOKEN = data.aws_secretsmanager_secret_version.github_token.secret_string
SLACK_WEBHOOK = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
REMINDER_INTERVALS = "2h,6h,24h,48h"
ESCALATION_THRESHOLD = "72h"
}
}
}
While simple in its approach, combining this boolean variable with Terraform's (or OpenTofu's) count meta-argument lets the team release stable landing zone features to production without fear of the fragile PR Reminder feature failing at a critical time. Additionally, developers only need to set a single variable to true to turn the feature back on in their development environment, with no need to duplicate codebases.
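As a minimal sketch of that workflow, a developer could keep the override in a variable file that is only used for development. The file name dev.tfvars is an assumption for illustration; it is not part of the example codebase.

```hcl
# dev.tfvars (hypothetical file name)
# Turns the experimental feature on for a development workspace only.
enable_pr_reminder = true
```

Passing the file with terraform plan -var-file=dev.tfvars (or tofu plan -var-file=dev.tfvars) enables the feature for that run, while production keeps the default of false.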
In feature toggle terminology, this conditional boolean variable would be referred to as a "release toggle," one of the four types of toggles defined in feature toggle development. But why does this distinction matter? Understanding the different categories of toggles helps you choose the right approach for your specific use case and manage the lifecycle of each toggle appropriately.
Categories of Toggles
[Diagram: Infrastructure Evolution]
Release Toggles
Release Toggles allow teams to separate deployment of infrastructure code from the release of infrastructure features. They're particularly valuable in Infrastructure as Code because infrastructure changes can be high-risk and difficult to roll back quickly.
In our PR Reminder example, the initial enable_pr_reminder variable was a classic Release Toggle:
variable "enable_pr_reminder" {
description = "Enable the experimental PR Reminder feature"
type = bool
default = false
}
resource "github_repository_webhook" "pr_reminder" {
count = var.enable_pr_reminder ? 1 : 0
# ... configuration ...
}
Release Toggles in infrastructure are typically:
Short-lived in terms of longevity (days to weeks)
Binary in nature (on/off with no gradation or nuance)
Removed after release (cleaned up once the feature is stable)
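Cleaning up a release toggle usually means deleting the variable and the count line. One wrinkle is that the resource address then changes from pr_reminder[0] to pr_reminder, which would normally be planned as a destroy and recreate. A moved block (available since Terraform 1.1 and in OpenTofu) is one way to avoid that. The sketch below assumes the feature has been released and the toggle is being retired.

```hcl
# After release: the toggle variable and the count line are gone.
resource "github_repository_webhook" "pr_reminder" {
  repository = github_repository.team_repo.name
  events     = ["pull_request", "pull_request_review"]

  configuration {
    # Assumes the function URL resource has had its count removed as well.
    url          = aws_lambda_function_url.pr_reminder.function_url
    content_type = "json"
    insecure_ssl = false
  }
}

# Tells the planner that the old indexed instance is the same object,
# so the webhook is not destroyed and recreated.
moved {
  from = github_repository_webhook.pr_reminder[0]
  to   = github_repository_webhook.pr_reminder
}
```

Once the change has been applied everywhere, the moved block itself can be deleted in a later cleanup.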
A more common example of a release toggle might be toggling a new automated backup system:
variable "enable_new_backup_system" {
description = "Enable the new AWS Backup-based backup system"
type = bool
default = false
}
# Note: Define the backup vault resource
# resource "aws_backup_vault" "main" {
# count = var.enable_new_backup_system ? 1 : 0
# name = "main-backup-vault"
# }
resource "aws_backup_plan" "new_system" {
count = var.enable_new_backup_system ? 1 : 0
name = "automated-backup-plan"
rule {
rule_name = "daily_backups"
target_vault_name = aws_backup_vault.main[0].name # Assumes vault resource is defined
schedule = "cron(0 5 ? * * *)"
lifecycle {
delete_after = 30
}
}
}
Until the enable_new_backup_system variable is set to true, the aws_backup_plan definition ships with the rest of the Infrastructure as Code, but count evaluates to 0, so no backup plan is actually created. Setting the toggle to true and applying again provisions it.
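One practical consequence of count is that the resource becomes a list, so anything that references it has to cope with the toggle being off. A small sketch, assuming you want to expose the plan ID as an output (the output name is illustrative):

```hcl
# one() returns the single element when the toggle is on
# and null when the list is empty (toggle off).
output "backup_plan_id" {
  description = "ID of the new backup plan, or null while the toggle is off"
  value       = one(aws_backup_plan.new_system[*].id)
}
```

This avoids the invalid index error that a direct reference to aws_backup_plan.new_system[0] would raise while the toggle is false.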
Experiment Toggles
Experiment Toggles facilitate A/B testing of infrastructure configurations. They're used to gather data about different infrastructure approaches and make data-driven decisions about the best configuration.
A database performance experiment exemplifies this pattern:
variable "database_performance_experiment" {
description = "A/B test for database performance settings"
type = string
default = "control"
validation {
condition = contains(["control", "high_iops", "high_memory"], var.database_performance_experiment)
error_message = "Must be control, high_iops, or high_memory"
}
}
resource "aws_db_instance" "application_db" {
identifier = "app-database"
# Experiment with different instance classes
instance_class = {
control = "db.t3.medium"
high_iops = "db.m5.large"
high_memory = "db.r5.large"
}[var.database_performance_experiment]
# Experiment with storage configurations
allocated_storage = var.database_performance_experiment == "high_iops" ? 200 : 100
iops = var.database_performance_experiment == "high_iops" ? 3000 : null
tags = {
Experiment = var.database_performance_experiment
Purpose = "performance-testing"
}
}
Experiment Toggles typically:
Have multiple states (not just on/off)
Include measurement (tagged for metrics collection)
Are time-bounded (removed after statistical significance is reached)
Ops Toggles
Ops Toggles provide operational control over infrastructure behavior, acting as circuit breakers or kill switches for infrastructure features. They allow operations teams to respond quickly to incidents without code changes.
variable "ops_controls" {
description = "Operational control flags"
type = object({
enable_auto_scaling = bool
enable_public_access = bool
maintenance_mode = bool
rate_limit_multiplier = number
})
default = {
enable_auto_scaling = true
enable_public_access = true
maintenance_mode = false
rate_limit_multiplier = 1.0
}
}
resource "aws_autoscaling_group" "web_tier" {
count = var.ops_controls.enable_auto_scaling ? 1 : 0
min_size = var.ops_controls.maintenance_mode ? 1 : 3
max_size = var.ops_controls.maintenance_mode ? 2 : 20
desired_capacity = var.ops_controls.maintenance_mode ? 1 : 6
# ... other configuration ...
}
resource "aws_security_group_rule" "public_https" {
count = var.ops_controls.enable_public_access && !var.ops_controls.maintenance_mode ? 1 : 0
type = "ingress"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
security_group_id = aws_security_group.web.id
}
resource "aws_api_gateway_usage_plan" "api_limit" {
name = "standard-limits"
throttle_settings {
rate_limit = 1000 * var.ops_controls.rate_limit_multiplier
burst_limit = floor(2000 * var.ops_controls.rate_limit_multiplier) # burst_limit must be a whole number
}
}
Ops Toggles are characterized by:
Long-lived (may exist for months or permanently)
Changeable without code changes (flip a variable value and re-apply; nothing new has to be written or merged)
Incident-response focused (designed for operational needs)
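In Terraform and OpenTofu, flipping an ops toggle comes down to applying a new variable value. Here is a sketch of what an on-call engineer might apply during an incident, assuming a variable file named incident.tfvars (a hypothetical name). Because ops_controls is a plain object type with no optional attributes, every field has to be supplied:

```hcl
# incident.tfvars (hypothetical file name)
ops_controls = {
  enable_auto_scaling   = true
  enable_public_access  = false # removes the public HTTPS ingress rule
  maintenance_mode      = true  # shrinks the web tier to its maintenance sizes
  rate_limit_multiplier = 0.5   # halves the API rate and burst limits
}
```

Keeping such a file reviewed and ready in the repository means the incident response is a single apply with -var-file=incident.tfvars rather than an ad hoc edit.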
Permission Toggles
Permission Toggles control access to infrastructure resources based on user attributes, team membership, or other criteria. They enable gradual rollout of infrastructure access and premium features.
variable "access_controls" {
description = "Permission-based access controls"
type = object({
premium_repos_enabled = bool
admin_features_enabled = bool
allowed_teams = list(string)
beta_users = list(string)
})
default = {
premium_repos_enabled = false
admin_features_enabled = false
allowed_teams = ["platform", "security"]
beta_users = []
}
}
resource "github_repository" "premium_features" {
count = var.access_controls.premium_repos_enabled ? 1 : 0
name = "premium-analytics"
visibility = "private" # 'private = true' is deprecated in the GitHub provider
}
resource "github_team_repository" "premium_access" {
# Wrapping the conditional in toset() keeps it on one line and gives both branches the same type
for_each = toset(var.access_controls.premium_repos_enabled ? var.access_controls.allowed_teams : [])
team_id = data.github_team.teams[each.key].id # Assumes a matching data source is defined
repository = github_repository.premium_features[0].name
permission = "push"
}
resource "github_repository_collaborator" "beta_access" {
for_each = toset(var.access_controls.beta_users)
repository = github_repository.experimental_features.name # Assumes this repository is defined elsewhere
username = each.key
# The github_user data source exposes no team attribute, so the admin toggle decides the permission level
permission = var.access_controls.admin_features_enabled ? "admin" : "pull"
}
Permission Toggles typically:
Are very long-lived (often permanent)
Have complex rules (based on multiple attributes)
Affect access control (who can use what)
Modern Patterns: GitOps Integration
In 2024, feature toggles in infrastructure have evolved beyond simple conditionals. The integration with GitOps workflows through tools like ArgoCD and FluxCD has created new patterns for progressive infrastructure delivery.
[Diagram: Infrastructure Evolution]
Dynamic Configurations
As development progresses, the team realizes that simply toggling the feature on or off isn't granular enough for every use case. They need to test different configurations of the PR Reminder system, so they evolve their toggle into a more sophisticated configuration object:
variable "pr_reminder_config" {
description = "Configuration for PR Reminder feature"
type = object({
enabled = bool
mode = string # "off", "passive", "active", "aggressive"
reminder_intervals = list(string)
escalation_enabled = bool
calendar_integration = bool
})
default = {
enabled = false
mode = "off"
reminder_intervals = []
escalation_enabled = false
calendar_integration = false
}
}
locals {
pr_reminder_enabled = var.pr_reminder_config.enabled && var.pr_reminder_config.mode != "off"
reminder_intervals = {
passive = ["24h", "72h"]
active = ["6h", "24h", "48h"]
aggressive = ["2h", "6h", "12h", "24h"]
}
# Parentheses are required for a conditional that spans multiple lines
actual_intervals = (
local.pr_reminder_enabled ?
lookup(local.reminder_intervals, var.pr_reminder_config.mode, []) : []
)
}
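# OpenTofu 1.7+ specific: state encryption for sensitive config
# This feature is not available in Terraform
# Note: the encryption block is evaluated before providers and data sources,
# so the key must be given statically (a data source reference will not work here)
terraform {
encryption {
key_provider "aws_kms" "main" {
kms_key_id = "alias/terraform-state" # Key alias from the original example
region = "us-east-1" # Assumed region; replace with your own
key_spec = "AES_256"
}
method "aes_gcm" "main" {
keys = key_provider.aws_kms.main
}
state {
method = method.aes_gcm.main # State encryption references a method, not a key provider
}
}
}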
resource "aws_lambda_function" "pr_reminder" {
count = local.pr_reminder_enabled ? 1 : 0
filename = "pr_reminder.zip"
function_name = "pr-reminder-${github_repository.team_repo.name}"
role = aws_iam_role.pr_reminder[count.index].arn # Assumes IAM role is defined with same count
handler = "index.handler"
runtime = "nodejs18.x"
# KMS encryption for Lambda environment variables containing sensitive data
kms_key_arn = data.aws_kms_key.lambda.arn
environment {
variables = {
# Use AWS Secrets Manager for sensitive values
GITHUB_TOKEN = data.aws_secretsmanager_secret_version.github_token.secret_string
SLACK_WEBHOOK = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
REMINDER_MODE = var.pr_reminder_config.mode
REMINDER_INTERVALS = join(",", local.actual_intervals)
ESCALATION_ENABLED = var.pr_reminder_config.escalation_enabled
CALENDAR_INTEGRATION = var.pr_reminder_config.calendar_integration
}
}
}
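One risk with a free-form mode string is that a typo silently falls through to the empty default in lookup, leaving the feature deployed with no reminder intervals. The sketch below shows how the variable could guard against that, using optional attribute defaults (Terraform 1.3+ and OpenTofu) and a validation block. It is a possible hardening of the variable above, not the team's actual definition:

```hcl
variable "pr_reminder_config" {
  description = "Configuration for PR Reminder feature"
  type = object({
    enabled              = optional(bool, false)
    mode                 = optional(string, "off")
    reminder_intervals   = optional(list(string), [])
    escalation_enabled   = optional(bool, false)
    calendar_integration = optional(bool, false)
  })
  default = {}

  # Reject unknown modes at plan time instead of silently disabling reminders.
  validation {
    condition = contains(
      ["off", "passive", "active", "aggressive"],
      var.pr_reminder_config.mode
    )
    error_message = "mode must be one of: off, passive, active, aggressive."
  }
}
```

With optional defaults in place, a caller can write pr_reminder_config = { enabled = true, mode = "passive" } and leave the remaining attributes out.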
The Business Impact: Why This Matters
Before we dive deeper into implementation patterns, let's address the question every executive asks: "What's the real business impact?" Sarah's team tracked their metrics carefully, and the results were compelling.
After implementing feature toggles, they saw:
70% reduction in deployment times (from 4 hours to 1.2 hours)
Roughly 85% fewer rollback incidents (from about 2 per week to about 1 per month)
60% faster feature delivery (PR Reminder shipped in 6 weeks instead of projected 15)
Estimated $200,000 annual savings from avoided downtime and faster recovery
But the real transformation wasn't just in the numbers. The development team's stress levels dropped dramatically. Instead of late-night emergency rollbacks, they had controlled, reversible deployments. Product teams got their features faster. And perhaps most importantly, the infrastructure team transformed from being seen as a bottleneck to being viewed as an enabler of business agility.
[Diagram: Infrastructure Evolution]
These aren't isolated results. Across the industry, organizations using feature toggles in infrastructure report similar improvements.
Preparing for Release: From Development to Production
After several weeks of development and testing, the PR Reminder feature is nearly ready. But Sarah's team has learned from past experiences—launching a feature to all users at once is a recipe for disaster. They need a gradual rollout strategy that minimizes risk while maximizing learning.
The team implements a sophisticated toggling strategy that allows them to control exactly who gets the feature and when:
variable "pr_reminder_rollout" {
description = "Rollout configuration for PR Reminder"
type = object({
stage = string # "disabled", "internal", "pilot", "general"
repositories = list(string) # Specific repos for pilot
percentage = number # For percentage-based rollout
})
default = {
stage = "disabled"
repositories = []
percentage = 0
}
}
locals {
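# Internal repos get their own local so the pilot list can reuse them;
# a local value cannot reference itself
pr_reminder_internal_repos = ["devops-tools", "infrastructure", "platform-core"]
# Define which repositories get the feature at each rollout stage
pr_reminder_repos = {
disabled = []
internal = local.pr_reminder_internal_repos
pilot = concat( # Combines internal repos with pilot repos
local.pr_reminder_internal_repos,
var.pr_reminder_rollout.repositories
)
general = [] # Will be determined by percentage
}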
# Complex logic to determine if a specific repo should have the feature
# This uses two strategies:
# 1. Explicit list checking for internal/pilot stages
# 2. Hash-based percentage rollout for general stage
should_enable_pr_reminder = contains(
lookup(local.pr_reminder_repos, var.pr_reminder_rollout.stage, []),
github_repository.team_repo.name
) || (
var.pr_reminder_rollout.stage == "general" &&
# This creates a deterministic random number from the repo name
# ensuring the same repos always get the feature during percentage rollout
parseint(substr(md5(github_repository.team_repo.name), 0, 8), 16) % 100 < var.pr_reminder_rollout.percentage
)
}
resource "github_repository_webhook" "pr_reminder" {
count = local.should_enable_pr_reminder ? 1 : 0
repository = github_repository.team_repo.name
configuration {
url = aws_lambda_function_url.pr_reminder[0].function_url # Assumes URL resource is defined
content_type = "json"
insecure_ssl = false
}
events = ["pull_request", "pull_request_review"]
}
The rollout plan is methodical:
Week 1: Internal testing with stage = "internal", where only the platform team's repositories get the feature
Week 2: Pilot phase with stage = "pilot", extended to friendly teams who volunteered for early access
Week 3-4: Gradual rollout with stage = "general", starting at 10% and increasing daily
Week 5: Full rollout at 100%, with the ability to instantly roll back if issues arise
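The percentage check is easiest to reason about when it is computed for every repository at once. The following sketch assumes two hypothetical inputs, landing_zone_repos and rollout_percentage, rather than the single team_repo used in the example, and applies the same md5-based bucketing:

```hcl
variable "landing_zone_repos" {
  description = "Names of all repositories managed by the landing zone (hypothetical input)"
  type        = set(string)
}

variable "rollout_percentage" {
  description = "Share of repositories that should receive the feature, 0 to 100"
  type        = number
  default     = 0
}

locals {
  # Map each repository name to a stable bucket from 0 to 99.
  rollout_buckets = {
    for name in var.landing_zone_repos :
    name => parseint(substr(md5(name), 0, 8), 16) % 100
  }

  # A repository is in the rollout when its bucket falls below the percentage.
  rollout_repos = toset([
    for name, bucket in local.rollout_buckets : name
    if bucket < var.rollout_percentage
  ])
}

output "rollout_repos" {
  description = "Repositories currently included in the percentage rollout"
  value       = local.rollout_repos
}
```

Because each bucket depends only on the repository name, raising the percentage from 10 to 25 adds repositories without removing any that already had the feature.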
This approach gives the team multiple opportunities to catch issues before they affect everyone. When they discover that the reminder intervals are too aggressive for some teams, they can adjust the configuration before the broader rollout.
Canary Releasing: Testing in Production Safely
Even with careful testing, Sarah's team knows that production always reveals surprises. They discovered this when the PR Reminder feature generated 500 Slack notifications in 10 minutes during a test—the notification logic didn't account for batch PR creation.
To prevent such issues from affecting all users, they implement a canary release strategy. The idea is simple but powerful: run two versions of the infrastructure simultaneously, with a small percentage of users on the new version:
variable "pr_reminder_canary" {
description = "Canary configuration for PR Reminder"
type = object({
enabled = bool
version = string # "stable" or "canary"
canary_repos = list(string)
})
default = {
enabled = false
version = "stable"
canary_repos = []
}
}
# Terraform 1.7+ (OpenTofu added comparable support in 1.8): mock providers for testing
# Note: Mock providers are used with the 'terraform test' / 'tofu test' command
# They are declared in test files, not in regular configuration files
# Example test configuration would be in a separate test file:
# tests/pr_reminder_test.tftest.hcl
#
# mock_provider "aws" {
# alias = "mock"
# }
#
# run "test_pr_reminder" {
# providers = {
# aws = aws.mock
# }
#
# variables {
# pr_reminder_config = {
# enabled = true
# mode = "active"
# reminder_intervals = []
# escalation_enabled = false
# calendar_integration = false
# }
# }
# }
resource "aws_lambda_function" "pr_reminder_stable" {
count = var.pr_reminder_config.enabled ? 1 : 0
filename = "pr_reminder_stable.zip"
function_name = "pr-reminder-stable-${github_repository.team_repo.name}"
# ... configuration ...
}
resource "aws_lambda_function" "pr_reminder_canary" {
count = var.pr_reminder_canary.enabled ? 1 : 0
filename = "pr_reminder_canary.zip"
function_name = "pr-reminder-canary-${github_repository.team_repo.name}"
# ... configuration with new features ...
}
resource "github_repository_webhook" "pr_reminder" {
count = local.should_enable_pr_reminder ? 1 : 0
repository = github_repository.team_repo.name
configuration {
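# Parentheses are required for a conditional that spans multiple lines
# Assumes function URL resources exist for both the stable and canary functions
url = (
var.pr_reminder_canary.enabled && contains(var.pr_reminder_canary.canary_repos, github_repository.team_repo.name) ?
aws_lambda_function_url.pr_reminder_canary[0].function_url :
aws_lambda_function_url.pr_reminder_stable[0].function_url
)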
content_type = "json"
insecure_ssl = false
}
events = ["pull_request", "pull_request_review"]
}
A/B Testing: Data-Driven Infrastructure Decisions
One of the most heated debates in Sarah's team was about reminder frequency. The backend team lead insisted that aggressive reminders (every 2 hours) would speed up PR reviews. The frontend team lead argued this would cause notification fatigue. Rather than endless meetings, Sarah proposed a solution: "Let's test it and let the data decide."
They implemented an A/B test across their repositories:
variable "pr_reminder_experiment" {
description = "A/B test configuration for PR Reminder"
type = object({
enabled = bool
variants = map(object({
weight = number
reminder_intervals = list(string)
escalation_hours = number
}))
})
default = {
enabled = false
variants = {
control = {
weight = 50
reminder_intervals = ["6h", "24h", "48h"]
escalation_hours = 72
}
aggressive = {
weight = 25
reminder_intervals = ["2h", "6h", "12h"]
escalation_hours = 24
}
relaxed = {
weight = 25
reminder_intervals = ["24h", "72h"]
escalation_hours = 120
}
}
}
}
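locals {
# Deterministic assignment to variant based on repository name
repo_hash = parseint(substr(md5(github_repository.team_repo.name), 0, 8), 16)
variant_selection = local.repo_hash % 100
# Cutoffs are derived from the configured weights (assumed to sum to 100)
control_cutoff = var.pr_reminder_experiment.variants["control"].weight
aggressive_cutoff = local.control_cutoff + var.pr_reminder_experiment.variants["aggressive"].weight
selected_variant = var.pr_reminder_experiment.enabled ? (
local.variant_selection < local.control_cutoff ? "control" :
local.variant_selection < local.aggressive_cutoff ? "aggressive" : "relaxed"
) : "control"
variant_config = var.pr_reminder_experiment.variants[local.selected_variant]
}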
resource "aws_lambda_function" "pr_reminder" {
count = local.should_enable_pr_reminder ? 1 : 0
# ... other configuration ...
environment {
variables = {
EXPERIMENT_VARIANT = local.selected_variant
REMINDER_INTERVALS = join(",", local.variant_config.reminder_intervals)
ESCALATION_THRESHOLD = "${local.variant_config.escalation_hours}h"
# Include variant in metrics for analysis
METRICS_TAGS = jsonencode({
variant = local.selected_variant
repo = github_repository.team_repo.name
})
}
}
}
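The variant assignment only splits repositories as intended when the weights add up to 100. A check block (Terraform 1.5+, OpenTofu 1.6+) is one way to surface a misconfiguration during planning. The check name is illustrative:

```hcl
check "pr_reminder_experiment_weights" {
  assert {
    # Add up the weight of every configured variant.
    condition = sum([
      for variant in values(var.pr_reminder_experiment.variants) : variant.weight
    ]) == 100
    error_message = "PR Reminder experiment weights must add up to 100."
  }
}
```

A failed check is reported as a warning rather than blocking the run, so a validation block on the variable itself is the stricter alternative if a bad split should stop the plan.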
After running the experiment for a month, the results were eye-opening:
Backend teams with the "aggressive" variant had 40% faster PR merge times and reported higher satisfaction
Frontend teams with the "relaxed" variant had 15% better review quality scores and lower reviewer burnout
Overall, teams preferred different settings based on their workflow, not a one-size-fits-all approach
This data-driven approach ended the debate and led to a personalized configuration system where each team could choose their preferred reminder style.
Implementation Techniques
So far, we've seen how feature toggles helped Sarah's team navigate the complexity of releasing infrastructure incrementally. But as systems grow, the basic if/then toggling we've used can quickly lead to messy, hard-to-maintain infrastructure code.
The real challenge isn't just adding toggles—it's adding them in a way that remains maintainable as your infrastructure grows from dozens to hundreds of resources. This is where implementation patterns become crucial. Let's explore sophisticated patterns that keep your infrastructure code clean and manageable even as toggle complexity increases.
Toggle Points and Toggle Routers
In traditional software, we separate the toggle point (where the decision is made) from the toggle router (which makes the decision). The same principle applies to infrastructure, and for good reason. When you scatter toggle logic throughout your code, you end up with what I call "toggle spaghetti"—conditional statements everywhere, making it nearly impossible to understand what combinations of toggles are active or how they interact with each other.
The solution is architectural: separate the places where you check toggle states (toggle points) from the place where you decide what those states should be (toggle router). This separation provides several key benefits: it centralizes complex toggle logic in one place, makes testing toggle combinations manageable, and allows you to evolve toggle decision logic without touching every resource that uses it.
Think of the toggle router as your infrastructure's "decision headquarters." It receives raw toggle inputs—boolean flags, environment names, team identifiers—and produces clean, contextual decisions that resources can use without needing to understand the underlying complexity.
Here's how Sarah's team refactored their growing collection of toggles into a cleaner pattern:
# Toggle Router Module - centralizes all toggle logic
module "toggle_router" {
source = "./modules/toggle-router"
feature_flags = {
pr_reminder = var.enable_pr_reminder
advanced_monitoring = var.enable_monitoring
beta_features = var.enable_beta
}
context = {
environment = var.environment
region = var.aws_region
team = var.team_name
}
}
# Toggle Points - resources simply use the decisions
resource "github_repository_webhook" "pr_reminder" {
count = module.toggle_router.decisions.pr_reminder ? 1 : 0
# ... configuration ...
}
The beauty of this pattern is that your resources don't need to know about the complex logic determining whether a feature should be enabled. They simply check the decision from the router. Here's what happens inside the router module:
# modules/toggle-router/main.tf
variable "feature_flags" {
type = map(bool)
}
variable "context" {
type = map(string)
}
locals {
# Complex routing logic centralized here
decisions = {
pr_reminder = (
var.feature_flags.pr_reminder &&
var.context.environment != "production"
) || (
var.feature_flags.pr_reminder &&
var.context.environment == "production" &&
contains(["platform", "devops"], var.context.team)
)
advanced_monitoring = (
var.feature_flags.advanced_monitoring &&
contains(["production", "staging"], var.context.environment)
)
beta_features = (
var.feature_flags.beta_features &&
var.context.environment == "development"
)
}
}
output "decisions" {
value = local.decisions
}
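Centralizing the decisions also makes them testable in isolation. Below is a sketch of a test for the router using the native test framework (terraform test in Terraform 1.6+, tofu test in OpenTofu). The file path, the region value, and the team name payments are assumptions for illustration:

```hcl
# modules/toggle-router/tests/router.tftest.hcl (hypothetical path)

run "pr_reminder_hidden_from_other_teams_in_production" {
  command = plan

  variables {
    feature_flags = {
      pr_reminder         = true
      advanced_monitoring = false
      beta_features       = false
    }
    context = {
      environment = "production"
      region      = "us-east-1"
      team        = "payments"
    }
  }

  assert {
    condition     = output.decisions.pr_reminder == false
    error_message = "PR Reminder should stay off in production for teams outside platform and devops."
  }
}

run "pr_reminder_visible_to_platform_in_production" {
  command = plan

  variables {
    feature_flags = {
      pr_reminder         = true
      advanced_monitoring = false
      beta_features       = false
    }
    context = {
      environment = "production"
      region      = "us-east-1"
      team        = "platform"
    }
  }

  assert {
    condition     = output.decisions.pr_reminder == true
    error_message = "PR Reminder should be on in production for the platform team."
  }
}
```

Since the router module contains no resources, these runs need no provider credentials and can execute quickly in CI from the module directory.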
Inversion of Control
For more complex scenarios, we can use Inversion of Control to inject different infrastructure configurations based on toggle state. This pattern moves beyond simple on/off toggles to completely swapping out entire infrastructure implementations.
The key insight here is that instead of having your main configuration choose between different resource configurations, you let the toggle system choose which module to use entirely. This approach works particularly well when you're evaluating fundamentally different architectural approaches.
For example, Sarah's team needed to test three different repository management strategies: a standard approach for most teams, an experimental approach with advanced features, and a beta approach for early adopters. Rather than toggling individual features, they used module selection:
# Define interface for repository configuration
variable "repository_config_module" {
description = "Module path for repository configuration"
type = string
default = "./modules/standard-repo"
}
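# Use dynamic module selection
# Note: Terraform requires a literal string for a module source and rejects variables here.
# OpenTofu 1.8+ accepts this through early evaluation, as long as the variable's value
# is known statically (for example from a default, -var, or a .tfvars file).
module "selected_repo_config" {
source = var.repository_config_module
repo_name = var.repository_name
team_name = var.team_name
compliance = var.compliance_requirements
}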
# In terraform.tfvars for different environments:
# Development: repository_config_module = "./modules/experimental-repo"
# Production: repository_config_module = "./modules/standard-repo"
# Beta: repository_config_module = "./modules/beta-repo"
Each module implements the same interface but with different behavior. The standard module provides basic features, while the experimental module includes advanced capabilities like wikis, projects, and sophisticated merge strategies. This separation keeps each approach clean and testable while avoiding the complexity of conditional logic scattered throughout your configuration.
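Where a variable module source is not an option (Terraform, for instance, requires a literal string for source), a similar effect can be approximated by declaring each implementation and letting a selector variable decide which one is instantiated. This is a sketch, with repo_strategy as a hypothetical variable name:

```hcl
variable "repo_strategy" {
  description = "Which repository implementation to use (hypothetical selector)"
  type        = string
  default     = "standard"

  validation {
    condition     = contains(["standard", "experimental", "beta"], var.repo_strategy)
    error_message = "repo_strategy must be standard, experimental, or beta."
  }
}

# Exactly one of the three modules below gets an instance.
module "standard_repo" {
  source = "./modules/standard-repo"
  count  = var.repo_strategy == "standard" ? 1 : 0

  repo_name  = var.repository_name
  team_name  = var.team_name
  compliance = var.compliance_requirements
}

module "experimental_repo" {
  source = "./modules/experimental-repo"
  count  = var.repo_strategy == "experimental" ? 1 : 0

  repo_name  = var.repository_name
  team_name  = var.team_name
  compliance = var.compliance_requirements
}

module "beta_repo" {
  source = "./modules/beta-repo"
  count  = var.repo_strategy == "beta" ? 1 : 0

  repo_name  = var.repository_name
  team_name  = var.team_name
  compliance = var.compliance_requirements
}
```

The trade-off is some repetition in the root configuration in exchange for sources that are known statically, which keeps the configuration portable across both tools.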
Strategy Pattern
The Strategy pattern is one of the most elegant approaches to handling complex infrastructure variations. It's particularly valuable when you have multiple related settings that need to change together coherently.
The real power of this pattern emerged when Sarah's team needed to handle different operational scenarios. During normal operations, they wanted conservative scaling. During product launches, they needed balanced scaling. During Black Friday, they required aggressive scaling. Rather than toggling dozens of individual settings and hoping they were compatible, they defined complete strategies:
locals {
scaling_strategies = {
conservative = {
min_size = 2
max_size = 10
target_cpu_utilization = 70
scale_up_cooldown = 300
scale_down_cooldown = 900
}
balanced = {
min_size = 3
max_size = 20
target_cpu_utilization = 60
scale_up_cooldown = 180
scale_down_cooldown = 600
}
aggressive = {
min_size = 5
max_size = 50
target_cpu_utilization = 50
scale_up_cooldown = 60
scale_down_cooldown = 300
}
}
selected_strategy = local.scaling_strategies[var.scaling_strategy]
}
resource "aws_autoscaling_group" "app" {
min_size = local.selected_strategy.min_size
max_size = local.selected_strategy.max_size
desired_capacity = local.selected_strategy.min_size
# ... other configuration ...
}
resource "aws_autoscaling_policy" "cpu" {
  name                   = "cpu-target-tracking" # required argument; the name itself is illustrative
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = local.selected_strategy.target_cpu_utilization
  }
}
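Because the strategy is selected by map key, a typo in the variable would surface as a fairly cryptic index error. One way to guard against that is a validated input variable. This is a sketch: the default of "conservative" is an assumption based on the team's normal-operations choice.

```hcl
variable "scaling_strategy" {
  type        = string
  default     = "conservative" # assumed default for normal operations
  description = "Name of the scaling strategy to apply."

  validation {
    # Keep this list in sync with the keys of local.scaling_strategies.
    condition     = contains(["conservative", "balanced", "aggressive"], var.scaling_strategy)
    error_message = "The scaling_strategy value must be one of: conservative, balanced, aggressive."
  }
}
```

Switching to launch-day behavior then becomes a one-line change, for example scaling_strategy = "aggressive" in the relevant tfvars file.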
Toggle Configuration
As Sarah's team discovered, managing toggle configuration becomes increasingly important as the number of toggles grows. After adding toggles for the PR Reminder, advanced monitoring, beta features, and several other capabilities, they found themselves struggling to keep track of which toggles were active in which environments.
Unlike application feature toggles that can change at runtime, infrastructure toggles often need to be more static due to the nature of infrastructure provisioning. However, this constraint actually forces us to think more carefully about toggle design, leading to more robust and maintainable solutions.
The fundamental challenge with infrastructure toggle configuration is the tension between flexibility and safety. You want toggles to be configurable enough to support different environments and use cases, but stable enough that infrastructure changes are predictable and auditable. The patterns we explore here represent different points on this spectrum, from simple static configuration to sophisticated dynamic systems.
Let's explore three approaches to toggle configuration, each offering different trade-offs between simplicity and flexibility.
Static Configuration
The simplest approach uses Terraform/OpenTofu variables. This method treats toggles as compile-time constants that are resolved when you run terraform plan or tofu plan.
While this might seem limiting compared to runtime feature flags in applications, it has several advantages for infrastructure:
Changes are explicit and version-controlled
Toggle states are clearly documented in your tfvars files
All team members can see exactly what configuration is active for each environment
No external dependencies that could fail during deployment
# toggles.tfvars - simple, clear, version-controlled
enable_pr_reminder = true
enable_beta_features = false
scaling_strategy = "balanced"
experiment_variant = "control"
Static configuration works particularly well for release toggles and long-lived operational toggles where you don't need frequent changes.
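For the values in toggles.tfvars to be accepted, each one needs a matching variable declaration. The sketch below shows two of them. The descriptions and defaults are assumptions, and scaling_strategy and experiment_variant would be declared the same way as strings.

```hcl
# variables.tf - declarations matching toggles.tfvars

variable "enable_pr_reminder" {
  type        = bool
  default     = false # off unless an environment opts in
  description = "Release toggle for the PR Reminder feature."
}

variable "enable_beta_features" {
  type        = bool
  default     = false
  description = "Enables beta capabilities for early adopters."
}

# Only terraform.tfvars and *.auto.tfvars are loaded automatically.
# A file named toggles.tfvars has to be passed explicitly:
#   terraform plan -var-file=toggles.tfvars
#   tofu plan -var-file=toggles.tfvars
```

Defaulting every toggle to false keeps the safe behavior implicit, so an environment that forgets to set a value never enables a feature by accident.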
Hierarchical Configuration
For larger organizations, hierarchical configuration allows for overrides at different levels. Sarah's team discovered this need when they started managing infrastructure for multiple teams, each with different requirements but sharing common patterns.
The challenge was clear: the platform team needed certain security toggles always enabled, the frontend team needed CDN features, and the data team needed different backup strategies. Creating separate toggle variables for every combination would have resulted in hundreds of variables.
Instead, they implemented a hierarchical system where more specific contexts override more general ones:
# Global defaults
variable "global_toggles" {
type = map(bool)
default = {
enhanced_monitoring = true
auto_scaling = true
public_access = false
}
}
# Environment overrides
variable "environment_toggles" {
type = map(map(bool))
default = {
production = {
public_access = true
debug_mode = false
}
staging = {
debug_mode = true
}
development = {
auto_scaling = false
debug_mode = true
}
}
}
# Team overrides
variable "team_toggles" {
type = map(map(bool))
default = {
platform = {
enhanced_monitoring = true
experimental_features = true
}
frontend = {
cdn_enabled = true
}
}
}
locals {
# Merge configurations with precedence
effective_toggles = merge(
var.global_toggles,
lookup(var.environment_toggles, var.environment, {}),
lookup(var.team_toggles, var.team, {})
)
}
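To make the precedence concrete, here is a sketch of how the merged map might be consumed, with the result worked out for one combination of inputs. The local names and the output are illustrative assumptions, and var.environment and var.team are assumed to be declared elsewhere as strings.

```hcl
locals {
  # lookup() with a default keeps a toggle that no level defines
  # from failing the plan with a missing-key error.
  cdn_enabled  = lookup(local.effective_toggles, "cdn_enabled", false)
  auto_scaling = lookup(local.effective_toggles, "auto_scaling", false)
}

# Expose the merged result so reviewers can see it in plan output.
output "effective_toggles" {
  value = local.effective_toggles
}

# With environment = "development" and team = "frontend", the defaults
# shown above merge to:
#   enhanced_monitoring = true   (global)
#   auto_scaling        = false  (development overrides global)
#   public_access       = false  (global)
#   debug_mode          = true   (development)
#   cdn_enabled         = true   (frontend)
```

Later arguments to merge() win, so in this ordering a team override beats an environment override. If some toggles must never be overridden, such as mandatory security settings, one option is to append a final map of enforced values as the last argument.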
Dynamic Configuration
Sometimes toggles need to change without infrastructure reprovisioning. Sarah's team discovered this during an incident where they needed to quickly disable auto-scaling across all environments. Waiting for a code change, review, and deployment would have taken too long.
Dynamic configuration bridges this gap by reading toggle states from an external system at plan time. The infrastructure code stays unchanged, and an updated toggle value takes effect on the next plan and apply, with no code change or review cycle in between:
# Read toggle configuration from AWS Systems Manager Parameter Store
data "aws_ssm_parameter" "feature_toggles" {
name = "/infrastructure/toggles/${var.environment}"
}
locals {
toggle_config = jsondecode(data.aws_ssm_parameter.feature_toggles.value)
}
# Use in resources
resource "aws_lambda_function" "processor" {
count = local.toggle_config.lambda_processor_enabled ? 1 : 0
# ... configuration ...
environment {
variables = {
FEATURE_FLAGS = jsonencode(local.toggle_config)
}
}
}
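Reading toggles from an external store introduces a new failure mode: the stored document may be malformed or may lack a key the code expects. A defensive sketch, assuming the parameter holds a JSON object such as {"lambda_processor_enabled": true}, wraps the decode and each lookup in try() so the plan falls back to a safe default. The local names here are hypothetical.

```hcl
locals {
  # Fall back to an empty object if the stored value is not valid JSON.
  raw_toggles = try(
    jsondecode(data.aws_ssm_parameter.feature_toggles.value),
    {}
  )

  # Fail closed: a missing or non-boolean key counts as "disabled".
  lambda_processor_enabled = try(
    tobool(local.raw_toggles.lambda_processor_enabled),
    false
  )
}
```

Resources would then use count = local.lambda_processor_enabled ? 1 : 0. Note that try() only covers decoding and key access. If the parameter itself cannot be read, the data source still fails the plan, which is usually the behavior you want during an incident.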
Toggle Configuration Validation
As Sarah's team learned the hard way, it's crucial to validate toggle configurations to prevent invalid states. They once had an outage because someone enabled the PR Reminder feature while leaving the mode set to "off"—the Lambda functions were created but never received the correct configuration.
Terraform and OpenTofu provide built-in validation capabilities that catch these errors during planning, before they can affect your infrastructure:
variable "toggle_config" {
type = object({
pr_reminder_enabled = bool
pr_reminder_mode = string
scaling_strategy = string
experiment_enabled = bool
experiment_variant = string
})
validation {
condition = contains(
["off", "passive", "active", "aggressive"],
var.toggle_config.pr_reminder_mode
)
error_message = "Invalid PR reminder mode."
}
validation {
condition = !(
var.toggle_config.pr_reminder_enabled &&
var.toggle_config.pr_reminder_mode == "off"
)
error_message = "PR reminder cannot be enabled with mode 'off'."
}
validation {
condition = !(
var.toggle_config.experiment_enabled &&
var.toggle_config.experiment_variant == ""
)
error_message = "Experiment variant must be specified when experiment is enabled."
}
}
Working with Feature-Flagged Infrastructure
After six months of using feature toggles, Sarah's team had learned valuable lessons about operating infrastructure with toggles. The patterns that worked in theory sometimes broke down in practice, and they had to develop new approaches to testing, monitoring, and maintenance.
Let's explore the practices they developed to manage their feature-flagged infrastructure effectively.
Testing Toggle Combinations
The combinatorial explosion of toggle states can make testing challenging. With just five boolean toggles, you have 32 possible combinations. Sarah's team learned this when their test suite started taking hours to run.
The solution wasn't to test everything—it was to test strategically. They identified three critical scenarios that covered 90% of their use cases:
# test/toggle_combinations.tf
locals {
test_scenarios = [
{
name = "all_disabled" # Baseline: everything off
toggles = {
pr_reminder = false
monitoring = false
auto_scaling = false
}
},
{
name = "production_standard" # Typical production setup
toggles = {
pr_reminder = true
monitoring = true
auto_scaling = true
}
},
{
name = "minimal_staging" # Cost-optimized staging
toggles = {
pr_reminder = false
monitoring = true
auto_scaling = false
}
}
]
}
# The 'for_each' construct creates multiple test environments in parallel
module "test_infrastructure" {
for_each = { for s in local.test_scenarios : s.name => s }
source = "../modules/infrastructure"
toggles = each.value.toggles
environment = "test-${each.key}"
}
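Both Terraform and OpenTofu (1.6 and later in each case) include a native test framework that can assert on a plan without creating real resources. The sketch below expresses two of the scenarios as test runs against the infrastructure module. The file name and the resource address aws_lambda_function.pr_reminder are assumptions about how that module is written.

```hcl
# modules/infrastructure/tests/toggles.tftest.hcl (hypothetical path)

variables {
  environment = "test"
}

run "all_disabled_creates_no_reminder" {
  command = plan

  variables {
    toggles = {
      pr_reminder  = false
      monitoring   = false
      auto_scaling = false
    }
  }

  assert {
    condition     = length(aws_lambda_function.pr_reminder) == 0
    error_message = "The PR reminder Lambda should not be planned when the toggle is off."
  }
}

run "production_standard_creates_reminder" {
  command = plan

  variables {
    toggles = {
      pr_reminder  = true
      monitoring   = true
      auto_scaling = true
    }
  }

  assert {
    condition     = length(aws_lambda_function.pr_reminder) == 1
    error_message = "The PR reminder Lambda should be planned when the toggle is on."
  }
}
```

Running terraform test or tofu test from the module directory executes each run block in order. Plan-only runs keep the suite fast, which matters once the number of scenarios starts to grow.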
Toggle Debt and Lifecycle Management
Feature toggles in infrastructure can accumulate as "toggle debt." Sarah's team discovered this problem six months in, when they found 23 toggles in their codebase—12 of which nobody could remember the purpose of.
Unlike application code where you can just delete old flags, infrastructure toggles often control expensive resources. The team needed a systematic approach to lifecycle management:
# Document toggle lifecycle in code
variable "pr_reminder_toggle" {
description = <<-EOT
Controls PR Reminder feature rollout
Created: 2024-01-15
Owner: platform-team
Expected removal: 2024-03-01
Status: Active rollout in progress
EOT
type = bool
default = false
}
# Automated toggle expiration checking
locals {
toggle_metadata = {
pr_reminder = {
created = "2024-01-15"
expires = "2024-03-01"
owner = "platform-team"
}
legacy_monitoring = {
created = "2023-06-01"
expires = "2023-09-01" # Overdue!
owner = "sre-team"
}
}
  # Compare against the plan-time clock. plantimestamp() (Terraform 1.5+,
  # OpenTofu 1.6+) is known during plan, unlike timestamp(). timecmp()
  # needs RFC 3339 input, so each date is expanded to midnight UTC.
  expired_toggles = [
    for name, meta in local.toggle_metadata :
    name if timecmp(plantimestamp(), "${meta.expires}T00:00:00Z") > 0
  ]
}
# This precondition fails the Terraform/OpenTofu plan if expired toggles exist,
# forcing the team to either remove the toggle or extend its lifetime with justification
resource "null_resource" "check_toggle_expiration" {
  lifecycle {
    precondition {
      condition     = length(local.expired_toggles) == 0
      error_message = "Expired toggles found: ${join(", ", local.expired_toggles)}."
    }
  }
}
This approach transformed toggle cleanup from a manual chore to an automated gate. When the PR Reminder toggle hit its expiration date, the team had to make an explicit decision: remove it (because the feature was stable) or extend it with a documented reason.
Monitoring and Observability
Infrastructure toggles need proper monitoring. Sarah's team learned this when a misconfigured toggle increased their AWS bill by $30,000 over a single weekend. The toggle had enabled expensive GPU instances in all regions, but nobody noticed until the billing alert fired.
After that expensive lesson, they built comprehensive monitoring at three levels:
Configuration Level: A CloudWatch dashboard showing current toggle states
Impact Level: Metrics tracking how toggles affect costs and performance
Operational Level: Alerts when toggle-controlled resources misbehave
# Example: Dashboard for at-a-glance toggle visibility
resource "aws_cloudwatch_dashboard" "toggle_monitoring" {
  dashboard_name = "infrastructure-toggles"
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "text"
        properties = {
          # Header and separator rows; data rows would be appended per toggle
          markdown = "## Current Toggle States\n\n| Toggle | State | Environment |\n|---|---|---|"
        }
      },
      {
        type = "metric"
        properties = {
          metrics = [["Custom/Toggles", "ToggleUsage"]]
          region  = "us-east-1" # metric widgets expect a region; adjust to yours
          title   = "Toggle-Controlled Feature Usage"
        }
      }
    ]
  })
}
This comprehensive monitoring caught issues early. When a developer accidentally enabled expensive GPU instances through a toggle, the cost alert fired within hours instead of waiting for the monthly bill.
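A cost guardrail of this kind can be sketched with a CloudWatch alarm on the account's estimated charges. The example assumes billing alerts are enabled on the account, that the provider is configured for us-east-1 (where AWS publishes billing metrics), and that var.cost_alert_topic_arn is a hypothetical SNS topic used for notifications. The threshold is illustrative.

```hcl
# Hypothetical SNS topic that receives cost alerts.
variable "cost_alert_topic_arn" {
  type        = string
  description = "ARN of the SNS topic to notify when spend crosses the threshold."
}

resource "aws_cloudwatch_metric_alarm" "estimated_charges" {
  alarm_name        = "toggle-cost-guardrail"
  alarm_description = "Month-to-date estimated charges exceeded the expected ceiling."

  namespace   = "AWS/Billing"
  metric_name = "EstimatedCharges"
  dimensions = {
    Currency = "USD"
  }

  statistic           = "Maximum"
  period              = 21600 # six hours; billing metrics update a few times a day
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5000 # illustrative ceiling in USD

  alarm_actions = [var.cost_alert_topic_arn]
}
```

An account-wide alarm cannot say which toggle caused the spend, so it works best alongside consistent tagging of toggle-controlled resources, which lets cost reports be broken down per toggle.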
Toggle Governance
As more teams started using toggles, Sarah realized they needed governance to prevent chaos. Different teams were using different naming conventions, creating toggles without documentation, and worst of all, creating conflicting toggles that interfered with each other.
The solution was to embed governance rules directly into the infrastructure code:
# Define toggle governance rules
module "toggle_governance" {
source = "./modules/governance"
rules = {
max_toggles_per_module = 5
max_toggle_age_days = 90
required_approvers = 2
naming_convention = "^(release|experiment|ops|permission)_[a-z_]+$"
required_tags = [
"owner",
"created_date",
"expected_removal",
"category"
]
}
current_toggles = {
release_pr_reminder = {
owner = "platform-team"
created_date = "2024-01-15"
expected_removal = "2024-03-01"
category = "release"
}
ops_scaling_override = {
owner = "sre-team"
created_date = "2024-01-01"
expected_removal = "permanent"
category = "ops"
}
}
}
# Governance module validates and reports
output "governance_report" {
value = module.toggle_governance.validation_report
}
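The governance module itself is not shown in this article. As a rough sketch of what part of it might contain, the fragment below validates toggle names and categories with plain variable validation and produces a small report. Everything here is an assumption about the module's internals, and the regular expression is hardcoded because validation conditions have traditionally been limited to referencing the variable being validated.

```hcl
# modules/governance/variables.tf (hypothetical contents)
variable "current_toggles" {
  type = map(object({
    owner            = string
    created_date     = string
    expected_removal = string
    category         = string
  }))

  validation {
    # Every toggle name must follow the agreed naming convention.
    condition = alltrue([
      for name in keys(var.current_toggles) :
      can(regex("^(release|experiment|ops|permission)_[a-z_]+$", name))
    ])
    error_message = "Toggle names must start with release_, experiment_, ops_, or permission_ followed by a snake_case name."
  }

  validation {
    # The name prefix must agree with the declared category.
    condition = alltrue([
      for name, t in var.current_toggles :
      startswith(name, "${t.category}_")
    ])
    error_message = "Each toggle name must begin with its category."
  }
}

# modules/governance/outputs.tf (hypothetical contents)
output "validation_report" {
  value = {
    toggle_count = length(var.current_toggles)
    permanent = [
      for name, t in var.current_toggles :
      name if t.expected_removal == "permanent"
    ]
  }
}
```

Because the checks live in variable validation, a badly named toggle stops the plan before any resource is evaluated. Rules such as required approvers cannot be enforced from inside the configuration and would need to live in the review process or CI pipeline.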
Conclusion: From Theory to Practice
Remember Sarah's team and their PR Reminder feature? What started as a complex challenge—delivering stable infrastructure while continuing development on experimental features—became a journey of discovery about how feature toggles transform infrastructure management.
Feature toggles in Infrastructure as Code represent a fundamental shift in how we think about infrastructure management. No longer are we constrained by the binary nature of traditional infrastructure deployments—where resources either exist or they don't, where configurations are either active or they're not. The patterns demonstrated through our PR Reminder story—from simple boolean flags to sophisticated rollout strategies—show how infrastructure can evolve to match the flexibility we've come to expect from application deployments.
This evolution isn't just about technical capability; it's about changing the risk profile of infrastructure changes. Traditional infrastructure deployment is high-stakes: you're committing to a configuration before you know how it will behave in production. Feature toggles transform this into a low-stakes decision: you can deploy infrastructure changes while keeping the option to quickly revert or modify behavior based on real-world feedback.
The journey from our initial simple toggle—count = var.enable_pr_reminder ? 1 : 0—to the sophisticated rollout strategies, monitoring systems, and governance frameworks we explored demonstrates how feature toggles grow with your organizational needs. They start simple and can remain simple if that's all you need. But when your infrastructure becomes critical to business operations, they can evolve to provide the safety, observability, and control mechanisms that enterprise-scale infrastructure requires.
Real-World Success Stories
The impact of feature toggles extends far beyond Sarah's team. Organizations across industries are seeing transformative results:
Healthcare Software Provider:
Deployment time: 4.5 hours → 1.5 hours (about 67% reduction)
Failed deployments: 15% → 3% (80% reduction)
Monthly infrastructure costs: $45,000 → $38,000 (15% savings)
Financial Services Company (Multi-Cloud Migration):
Migration timeline: 18 months → 6 months
Outages during migration: 0
Cost optimization: 30% reduction through multi-cloud arbitrage
Disaster recovery time: 4 hours → 30 minutes
E-Commerce Platform (Black Friday 2023):
Peak traffic handled: 10x normal load
Infrastructure cost during event: +250% (vs +600% previous year)
Response time during peak: 250ms (vs 2000ms previous year)
Revenue: +45% year-over-year
These aren't edge cases—they're becoming the norm for organizations that embrace feature toggles in their infrastructure.
The 2023 fork that created OpenTofu has accelerated innovation in this space. Both tools now ship a native test framework and provider-defined functions, and OpenTofu has additionally introduced state encryption and early evaluation of variables in places such as module sources. Both offer robust support for feature toggles, though some details differ between them.
Key Lessons for Implementing Feature Toggles in Infrastructure
Start Simple: Begin with basic boolean toggles and evolve as needed
Categorize Appropriately: Use the right type of toggle for your use case
Manage Lifecycle: Track creation and removal of toggles to prevent debt
Centralize Decision Logic: Use toggle routers to avoid scattered conditionals
Monitor Everything: Track toggle states and their impact on infrastructure
Plan for Testing: Consider the combinatorial explosion of toggle states
Implement Governance: Establish rules and processes for toggle management
Embrace GitOps: Integrate with modern deployment pipelines
Measure Impact: Track metrics to prove value
Clean Up Regularly: Remove toggles that have served their purpose
The Future: Infrastructure as Gradually Mutable
As we look ahead, the convergence of GitOps, feature toggles, and Infrastructure as Code points to a future where infrastructure is no longer immutable but gradually mutable. Tools like ArgoCD and FluxCD, combined with progressive delivery systems like Flagger, are making it possible to apply the same sophisticated deployment strategies to infrastructure that we use for applications.
The PR Reminder feature that seemed so complex at the start of our tale would be routine in this future—infrastructure that adapts based on real-time metrics, user feedback, and business requirements, all while maintaining the safety and auditability that Infrastructure as Code provides.
Back to Sarah's Team
Six months after implementing feature toggles, Sarah's team has transformed how they deliver infrastructure. The PR Reminder feature that once threatened to derail their landing zone deployment is now smoothly running across 80% of the organization's repositories. Teams can choose their preferred reminder frequency based on A/B test results. The canary release strategy caught three critical bugs before they affected the wider organization. And most importantly, the platform team is no longer seen as a bottleneck—they're enablers of innovation.
The journey wasn't without challenges. They had to clean up toggle debt, implement governance, and build monitoring systems. But the investment paid off: zero infrastructure-related outages in the last quarter, 70% faster feature delivery, and a team that sleeps better at night knowing they can quickly respond to any issue.
Feature toggles transform Infrastructure as Code from static definitions into dynamic, adaptable systems. They enable practices like canary deployments, A/B testing, and gradual rollouts that were once the exclusive domain of application code. However, with this power comes responsibility—every toggle adds complexity that must be managed.
Remember: feature toggles in infrastructure are not just about controlling what gets deployed—they're about enabling safer, more confident infrastructure evolution. Use them wisely, manage them carefully, and remove them promptly when their purpose is served.
The techniques and patterns described in this article work with both Terraform and OpenTofu, though some advanced features are tool-specific as noted. The principles remain the same whether you're managing AWS resources, Kubernetes configurations, or any other infrastructure components.
For code examples and implementation patterns, visit: github.com/example/infrastructure-feature-toggles