Feature Flags for AI Features: Shipping Safely When Outputs Are Non-Deterministic
Rolling back a bad API endpoint takes seconds. Rolling back a bad LLM integration is harder — the damage may already be in your logs, your users' inboxes, or your clients' feeds. Feature flags are how you ship AI features without betting everything on launch day.
Anurag Verma
7 min read
You deployed a new AI feature on Friday. By Monday, it had sent 3,000 support emails with a factual error in the third paragraph. Your monitoring showed a 200 status on every request. Nothing was broken — the model was just saying something wrong in a way that passed all your tests.
Rollback was a git revert and a deploy. But 3,000 emails had already gone out.
Feature flags for AI features are not just deployment tooling. They are risk management for a class of system where correctness is probabilistic and failures are quiet.
Why AI Features Are Different
Standard feature flags manage rollout of deterministic code. Toggle the flag off and the new code path stops executing. If you shipped a bug, it stops executing too. The damage is bounded by the time between deploy and rollback.
AI features fail differently:
Silent quality degradation: The feature keeps working — it just produces outputs that are wrong, off-brand, or unhelpful. Your error rate stays at zero. Only users or a human reviewer notices.
Non-reproducible failures: An LLM will give different outputs to the same input across calls. If a user complains, you can’t reliably reproduce what they saw. The failure might not happen again in testing.
Action-based consequences: If your AI feature sends emails, posts content, modifies data, or calls APIs, a bad output has effects outside your system. Those effects persist after rollback.
Feature flags don’t prevent these failures. They reduce the blast radius — fewer users affected, less time before you catch a problem, faster path to mitigation.
The Basic Setup
Most teams use a feature flag service like LaunchDarkly, Unleash, or a simple database-backed implementation. The pattern for an AI feature:
import ldclient
from ldclient import Context

ld_client = ldclient.get()  # Assumes ldclient.set_config(...) ran at startup

def get_ai_summary(document: str, user_id: str) -> str:
    user_context = Context.builder(user_id).kind("user").build()
    use_ai_summary = ld_client.variation(
        "ai-document-summary",
        user_context,
        default=False  # Fallback to non-AI version
    )
    if use_ai_summary:
        return generate_ai_summary(document)
    else:
        return generate_rule_based_summary(document)  # Existing behavior
The default is False. If the flag service is unavailable, users get the old behavior. This is important — flag failures should fail safe.
Percentage Rollout
The core technique for AI features: roll out to a small percentage first, watch what happens, then expand.
# LaunchDarkly handles percentage rollout in the flag config
# But you can also do it manually:
import hashlib

def should_use_ai_feature(user_id: str, rollout_percentage: int) -> bool:
    """Consistent percentage rollout — same user always gets same result."""
    hash_value = int(hashlib.md5(
        f"ai-feature-v2:{user_id}".encode()
    ).hexdigest(), 16)
    return (hash_value % 100) < rollout_percentage

# Start at 1%
if should_use_ai_feature(user_id, rollout_percentage=1):
    response = generate_ai_response(query)
else:
    response = generate_legacy_response(query)
The hash ensures consistency: the same user always hits the same code path, which is important for reproducible behavior and for A/B comparisons.
Typical rollout schedule for a high-stakes AI feature:
- Day 1: 1% (internal users, beta testers)
- Day 3: 5% if no quality issues surfaced
- Day 7: 20%
- Day 14: 50%
- Day 21: 100%
The exact timeline depends on volume. At 10,000 requests/day, 1% gives you 100 real-world examples to evaluate by end of day. At 100 requests/day, you need a higher initial percentage or a longer ramp.
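One way to keep the ramp explicit in code rather than in someone's head is to derive the percentage from days since launch. A minimal sketch of that idea, reusing should_use_ai_feature from above; the dates and steps mirror the schedule listed here and are assumptions to adjust for your own volume:

from datetime import date
from typing import Optional

# Days since launch -> rollout percentage, mirroring the schedule above
ROLLOUT_SCHEDULE = [(0, 1), (3, 5), (7, 20), (14, 50), (21, 100)]
LAUNCH_DATE = date(2025, 6, 2)  # Hypothetical launch date

def current_rollout_percentage(today: Optional[date] = None) -> int:
    days_live = ((today or date.today()) - LAUNCH_DATE).days
    percentage = 0
    for day, pct in ROLLOUT_SCHEDULE:
        if days_live >= day:
            percentage = pct
    return percentage

# Combined with the hash-based check above. Expansion should still be gated
# on the quality checks described later, not just the calendar.
if should_use_ai_feature(user_id, current_rollout_percentage()):
    response = generate_ai_response(query)
else:
    response = generate_legacy_response(query)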
Canary by User Segment
Percentage rollout is blunt. For better control, roll out to specific segments first:
from random import random

def resolve_ai_flag(user: User) -> bool:
    # Internal users first
    if user.is_internal:
        return True
    # Beta program participants
    if user.beta_opt_in:
        return ld_client.variation("ai-feature", user.context, False)
    # Enterprise tier (high-value, direct feedback channel)
    # 5% random sample; use the hash-based check above if you need per-user consistency
    if user.tier == "enterprise" and random() < 0.05:
        return True
    return False
Internal users catch obvious failures before external users see them. Beta users have opted in to seeing new features and often provide explicit feedback. Enterprise users are watched closely — any issues surface quickly through account managers.
This sequence gives you staged quality gates before broad rollout.
What to Monitor During Rollout
Standard API monitoring (error rates, latency) is necessary but not sufficient for AI features. Add these:
Output length distribution: AI-generated content that’s suddenly 5x longer or 50% shorter than baseline is often a signal of a prompt regression or model change.
User engagement signals: For content generation — do users edit the AI output before using it? High edit rates mean the AI output isn’t quite right, even when users don’t complain explicitly.
Downstream action rates: If the AI feature generates something that triggers a downstream action (user clicks, confirms, shares), compare rates between flag-on and flag-off groups. A drop suggests output quality degraded.
Explicit feedback: Add a minimal feedback signal near AI-generated content. Even a thumbs down button gives you a noisy but real quality signal.
from datetime import datetime

# analytics and current_model are your existing analytics client and model identifier
def track_ai_output(user_id: str, output: str, flag_variant: str):
    analytics.track("ai_output_generated", {
        "user_id": user_id,
        "flag_variant": flag_variant,  # "control" or "treatment"
        "output_length": len(output),
        "output_word_count": len(output.split()),
        "model": current_model,
        "timestamp": datetime.utcnow().isoformat()
    })
Compare these metrics between flag variants during rollout. Divergence is a signal — investigate before expanding.
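What counts as divergence depends on the feature, but the comparison itself is mechanical. A rough sketch, assuming you can query the events tracked above back out of your analytics store; query_metrics and alert are hypothetical helpers standing in for your own tooling:

def compare_variants(feature: str, window_hours: int = 24) -> None:
    """Flag large control/treatment gaps on the metrics tracked above."""
    control = query_metrics(feature, variant="control", hours=window_hours)
    treatment = query_metrics(feature, variant="treatment", hours=window_hours)

    checks = {
        "avg_output_length": (control.avg_output_length, treatment.avg_output_length),
        "downstream_action_rate": (control.action_rate, treatment.action_rate),
        "thumbs_down_rate": (control.thumbs_down_rate, treatment.thumbs_down_rate),
    }
    for name, (baseline, candidate) in checks.items():
        # 25% relative divergence is an arbitrary starting threshold
        if baseline and abs(candidate - baseline) / baseline > 0.25:
            alert(f"{feature}: {name} diverged {baseline:.2f} -> {candidate:.2f}")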
Automatic Kill Switches
Manual monitoring requires someone to be watching. For production AI features with real-world consequences, add automatic kill switches:
from dataclasses import dataclass

@dataclass
class AISafetyMonitor:
    max_error_rate: float = 0.05
    max_avg_output_length: int = 5000
    window_minutes: int = 15

    def should_disable_feature(self, feature_name: str) -> bool:
        # get_recent_metrics and log_and_alert are hooks into your own
        # metrics store and alerting; implement them for your stack
        metrics = self.get_recent_metrics(feature_name, self.window_minutes)
        if metrics.error_rate > self.max_error_rate:
            self.log_and_alert(f"High error rate: {metrics.error_rate:.1%}")
            return True
        if metrics.avg_output_length > self.max_avg_output_length:
            self.log_and_alert(f"Abnormal output length: {metrics.avg_output_length}")
            return True
        return False

# In your request handler
monitor = AISafetyMonitor()
if monitor.should_disable_feature("ai-document-summary"):
    # Emergency disable — bypass flag, use safe fallback
    response = generate_rule_based_summary(document)
else:
    response = generate_ai_summary(document) if flag_enabled else generate_rule_based_summary(document)
This is a last-resort circuit breaker. Set the thresholds conservatively — they should trigger on genuinely anomalous behavior, not normal variance.
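One practical detail: you probably don't want to hit your metrics store on every request. A small sketch of caching the decision for a short interval; the interval and helper names are assumptions, not part of the monitor above:

import time

_KILL_CACHE: dict[str, tuple[float, bool]] = {}
CHECK_INTERVAL_SECONDS = 60  # Assumed re-check interval

def feature_is_killed(feature_name: str) -> bool:
    """Re-evaluate the safety monitor at most once per interval."""
    now = time.monotonic()
    cached = _KILL_CACHE.get(feature_name)
    if cached is not None and now - cached[0] < CHECK_INTERVAL_SECONDS:
        return cached[1]
    disabled = monitor.should_disable_feature(feature_name)
    _KILL_CACHE[feature_name] = (now, disabled)
    return disabled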
Model Version Flags
When your LLM provider releases a new model version, treat it like any other AI feature change: put it behind a flag.
def get_model_for_request(user_id: str, feature: str) -> str:
    user_context = Context.builder(user_id).kind("user").build()
    model_flag = ld_client.variation(
        f"{feature}-model-version",
        user_context,
        default="gpt-4o-2024-11"  # Stable baseline
    )
    return model_flag  # Returns e.g. "gpt-4o-2025-05" for the treatment group
This lets you test a new model version on real traffic against your actual prompts, with easy rollback if quality degrades. Model providers sometimes change behavior in minor versions without announcing it explicitly. A flag gives you the data to detect this before it affects all users.
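The returned string plugs straight into whatever client you already use. A minimal sketch with an OpenAI-style client; the wrapper function, system prompt, and client setup are placeholders rather than anything prescribed above:

from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def generate_summary_for(user_id: str, document: str) -> str:
    model = get_model_for_request(user_id, feature="ai-document-summary")
    response = client.chat.completions.create(
        model=model,  # The flag decides which version this user gets
        messages=[
            {"role": "system", "content": "Summarize the document in three sentences."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content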
The Permanent Flag Pattern
Most feature flags are temporary — they exist until the feature is fully rolled out, then they’re deleted. For AI features with real-world consequences (email sending, content publishing, data modification), consider making the kill switch permanent:
# This flag is never "fully on" — it stays in your flag dashboard forever
AI_EMAIL_DRAFTING_ENABLED = ld_client.variation(
"ai-email-drafting", # Never cleaned up
user_context,
default=True
)
A permanent flag means that when an LLM provider has an outage or you discover a systematic quality problem, you can disable the AI path for all users in under a minute, without a code deploy. That’s worth the minor overhead of maintaining a flag that’s almost always on.
Where to Start
If you’re shipping your first production AI feature without any flag infrastructure:
- Add a simple boolean check to the feature code path. Even a database-backed flag with no percentage rollout is better than nothing (see the sketch after this list).
- Roll out to internal users first, with a shared Slack channel for feedback.
- Add logging for the metrics that matter to your feature — output length, downstream action rates, explicit feedback.
- Set a calendar reminder to expand the rollout in one week if no issues surface.
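For the first item, a minimal sketch of what a database-backed flag can look like, assuming a feature_flags table with name and enabled columns; the SQLite connection and schema are placeholders for whatever your app already uses:

import sqlite3

def flag_enabled(name: str, default: bool = False) -> bool:
    """Single-row lookup; any failure falls back to the safe default."""
    try:
        with sqlite3.connect("app.db") as conn:  # Stand-in for your real database
            row = conn.execute(
                "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
            ).fetchone()
        return bool(row[0]) if row else default
    except Exception:
        return default

if flag_enabled("ai-document-summary"):
    response = generate_ai_summary(document)
else:
    response = generate_rule_based_summary(document)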
The goal isn’t a perfect system. It’s a way to catch problems on 1% of users instead of 100%.
AI features that fail silently are the hardest kind to diagnose after the fact. Flags give you a recovery path and, more importantly, a way to contain the problem while you figure out what happened.