Feature Flags for AI Features: Shipping Safely When Outputs Are Non-Deterministic
Rolling back a bad API endpoint takes seconds. Rolling back a bad LLM integration is harder — the damage may already be in your logs, your users' inboxes, or your clients' feeds. Feature flags are how you ship AI features without betting everything on launch day.
Anurag Verma
7 min read
You deployed a new AI feature on Friday. By Monday, it had sent 3,000 support emails with a factual error in the third paragraph. Your monitoring showed a 200 status on every request. Nothing was broken — the model was just saying something wrong in a way that passed all your tests.
Rollback was a git revert and a deploy. But 3,000 emails had already gone out.
Feature flags for AI features are not just deployment tooling. They are risk management for a class of system where correctness is probabilistic and failures are quiet.
Why AI Features Are Different
Standard feature flags manage rollout of deterministic code. Toggle the flag off and the new code path stops executing. If you shipped a bug, it stops executing too. The damage is bounded by the time between deploy and rollback.
AI features fail differently:
Silent quality degradation: The feature keeps working — it just produces outputs that are wrong, off-brand, or unhelpful. Your error rate stays at zero. Only users or a human reviewer notices.
Non-reproducible failures: An LLM will give different outputs to the same input across calls. If a user complains, you can’t reliably reproduce what they saw. The failure might not happen again in testing.
Action-based consequences: If your AI feature sends emails, posts content, modifies data, or calls APIs, a bad output has effects outside your system. Those effects persist after rollback.
Feature flags don’t prevent these failures. They reduce the blast radius — fewer users affected, less time before you catch a problem, faster path to mitigation.
The Basic Setup
Most teams use a feature flag service like LaunchDarkly, Unleash, or a simple database-backed implementation. The pattern for an AI feature:
import ldclient
from ldclient import Context

ld_client = ldclient.get()  # Assumes ldclient.set_config(...) ran at startup

def get_ai_summary(document: str, user_id: str) -> str:
    user_context = Context.builder(user_id).kind("user").build()
    use_ai_summary = ld_client.variation(
        "ai-document-summary",
        user_context,
        default=False  # Fallback to non-AI version
    )
    if use_ai_summary:
        return generate_ai_summary(document)
    else:
        return generate_rule_based_summary(document)  # Existing behavior
The default is False. If the flag service is unavailable, users get the old behavior. This is important — flag failures should fail safe.
Percentage Rollout
The core technique for AI features: roll out to a small percentage first, watch what happens, then expand.
# LaunchDarkly handles percentage rollout in the flag config
# But you can also do it manually:
import hashlib

def should_use_ai_feature(user_id: str, rollout_percentage: int) -> bool:
    """Consistent percentage rollout — same user always gets same result."""
    hash_value = int(hashlib.md5(
        f"ai-feature-v2:{user_id}".encode()
    ).hexdigest(), 16)
    return (hash_value % 100) < rollout_percentage

# Start at 1%
if should_use_ai_feature(user_id, rollout_percentage=1):
    response = generate_ai_response(query)
else:
    response = generate_legacy_response(query)
The hash ensures consistency: the same user always hits the same code path, which is important for reproducible behavior and for A/B comparisons.
Typical rollout schedule for a high-stakes AI feature:
- Day 1: 1% (internal users, beta testers)
- Day 3: 5% if no quality issues surfaced
- Day 7: 20%
- Day 14: 50%
- Day 21: 100%
The exact timeline depends on volume. At 10,000 requests/day, 1% gives you 100 real-world examples to evaluate by end of day. At 100 requests/day, you need a higher initial percentage or a longer ramp.
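One way to keep the ramp explicit in code rather than in someone's head is to derive the percentage from days since launch. A minimal sketch of that idea, reusing should_use_ai_feature from above; the dates and steps mirror the schedule listed here and are assumptions to adjust for your own volume:

from datetime import date
from typing import Optional

# Days since launch -> rollout percentage, mirroring the schedule above
ROLLOUT_SCHEDULE = [(0, 1), (3, 5), (7, 20), (14, 50), (21, 100)]
LAUNCH_DATE = date(2025, 6, 2)  # Hypothetical launch date

def current_rollout_percentage(today: Optional[date] = None) -> int:
    days_live = ((today or date.today()) - LAUNCH_DATE).days
    percentage = 0
    for day, pct in ROLLOUT_SCHEDULE:
        if days_live >= day:
            percentage = pct
    return percentage

# Combined with the hash-based check above. Expansion should still be gated
# on the quality checks described later, not just the calendar.
if should_use_ai_feature(user_id, current_rollout_percentage()):
    response = generate_ai_response(query)
else:
    response = generate_legacy_response(query)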
Canary by User Segment
Percentage rollout is blunt. For better control, roll out to specific segments first:
from random import random

def resolve_ai_flag(user: User) -> bool:
    # Internal users first
    if user.is_internal:
        return True
    # Beta program participants
    if user.beta_opt_in:
        return ld_client.variation("ai-feature", user.context, False)
    # Enterprise tier (high-value, direct feedback channel)
    # 5% random sample; use the hash-based check above if you need per-user consistency
    if user.tier == "enterprise" and random() < 0.05:
        return True
    return False
Internal users catch obvious failures before external users see them. Beta users have opted in to seeing new features and often provide explicit feedback. Enterprise users are watched closely — any issues surface quickly through account managers.
This sequence gives you staged quality gates before broad rollout.
What to Monitor During Rollout
Standard API monitoring (error rates, latency) is necessary but not sufficient for AI features. Add these:
Output length distribution: AI-generated content that’s suddenly 5x longer or 50% shorter than baseline is often a signal of a prompt regression or model change.
User engagement signals: For content generation — do users edit the AI output before using it? High edit rates mean the AI output isn’t quite right, even when users don’t complain explicitly.
Downstream action rates: If the AI feature generates something that triggers a downstream action (user clicks, confirms, shares), compare rates between flag-on and flag-off groups. A drop suggests output quality degraded.
Explicit feedback: Add a minimal feedback signal near AI-generated content. Even a thumbs down button gives you a noisy but real quality signal.
from datetime import datetime

# analytics and current_model are your existing analytics client and model identifier
def track_ai_output(user_id: str, output: str, flag_variant: str):
    analytics.track("ai_output_generated", {
        "user_id": user_id,
        "flag_variant": flag_variant,  # "control" or "treatment"
        "output_length": len(output),
        "output_word_count": len(output.split()),
        "model": current_model,
        "timestamp": datetime.utcnow().isoformat()
    })
Compare these metrics between flag variants during rollout. Divergence is a signal — investigate before expanding.
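What counts as divergence depends on the feature, but the comparison itself is mechanical. A rough sketch, assuming you can query the events tracked above back out of your analytics store; query_metrics and alert are hypothetical helpers standing in for your own tooling:

def compare_variants(feature: str, window_hours: int = 24) -> None:
    """Flag large control/treatment gaps on the metrics tracked above."""
    control = query_metrics(feature, variant="control", hours=window_hours)
    treatment = query_metrics(feature, variant="treatment", hours=window_hours)

    checks = {
        "avg_output_length": (control.avg_output_length, treatment.avg_output_length),
        "downstream_action_rate": (control.action_rate, treatment.action_rate),
        "thumbs_down_rate": (control.thumbs_down_rate, treatment.thumbs_down_rate),
    }
    for name, (baseline, candidate) in checks.items():
        # 25% relative divergence is an arbitrary starting threshold
        if baseline and abs(candidate - baseline) / baseline > 0.25:
            alert(f"{feature}: {name} diverged {baseline:.2f} -> {candidate:.2f}")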
Automatic Kill Switches
Manual monitoring requires someone to be watching. For production AI features with real-world consequences, add automatic kill switches:
from dataclasses import dataclass

@dataclass
class AISafetyMonitor:
    max_error_rate: float = 0.05
    max_avg_output_length: int = 5000
    window_minutes: int = 15

    def should_disable_feature(self, feature_name: str) -> bool:
        # get_recent_metrics and log_and_alert are hooks into your own
        # metrics store and alerting; implement them for your stack
        metrics = self.get_recent_metrics(feature_name, self.window_minutes)
        if metrics.error_rate > self.max_error_rate:
            self.log_and_alert(f"High error rate: {metrics.error_rate:.1%}")
            return True
        if metrics.avg_output_length > self.max_avg_output_length:
            self.log_and_alert(f"Abnormal output length: {metrics.avg_output_length}")
            return True
        return False

# In your request handler
monitor = AISafetyMonitor()
if monitor.should_disable_feature("ai-document-summary"):
    # Emergency disable — bypass flag, use safe fallback
    response = generate_rule_based_summary(document)
else:
    response = generate_ai_summary(document) if flag_enabled else generate_rule_based_summary(document)
This is a last-resort circuit breaker. Set the thresholds conservatively — they should trigger on genuinely anomalous behavior, not normal variance.
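One practical detail: you probably don't want to hit your metrics store on every request. A small sketch of caching the decision for a short interval; the interval and helper names are assumptions, not part of the monitor above:

import time

_KILL_CACHE: dict[str, tuple[float, bool]] = {}
CHECK_INTERVAL_SECONDS = 60  # Assumed re-check interval

def feature_is_killed(feature_name: str) -> bool:
    """Re-evaluate the safety monitor at most once per interval."""
    now = time.monotonic()
    cached = _KILL_CACHE.get(feature_name)
    if cached is not None and now - cached[0] < CHECK_INTERVAL_SECONDS:
        return cached[1]
    disabled = monitor.should_disable_feature(feature_name)
    _KILL_CACHE[feature_name] = (now, disabled)
    return disabled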
Model Version Flags
When your LLM provider releases a new model version, treat it like any other AI feature change: put it behind a flag.
def get_model_for_request(user_id: str, feature: str) -> str:
    user_context = Context.builder(user_id).kind("user").build()
    model_flag = ld_client.variation(
        f"{feature}-model-version",
        user_context,
        default="gpt-4o-2024-11"  # Stable baseline
    )
    return model_flag  # Returns e.g. "gpt-4o-2025-05" for the treatment group
This lets you test a new model version on real traffic against your actual prompts, with easy rollback if quality degrades. Model providers sometimes change behavior in minor versions without announcing it explicitly. A flag gives you the data to detect this before it affects all users.
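The returned string plugs straight into whatever client you already use. A minimal sketch with an OpenAI-style client; the wrapper function, system prompt, and client setup are placeholders rather than anything prescribed above:

from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def generate_summary_for(user_id: str, document: str) -> str:
    model = get_model_for_request(user_id, feature="ai-document-summary")
    response = client.chat.completions.create(
        model=model,  # The flag decides which version this user gets
        messages=[
            {"role": "system", "content": "Summarize the document in three sentences."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content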
The Permanent Flag Pattern
Most feature flags are temporary — they exist until the feature is fully rolled out, then they’re deleted. For AI features with real-world consequences (email sending, content publishing, data modification), consider making the kill switch permanent:
# This flag is never "fully on" — it stays in your flag dashboard forever
AI_EMAIL_DRAFTING_ENABLED = ld_client.variation(
"ai-email-drafting", # Never cleaned up
user_context,
default=True
)
A permanent flag means that when an LLM provider has an outage or you discover a systematic quality problem, you can disable the AI path for all users in under a minute, without a code deploy. That’s worth the minor overhead of maintaining a flag that’s almost always on.
Where to Start
If you’re shipping your first production AI feature without any flag infrastructure:
- Add a simple boolean check to the feature code path. Even a database-backed flag with no percentage rollout is better than nothing (see the sketch after this list).
- Roll out to internal users first, with a shared Slack channel for feedback.
- Add logging for the metrics that matter to your feature — output length, downstream action rates, explicit feedback.
- Set a calendar reminder to expand the rollout in one week if no issues surface.
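For the first item, a minimal sketch of what a database-backed flag can look like, assuming a feature_flags table with name and enabled columns; the SQLite connection and schema are placeholders for whatever your app already uses:

import sqlite3

def flag_enabled(name: str, default: bool = False) -> bool:
    """Single-row lookup; any failure falls back to the safe default."""
    try:
        with sqlite3.connect("app.db") as conn:  # Stand-in for your real database
            row = conn.execute(
                "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
            ).fetchone()
        return bool(row[0]) if row else default
    except Exception:
        return default

if flag_enabled("ai-document-summary"):
    response = generate_ai_summary(document)
else:
    response = generate_rule_based_summary(document)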
The goal isn’t a perfect system. It’s a way to catch problems on 1% of users instead of 100%.
AI features that fail silently are the hardest kind to diagnose after the fact. Flags give you a recovery path and, more importantly, a way to contain the problem while you figure out what happened.