The first time I worked on a system with serious feature flags, I assumed they were a product thing. A/B tests, gradual launches, beta programs. That use case is real, and it is not why a working engineering team needs them.
The actual reason is deploys. Every non-trivial change goes behind a flag, and the flag is how you control the blast radius separately from the deploy. The code lands on main on Monday with the flag off. Wednesday I turn it on for myself. Friday I turn it on for 1%. Next week, 100%. Two weeks later the flag is deleted. Deploy and rollout are decoupled. That is the whole point and most of the value comes from that one sentence.
Why this beats branches and beats environments
Long-lived branches are how you get merge conflicts. Staging environments are how you get drift between staging and prod. Both are attempts to answer the same question, which is “can I make this code visible to a small group before everyone sees it”, and both answer it badly, because the only environment that actually has production traffic is production.
A flag answers the question properly. The code is already on main, already deployed to production, already exercised by every request. It just does not do anything for users until you flip the flag.
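In code, the pattern is just a guard. A minimal sketch, with hypothetical `price_order`, pricing paths, and a stubbed `is_enabled` (a real implementation appears later in this post):

```python
def is_enabled(flag: str, user_id: str) -> bool:
    # Stub for illustration: the flag is on for one user only.
    return flag == "new_pricing_engine" and user_id == "kg"

def price_order(subtotal: float, user_id: str) -> float:
    # Both paths ship in the same deploy; the flag picks one at runtime.
    if is_enabled("new_pricing_engine", user_id):
        return round(subtotal * 0.9, 2)  # new flow, dark for everyone else
    return subtotal                      # old flow, the default
```

Both branches are on main and in production; the flag decides, per request, which one runs.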
The minimum system that actually works
Most teams reach for LaunchDarkly or something similar before they need it. The minimum version is two tables and one function:
```sql
CREATE TABLE feature_flags (
    name        text primary key,
    enabled     boolean not null default false,
    rollout_pct int not null default 0  -- 0 to 100
);

CREATE TABLE feature_flag_overrides (
    flag_name text references feature_flags(name),
    user_id   text not null,
    enabled   boolean not null,
    primary key (flag_name, user_id)
);
```

```python
def is_enabled(flag: str, user_id: str) -> bool:
    override = _get_override(flag, user_id)
    if override is not None:
        return override
    flag_row = _get_flag(flag)
    if flag_row is None or not flag_row.enabled:
        return False
    if flag_row.rollout_pct >= 100:
        return True
    bucket = hash_to_bucket(f"{flag}:{user_id}", buckets=100)
    return bucket < flag_row.rollout_pct
```

`hash_to_bucket` is `int(hashlib.md5(...).hexdigest(), 16) % 100`. Stable per user, deterministic, no database load. Cache the flag rows for a minute and you have a flag system that costs essentially nothing and supports everything you will actually do week to week.
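That one-liner, spelled out as a function (the bucketing is fully specified by the prose above; only the parameter names are mine):

```python
import hashlib

def hash_to_bucket(key: str, buckets: int = 100) -> int:
    # MD5 is fine here: we need a stable, uniform spread, not security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % buckets
```

Because each user's bucket is fixed, raising `rollout_pct` from 10 to 50 only adds users: anyone with `bucket < 10` also has `bucket < 50`, so nobody who already had the feature loses it mid-rollout.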
This covers the things you reach for: turn a feature on for me only via a feature_flag_overrides row, turn it on for the dogfood team with a few more overrides, roll out to 1%, 10%, 50%, 100% by changing rollout_pct, or hit the kill switch by flipping enabled to false (which takes effect for everyone within your cache TTL).
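Those operations, run against a dict-backed stand-in for the two tables (same semantics as the function above, collapsed into one file so it can be run as-is; the flag name is illustrative):

```python
import hashlib

flags = {}      # name -> {"enabled": bool, "rollout_pct": int}
overrides = {}  # (flag_name, user_id) -> bool

def is_enabled(flag: str, user_id: str) -> bool:
    ov = overrides.get((flag, user_id))
    if ov is not None:
        return ov                        # per-user override wins, even over the kill switch
    row = flags.get(flag)
    if row is None or not row["enabled"]:
        return False                     # unknown flag, or kill switch thrown
    if row["rollout_pct"] >= 100:
        return True
    bucket = int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < row["rollout_pct"]

# Monday: flag lands, off for everyone.
flags["new_pricing"] = {"enabled": True, "rollout_pct": 0}
# Wednesday: on for me only.
overrides[("new_pricing", "kg")] = True
# Next week: roll out to 10%, then 50, then 100.
flags["new_pricing"]["rollout_pct"] = 10
# Kill switch, if needed: flags["new_pricing"]["enabled"] = False
```

Note the precedence: overrides are checked before the enabled bit, so an explicit override survives the kill switch. If you want the kill switch to be absolute, check `enabled` first instead.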
A SaaS gives you targeting rules, segments, scheduled rollouts, audit logs, SDKs in 12 languages. Switch when you actually want those things. Most teams never do.
The discipline
A flag system that works has a few habits attached to it. Every non-trivial change goes behind a flag, by default. The flag has an owner and an expiry date in the code; I keep them in a comment, something like `# flag: new_pricing_engine, owner: kg, remove_by: 2026-07-01`. The PR that turns the flag on is small, usually one line, ideally opened by somebody other than the author. The rollout review is a separate review from the implementation one.
The rule teams skip, every time, is deletion. A flag that has been on at 100% for a month is technical debt. Old flags accumulate, every conditional gets a bit harder to read, and eventually nobody knows which flags are load-bearing any more. The fix is whatever works for you: a calendar reminder, a CI check that fails on flags past their remove_by date, or a recurring half-day that any senior IC can run.
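The CI check is a few lines of grep-and-compare. A sketch, assuming the comment format above (the regex and function name are mine):

```python
import datetime
import re

# Matches comments like: # flag: new_pricing_engine, owner: kg, remove_by: 2026-07-01
FLAG_RE = re.compile(r"#\s*flag:\s*(\w+).*?remove_by:\s*(\d{4}-\d{2}-\d{2})")

def expired_flags(source: str, today: datetime.date) -> list[str]:
    """Return names of flags whose remove_by date has passed."""
    return [
        name
        for name, date_str in FLAG_RE.findall(source)
        if datetime.date.fromisoformat(date_str) < today
    ]
```

Run it over the repo in CI and fail the build when the list is non-empty. The point is not the tooling; it is that a flag past its date becomes somebody's problem automatically instead of nobody's.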
When the simple version isn’t enough
A handful of cases where you do need to do something more.
Flag state on the request hot path at high QPS is fine if you cache for a minute. If you can’t tolerate a one-minute lag you will need a push channel, Redis pubsub or a sidecar of some kind. Most teams can tolerate the lag.
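The one-minute cache is small enough to inline. A sketch, with the clock and the database fetch injected so it can be tested (names are mine):

```python
import time

TTL_SECONDS = 60
_cache = {}  # flag name -> (fetched_at, row)

def get_flag_cached(name, fetch, now=time.monotonic):
    # fetch: callable that hits the database; injected for testability.
    hit = _cache.get(name)
    if hit is not None and now() - hit[0] < TTL_SECONDS:
        return hit[1]          # fresh enough, skip the database
    row = fetch(name)
    _cache[name] = (now(), row)
    return row
```

At high QPS this bounds the database to roughly one read per flag per process per minute, and bounds flag-flip propagation to the TTL, which is the trade the paragraph above describes.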
Cross-service consistency is the one that actually hurts. A flag that flips mid-request can leave service A treating it as new-flow while service B is still on old-flow. Resolve the flag once at the edge and pass the decision along the call chain.
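A sketch of resolve-once-at-the-edge; the header name is hypothetical, and the mechanism (serialize decisions, not flags, into a propagated header) is the point:

```python
import json

FLAG_HEADER = "x-flag-decisions"  # hypothetical header name

def resolve_at_edge(user_id, flag_names, is_enabled):
    # Evaluate every relevant flag exactly once, at the entry point.
    return {name: is_enabled(name, user_id) for name in flag_names}

def outgoing_headers(decisions):
    # Service A forwards the decisions with every downstream call.
    return {FLAG_HEADER: json.dumps(decisions)}

def decisions_from_headers(headers):
    # Service B reads the decision; it never re-evaluates the flag,
    # so a mid-request flip cannot split the flow.
    return json.loads(headers.get(FLAG_HEADER, "{}"))
```

Every service in the call chain sees the same decision for the lifetime of the request, even if the flag changes underneath it.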
Anything user-visible that requires legal sign-off has crossed over from being a deploy tool into being a product feature, and the rules above stop applying.
The boring 50-line version covers everything else. The system itself is the easy part. What does the work is the habit of writing the flag, owning it, and deleting it on time.