PII-aware logging in structlog

Logging in a service that handles user data has two failure modes. You log too little, can’t debug anything, and end up adding print statements during incidents. Or you log too much, you leak PII, and you discover during the next audit that your logs are full of email addresses.

The fix that has held up for me across services built on the Superlinked framework (our older, open-source product) is a naming convention enforced by a structlog processor. Anything sensitive goes under a pii_ prefix. A processor at the start of the chain drops those keys unless an environment variable explicitly says to keep them.

import structlog

PII_PREFIX = "pii_"


def filter_pii(_logger, _name, event_dict):
    # Drop every key whose name carries the convention prefix.
    return {k: v for k, v in event_dict.items() if not k.startswith(PII_PREFIX)}

That is the entire mechanism. At call sites:

log.info(
    "request handled",
    request_id=request_id,
    duration_ms=duration_ms,
    pii_payload=payload,
    pii_user_email=user.email,
)

In production only request_id and duration_ms survive. Locally I set EXPOSE_PII=1, the processor is removed from the chain, and I see everything.
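The toggle itself can live wherever the chain is assembled. A minimal sketch, assuming the processors are collected in a plain list before being handed to structlog.configure (the assembly code here is my guess; only the EXPOSE_PII name comes from the text):

```python
import os

PII_PREFIX = "pii_"


def filter_pii(_logger, _name, event_dict):
    # Drop every key whose name carries the convention prefix.
    return {k: v for k, v in event_dict.items() if not k.startswith(PII_PREFIX)}


# Sketch: include the PII filter unless the developer opts out locally.
processors = []
if os.environ.get("EXPOSE_PII") != "1":
    processors.append(filter_pii)
# ... evaluate_lazy_arguments, recursion_handler, and the renderers follow here.
```

Removing the processor entirely, rather than branching inside it, keeps the production hot path free of the env-var check.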

The lazy-args processor

Some log values are expensive to compute: the repr of a 50k-element vector, a JSON dump of a large response, that sort of thing. You do not want to pay for them when the log level is going to drop the message anyway. The fix is to allow callable values and call them only at format time:

def evaluate_lazy_arguments(_logger, _name, event_dict):
    # Call any callable value now, at format time; pass everything else through.
    return {k: (v() if callable(v) else v) for k, v in event_dict.items()}

Use it like this:

log.debug(
    "vector computed",
    vector_repr=lambda: str(big_vector),
)

If the message is filtered out (wrong level, or a pii_* key while PII exposure is off), the lambda never runs. Free.

The order really matters here. filter_pii first, evaluate_lazy_arguments second. That way a pii_* lambda gets dropped before it is ever called. I learned this the slow way after a profile showed me lambda-evaluation in a hot loop that was supposed to be filtered out anyway.
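Because both processors are plain functions, the ordering claim is easy to verify in isolation. A standalone check, reusing the two processors from above with a counting callable standing in for an expensive repr:

```python
PII_PREFIX = "pii_"


def filter_pii(_logger, _name, event_dict):
    return {k: v for k, v in event_dict.items() if not k.startswith(PII_PREFIX)}


def evaluate_lazy_arguments(_logger, _name, event_dict):
    return {k: (v() if callable(v) else v) for k, v in event_dict.items()}


calls = []


def expensive_dump():
    calls.append(1)  # record that the cost was paid
    return "a very large repr"


event = {"request_id": "r1", "pii_dump": expensive_dump}

# Correct order: the pii_ key is dropped before its callable can run.
out = evaluate_lazy_arguments(None, "info", filter_pii(None, "info", dict(event)))
assert out == {"request_id": "r1"}
assert calls == []  # never evaluated

# Reversed order: the expensive callable runs, then its result is thrown away.
out = filter_pii(None, "info", evaluate_lazy_arguments(None, "info", dict(event)))
assert calls == [1]  # wasted work
```

The reversed-order run produces the same output dict, which is exactly why the bug hides until a profiler points at it.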

The recursion guard

Once in a while a log call crashes the service because some logged object has a cyclic repr. structlog will happily call str() on it, hit RecursionError, and propagate the exception out of log.info(...). A logger that can crash the program is worse than no logger at all.

import reprlib


def recursion_handler(_logger, _name, event_dict):
    safe_repr = reprlib.Repr().repr
    out = {}
    for k, v in event_dict.items():
        try:
            repr(v)  # probe: raises RecursionError on cyclic reprs
            out[k] = v
        except RecursionError:
            out[k] = safe_repr(v)  # truncated, cycle-safe string fallback
    return out

reprlib.Repr truncates aggressively and is safe against cycles. The cost is paid only on objects that actually misbehave.
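A quick way to watch the fallback fire is a deliberately cyclic pair of objects (the Cyclic class here is invented for the demo; it is not from the original code):

```python
import reprlib


def recursion_handler(_logger, _name, event_dict):
    safe_repr = reprlib.Repr().repr
    out = {}
    for k, v in event_dict.items():
        try:
            repr(v)  # probe: raises RecursionError on cyclic reprs
            out[k] = v
        except RecursionError:
            out[k] = safe_repr(v)  # truncated, cycle-safe string fallback
    return out


class Cyclic:
    def __repr__(self):
        return repr(self.peer)  # delegates to the partner: infinite recursion


a, b = Cyclic(), Cyclic()
a.peer, b.peer = b, a

event = recursion_handler(None, "info", {"obj": a, "n": 3})
assert event["n"] == 3  # well-behaved values pass through untouched
assert isinstance(event["obj"], str)  # cyclic one replaced by a safe string
```

Without the guard, a bare repr(a) at render time would raise RecursionError out of the log call itself.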

The chain

structlog.configure(
    processors=[
        filter_pii,                    # convention enforcement
        evaluate_lazy_arguments,       # cost control
        recursion_handler,             # safety
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
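Since every processor shares the same (logger, method_name, event_dict) signature, the three custom ones can be exercised end to end without configuring structlog at all. A standalone sketch, with the definitions from above repeated so it runs on its own:

```python
import reprlib

PII_PREFIX = "pii_"


def filter_pii(_logger, _name, event_dict):
    return {k: v for k, v in event_dict.items() if not k.startswith(PII_PREFIX)}


def evaluate_lazy_arguments(_logger, _name, event_dict):
    return {k: (v() if callable(v) else v) for k, v in event_dict.items()}


def recursion_handler(_logger, _name, event_dict):
    safe_repr = reprlib.Repr().repr
    out = {}
    for k, v in event_dict.items():
        try:
            repr(v)
            out[k] = v
        except RecursionError:
            out[k] = safe_repr(v)
    return out


event = {
    "request_id": "r1",
    "pii_user_email": "someone@example.com",
    "vector_repr": lambda: "[0.12, -0.53]",  # evaluated only if it survives
}
for proc in (filter_pii, evaluate_lazy_arguments, recursion_handler):
    event = proc(None, "info", event)

assert event == {"request_id": "r1", "vector_repr": "[0.12, -0.53]"}
```

The pii_ key never reaches the lazy evaluator, the surviving lambda is resolved to a string, and the recursion guard passes everything else through unchanged.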

Three custom processors, around 25 lines combined. The ordering between the first two is the part that bit me, and probably the part most worth carrying away.

The naming convention is the bit I would emphasize. pii_ is opt-in. If a developer forgets, the value gets logged in plaintext, which is a real problem. But it is one searchable convention, easy to enforce in review, and cheap to grep for in CI. I will take an opt-in rule that developers actually remember over an allow-list that drifts out of date every other sprint.