Cheaper logs without going dark

At TIER, an SRE thread had been bouncing for weeks. The Datadog bill was uncomfortable, and SRE wanted teams to log less. One of the tech-leads on the receiving end wanted to log everything, and his argument was reasonable. When something broke at 3am, the logs were what he used to fix it. He had debugged enough incidents that way to not want to give them up.

Neither side was really answering the other’s question. I was the stand-in for that tech-lead while he was out of office, which is how the fight ended up on my desk: two days of escalation traffic about an argument I had not been part of and was now apparently representing one side of.

I read through the thread carefully and went looking for the actual disagreement.

The reframe

The tech-lead did not actually want every log line. He wanted the log lines that would help him at 3am when something was broken. Most of the time nothing was broken. Those log lines were worth their cost on the rare day an incident happened. The rest of the time they were noise that cost real money to store.

SRE’s proposal was “drop INFO” or “set the floor to WARN.” That is too coarse. Some INFO logs are useful in incidents, others are routine acknowledgements nobody has ever read, and lifting the floor by a level deletes both.

The question was not “should we log less” but “are we logging the right things.” Once I put it that way, both sides had something they could agree on.

What I built

The first change was the obvious one. A non-trivial chunk of our INFO volume was lines like “request handled, status=200, duration_ms=84.” That is a metric. It was a log line only because someone had written it as one on day one, and nobody had revisited it since. I migrated those to counters and histograms. The dashboards got better. The volume dropped. Nothing was lost, because the question “how many requests succeeded” is a metric question, not a log question.
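
A minimal sketch of that kind of migration, using only the Python standard library so it runs anywhere. The names `record_request`, `requests_total`, `duration_buckets` and the bucket bounds are hypothetical; a real service would send these values through whatever metrics client it already uses rather than keep them in process.

```python
import bisect
from collections import Counter

# Hypothetical histogram bucket upper bounds, in milliseconds.
BUCKETS_MS = [10, 50, 100, 500, 1000]

# Counter of handled requests, keyed by HTTP status code.
requests_total = Counter()

# One slot per bucket, plus a final open-ended slot for durations > 1000 ms.
duration_buckets = [0] * (len(BUCKETS_MS) + 1)


def record_request(status: int, duration_ms: float) -> None:
    """Stand-in for the old 'request handled' log line: two metric updates."""
    requests_total[status] += 1
    # bisect_left returns the index of the first bound >= duration_ms,
    # or len(BUCKETS_MS) when the duration exceeds every bound.
    duration_buckets[bisect.bisect_left(BUCKETS_MS, duration_ms)] += 1


# Each call replaces one log line such as
# "request handled, status=200, duration_ms=84".
record_request(200, 84)
record_request(200, 12)
record_request(500, 1500)
```

The per-request cost becomes two in-memory increments, and "how many requests succeeded" is answered by reading `requests_total[200]` instead of searching log lines.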

The second change was for the lines that genuinely were logs and not metrics in disguise. I wrote a small wrapper that decides at the call site whether the content is worth emitting. The shape:

# Before, scattered across the codebase:
log.info("connection opened", peer=peer)

# After:
log.info_if(is_interesting(peer), "connection opened", peer=peer)

is_interesting was a few dozen lines of plain if statements that knew which content was routine. Connection-acknowledged messages from a known peer, scheduled-job-started messages on the normal cadence, library-internal heartbeats. Routine cases were not logged. Anything unusual still was.
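
A rough sketch of what such a wrapper could look like on top of the standard `logging` module. Everything here is illustrative: the `Log` class, the `KNOWN_PEERS` set and the single rule inside `is_interesting` are assumptions, not the original code, which covered more cases (scheduled jobs, library heartbeats).

```python
import logging

# Hypothetical set of peers whose connections are considered routine.
KNOWN_PEERS = {"10.0.0.5", "10.0.0.6"}


def is_interesting(peer: str) -> bool:
    """Plain-if rule: a connection from a known peer is routine, anything else is not."""
    if peer in KNOWN_PEERS:
        return False
    return True


class Log:
    """Thin wrapper that adds a conditional variant of info()."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger

    def info(self, message: str, **fields: object) -> None:
        # Render keyword fields as key=value pairs after the message.
        rendered = " ".join(f"{key}={value}" for key, value in fields.items())
        self._logger.info("%s %s", message, rendered)

    def info_if(self, condition: bool, message: str, **fields: object) -> bool:
        """Emit at INFO only when condition is true; report whether a line was emitted."""
        if not condition:
            return False
        self.info(message, **fields)
        return True


log = Log(logging.getLogger("service"))

# Routine: known peer, nothing is emitted.
log.info_if(is_interesting("10.0.0.5"), "connection opened", peer="10.0.0.5")
# Unusual: unknown peer, the line is emitted as before.
log.info_if(is_interesting("203.0.113.9"), "connection opened", peer="203.0.113.9")
```

Keeping the decision at the call site means a reviewer sees both the log line and the rule that gates it in the same diff.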

This is not a sampling layer. I had read about tail-based sampling and considered it. For our traffic shape and the team’s appetite for new infrastructure, the content-aware filter was a smaller change with a bigger payoff, and easier to reason about in review.

What it cost and what it saved

For the team I was standing in for, the bill went down by roughly $1,000 a month. Not a dramatic number on its own. The interesting part is that the change generalised.

I demoed it at the next technical leadership meeting. A few other tech-leads asked for the util on the same call, and over the following weeks several teams pulled it into their services. Nobody tracked the aggregate. It was several teams times something on the order of hundreds of dollars a month.

What I would not have done is lift the global log level to WARN. It would have hit the bill, sure, and it would also have made the next 3am incident worse. You only find out which logs you cut too aggressively on the day you actually need them.

I also would not have pushed the whole thing to cold storage. That can work for compliance logs, where the read pattern is “produce this on demand for an auditor.” It does not work for operational logs, where the read pattern is “I am debugging right now and I need this in seconds.” The tech-lead’s logs were operational.

The original SRE conversation had been stuck on “logging is too expensive”, and that framing cannot be resolved because it names no owner. The bill is everyone’s problem and therefore nobody’s. The framing that actually worked was “we are paying for log lines that are not worth their cost.” That has a clear owner, which is the team writing the lines. SRE’s job was to make the cost visible. The team’s job was to bring it down.

The bill went down. The team accepted it because nothing they relied on actually disappeared.