A migration is a UX problem

At TIER my most consequential project was the Datadog-to-Grafana migration. SRE owned the technical delivery: the new stack, the parity protocol, the cutover. I led adoption, which meant getting engineering teams onto the new stack without degrading their ability to do their jobs. The migration saved roughly $1M a year. The adoption side is what taught me the most.

The hard part, on my side at least, was that the engineers did not want to migrate.

Why they pushed back

They had spent years in Datadog. They knew where to click during an incident. They had memorised which dashboards opened first, which views had the right axes, where the log facet they always filtered on was, how the time range picker behaved when they dragged across a spike. None of that was written down anywhere. It lived in their fingers.

A new tool, even an objectively well-configured one, costs them all of that. And it costs them in the moment they can least afford to spend it, which is the first ten minutes of an incident. The savings on the bill were not theirs to celebrate. Slower incident response, though, would absolutely be their problem.

At first I treated adoption as a project-management problem: SRE implements, parity is verified, teams adjust, I track the dependencies. By the third week of doing it that way, what I actually had was a handful of senior engineers who were polite to my face and openly skeptical in their team channels.

Treating it as a UX problem

My wife is a product designer. I had picked up enough from how she talks about her work to recognise that what I was looking at was a UX problem, not a backend one. The engineers using Datadog were the users. The flows they had built were the product. I had been reasoning about the migration like it was infrastructure work, and it wasn’t.

I want to be careful about what I actually did here, because the temptation to dress this up as proper user research is real and I did not run user research. What I did was treat myself as a user. I followed the incident playbooks in Grafana the way an on-call engineer would, and I noticed every step that was slower or more confusing than in Datadog. Then I asked colleagues, casually, the questions that would have been awkward in a scheduled interview but were natural in a Slack DM or at the end of a standup. What do you actually open first, what is missing, how do you do X. Not a study, just paying attention.

What I shipped

A few artefacts came out of that work.

The most important ones were interactive Grafana dashboards for the most-used views, built so that the metric-to-logs pivot took two clicks, drill-downs stayed inside the dashboard, and the time-range defaults matched what engineers were already used to. I specced them and I built them.

Then there were dashboard templates for teams to fork, pre-loaded with the right variables, the log-link panels, and the defaults. A team starting from a template did not have to re-derive the shape of a useful dashboard. They forked, renamed, tweaked, and were running.
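
To make "fork, rename, tweak" concrete, here is a minimal sketch of what a forkable template could look like. It is not the actual template from the migration. The top-level keys (`title`, `time`, `templating`, `panels`, `links`) follow Grafana's dashboard JSON model, but every value is an assumption made for illustration: the `service` variable, the Prometheus-style query, the `logs-template` dashboard UID, the one-hour default range, and the `fork_template` helper itself are all hypothetical.

```python
import copy
import json

# Hypothetical template. Key names follow Grafana's dashboard JSON model;
# all values (titles, queries, UIDs, defaults) are illustrative assumptions.
TEMPLATE = {
    "title": "Service overview (template)",
    "uid": None,  # left empty so Grafana can assign a UID on import
    "time": {"from": "now-1h", "to": "now"},  # assumed default time range
    "templating": {
        "list": [
            {
                "name": "service",
                "type": "custom",
                "query": "example-service",
                "current": {"text": "example-service", "value": "example-service"},
            }
        ]
    },
    "panels": [
        {
            "type": "timeseries",
            "title": "Error rate",
            # Assumed Prometheus-style query, scoped by the dashboard variable.
            "targets": [
                {"expr": 'sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))'}
            ],
            # Panel link that carries the selected service and the current
            # time range over to a (hypothetical) logs dashboard.
            "links": [
                {
                    "title": "Logs for this service and time range",
                    "url": "/d/logs-template/logs?var-service=${service}&from=${__from}&to=${__to}",
                }
            ],
        }
    ],
}


def fork_template(template, team, service):
    """Return a renamed copy of the template with the service variable preset."""
    dashboard = copy.deepcopy(template)  # never mutate the shared template
    dashboard["title"] = f"{team}: {service} overview"
    for variable in dashboard["templating"]["list"]:
        if variable["name"] == "service":
            variable["query"] = service
            variable["current"] = {"text": service, "value": service}
    return dashboard


if __name__ == "__main__":
    # Print JSON that a team could paste into Grafana's dashboard import.
    print(json.dumps(fork_template(TEMPLATE, "payments", "checkout-api"), indent=2))
```

The point of the shape is that the decisions an engineer would otherwise have to rediscover (the default range, the variable, the link from a metric panel to logs) are already made in the template, so a fork only changes the name and the service.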

And finally a set of recorded demos, 1-2 minutes each, walking through “how I would handle X in Grafana.” Watchable on someone’s own time, no meeting required.

What changed when this landed

The pushback eased once the templates and the demos were in front of engineers. They could see that the person running adoption had actually used Grafana the way they would have to, noticed the same friction they would have noticed, and built things to reduce it. The savings on the bill were never going to convince them. The demos that started from “the alert just fired and you have to find out why” did.

I do not think the migration would have made it through the cutover without that change in approach. SRE’s work would have been correct on paper and resented in practice.

The thing I took from this is that adapting to a new tool costs the user time, and it is the user’s time, not yours. If you can absorb part of that cost on their behalf, the migration goes faster. If you cannot, you will get the resistance you have earned.