Ten minutes to two seconds

Two years ago, when I joined Superlinked, the main example notebook for the Superlinked framework took ten minutes to run end-to-end. It was the notebook we pointed new users at; we called it the Netflix example. Above the slow cell an MLE had embedded a 10-minute music video so users on Google Colab had something to do while they waited. By the end of all the optimisation work, the cell ran in two seconds and the video was gone.

The notebook was the benchmark, but the same code paths ran in production against real customer data. So the optimisation was for users, not just for our demo.

The optimisations happened over the better part of a year. The engineering was not really the hard part. Convincing the rest of the team that this was worth meaningful effort was. Most of what follows is technical, but the order in which I shipped fixes was driven as much by what the team would believe next as by what the profiler said next.

The notebook code itself was not rewritten to be faster; most of it did not change at all. What changed was the framework underneath it, through a sequence of small fixes, each one driven by re-running the profiler and going after whatever was on top.

I had no strong predictions going in. I just opened the profile and started fixing the largest visible cost.

The workload

The notebook ingests a synthetic dataset through the framework’s DAG: parse JSON inputs, push them through several embedding-model nodes, combine the resulting vectors in downstream nodes, store them. The same shape of workload, on different data, ran in production.

# profile the ingestion run, then open the profile and inspect the top entries
python -m cProfile -o ingest.prof notebooks/example_ingestion.py
python -m pstats ingest.prof <<< $'sort cumulative\nstats 30'

Iteration 1: a cache, before any restructuring

The quickest win, and the one I did first because I needed a quick win. Several DAG nodes were re-deriving the same intermediate values every time a downstream node asked for them. A small per-run memoisation layer, keyed by node identity and input hash, removed the duplicate calls.
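A minimal sketch of the shape of that cache, with illustrative names rather than the framework's actual types:

import hashlib
import json

class NodeResultCache:
    # Per-run memoisation of node outputs, keyed by node identity and input hash.
    # Illustrative sketch only; the real node and result types differ.

    def __init__(self):
        self._store = {}

    def _key(self, node_id, node_input):
        # Hash a stable serialisation of the input so equal inputs share a key.
        digest = hashlib.sha256(
            json.dumps(node_input, sort_keys=True, default=str).encode()
        ).hexdigest()
        return (node_id, digest)

    def get_or_compute(self, node_id, node_input, compute):
        key = self._key(node_id, node_input)
        if key not in self._store:
            self._store[key] = compute(node_input)
        return self._store[key]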

A few hours of work and a chunk off the runtime. This was as much a political move as a technical one. Going from “Krisztian thinks this is slow” to “Krisztian shipped a measurable improvement, here is the new number, here is the diff” changes the conversation a lot. The structural rewrites later on (iterations 4 and 5) would have been a hard sell on day one. After the team had already reviewed and merged a string of smaller wins, they were approvable.

Iteration 2: jsonpath-ng to ujson

The biggest single win in the whole project. The framework was using jsonpath-ng to load and access fields in the input JSON. jsonpath-ng is a JSONPath query library that compiles path expressions and walks the parsed tree against them, which involves a fair amount of regex on the path strings.

It had been added before I joined, back when the team genuinely needed JSONPath queries to pull fields out of nested, irregular customer payloads. By the time I got there the input shape had stabilised and the code was just reaching for fields by known names. The features were not being used any more. The dependency still was. Nobody had revisited it.

I swapped the load path to ujson and accessed fields directly. A few-line change. Several minutes off the runtime.
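Roughly the shape of the change, with a made-up payload and field names (the real paths and schemas differ):

import json
import ujson
from jsonpath_ng import parse

raw = '{"item": {"description": "a long plot summary"}}'

# Before: compile a JSONPath expression and walk the parsed tree with it.
expr = parse("item.description")
value_before = [match.value for match in expr.find(json.loads(raw))][0]

# After: parse once with ujson and index the known field directly.
value_after = ujson.loads(raw)["item"]["description"]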

The reason a library gets pulled in usually goes away long before the library itself does. Dependencies have to be audited deliberately.

Iteration 3: model loading, per-node to shared registry

The framework loads embedding models inside DAG nodes. For historical reasons, several nodes that needed the same model were each loading their own copy. Same weights, multiple times, in sequence.

The fix was to introduce a shared model registry with load-once / reuse semantics, and to take the model-loading responsibility off the individual nodes. The nodes now ask the registry for a handle.
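The registry itself is a small thing. A sketch of the load-once / reuse shape, with an assumed loader callable and illustrative names:

class ModelRegistry:
    def __init__(self, loader):
        # `loader` stands in for whatever actually constructs the model,
        # e.g. a SentenceTransformer constructor; an assumption for illustration.
        self._loader = loader
        self._models = {}

    def get(self, model_name):
        # The first caller pays the load cost; later callers get the same handle.
        if model_name not in self._models:
            self._models[model_name] = self._loader(model_name)
        return self._models[model_name]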

Obvious in hindsight. It existed because the DAG was originally written with one model and then grew without anyone revisiting that assumption.

Iteration 4: batch evaluation across the DAG

Once the small fixes stopped reshaping the profile, the structural one became visible. The DAG was being walked node-by-node with each node processing one input at a time. Embedding models, like most ML inference paths, are much more efficient on batches.

Before I proposed restructuring the evaluator, I shipped a smaller version of the same idea: a 100-line async request coalescer that held individual model calls for a few milliseconds, accumulated whatever arrived, and dispatched them as a single batch. Callers still wrote await evaluator.evaluate(item) one at a time. The coalescer turned that into batched calls underneath. It was a contained change with no API impact and it gave us most of the batching throughput win without touching the evaluator. Importantly for what came next, it gave the team a number that proved batching mattered before I asked them to sign off on a structural rewrite.
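A stripped-down sketch of the idea; names and details here are mine, not the framework's:

import asyncio

class RequestCoalescer:
    def __init__(self, encode_batch, window_seconds=0.005):
        # encode_batch: a callable taking a list of items and returning a list
        # of results in the same order (the underlying batched model call).
        self._encode_batch = encode_batch
        self._window = window_seconds
        self._pending = []
        self._flush_task = None

    async def evaluate(self, item):
        # Callers still await one item at a time; batching happens underneath.
        future = asyncio.get_running_loop().create_future()
        self._pending.append((item, future))
        if self._flush_task is None:
            self._flush_task = asyncio.create_task(self._flush_after_window())
        return await future

    async def _flush_after_window(self):
        await asyncio.sleep(self._window)          # let concurrent callers pile in
        batch, self._pending = self._pending, []
        self._flush_task = None
        results = self._encode_batch([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

Anything that arrives within the same few-millisecond window ends up in one underlying batch call, while every caller still sees a single awaited result.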

After the coalescer had been in production long enough to be uncontroversial, I rewrote the evaluator to collect inputs along a level of the DAG and dispatch them to each node in batches. Several hundred lines of change, the largest behavioural change in the framework’s runtime, and a substantial chunk off the runtime on its own. The coalescer was retired in the same release: with the evaluator handling batching natively, the standalone primitive was no longer pulling its weight.
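Reduced to its core, and under simplifying assumptions (single-parent nodes, each node exposing a batched evaluate method; the real evaluator handles much more), the new traversal looks roughly like this:

def evaluate_dag_batched(levels, items):
    # `levels` is a topological ordering of the DAG grouped by depth;
    # `items` is the full batch of parsed inputs. Both names are illustrative.
    outputs = {None: items}                    # parent node -> its batch of outputs
    for level in levels:
        for node in level:
            parent_outputs = outputs[node.parent]                 # None for root nodes
            outputs[node] = node.evaluate_batch(parent_outputs)   # one batched call per node
    return outputs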

More importantly, the rewrite set up iteration 5.

Iteration 5: deduplicate within a batch

With batched evaluation in place, the profile showed a new pattern. The same inputs were being encoded multiple times within a single batch, because different DAG nodes happened to feed the same value into the same model.

The fix was to deduplicate at batch-dispatch time. Collect what each node wanted encoded. Take the union. Encode each unique value once. Fan the results back out to the nodes that asked for them.
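In sketch form, again with illustrative names:

def encode_deduplicated(requests, encode_batch):
    # `requests` maps each node to the list of values it wants encoded;
    # `encode_batch` is the underlying batched model call.
    # Assumes the values are hashable.
    unique_values = list(dict.fromkeys(
        value for values in requests.values() for value in values
    ))
    vectors = dict(zip(unique_values, encode_batch(unique_values)))  # encode each value once

    # Fan the shared results back out in each node's original order.
    return {node: [vectors[value] for value in values]
            for node, values in requests.items()}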

This was not possible before iteration 4. Batching had to exist before deduplication of batched inputs had anywhere to live. This is the only “each iteration unlocked the next” claim I would actually defend on this project. The others were independent.

Iteration 6: repr in the log path

The remaining time was inside the logger. structlog calls str() or repr() on every value passed into a log call. Some of our DAG nodes carried values whose default __repr__ was expensive, and those __repr__ calls were running even when the log message was below the active level and would be dropped.

I wrote about the lazy-evaluation processor that solves the general case in PII-aware logging in structlog. For this notebook specifically, I also rewrote a few of the framework’s own custom __repr__ methods to be cheap by default and only do the expensive thing when something explicitly asked for it.
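The __repr__ side of the fix looked roughly like this, with the class name and fields invented for illustration:

class NodeValue:
    # Hypothetical stand-in for a framework value whose default repr was expensive.
    def __init__(self, vector, payload):
        self.vector = vector
        self.payload = payload

    def __repr__(self):
        # Cheap by default: identity and size, never a full dump of the contents.
        return f"<NodeValue dim={len(self.vector)}>"

    def describe(self):
        # The expensive, fully detailed rendering, only on explicit request.
        return f"NodeValue(vector={self.vector!r}, payload={self.payload!r})"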

A smaller win than iterations 2 or 4, but it cleaned up the long tail.

What the order actually taught me

Two lessons, one technical and one political, and the political one was harder.

The technical one first. The original profile pointed at JSON parsing and model loading as the two largest costs. Both were real and both got fixed. What I would not have predicted from that first profile is that batched evaluation across the DAG (iteration 4) was a bigger win than fixing model loading (iteration 3), and deduplication within batches (iteration 5) was not visible at all until batching existed. The right strategy turned out to be “fix the largest visible cost, then re-profile.” The plan I would have written from the original profile would have stopped after iteration 3 and missed the two biggest structural wins. The optimisation plan you write at the start is just wrong by iteration 3.

The political one I had actually learned on an earlier project. I had been delivering my planned work and shipping on time, and on the side I was building a substantial rewrite of a system I thought needed it. When I opened the PR, the code was good. The team's reaction was not. The problem was not that I had been absent; I had not been. The problem was that significant design decisions had been happening in the background that nobody had a chance to push back on, and a finished rewrite is much harder to argue with than a proposal. People felt routed around. The fact that the rewrite was technically correct made the social damage worse, not better. There is no benchmark that recovers from "you did not tell me you were doing this."

So this time I did the inverse. Visible quick wins first, each as a normal PR with a normal review and a number the team could watch move. The structural rewrites came after several smaller wins had already shipped. Each visible win compounded the case for the next change. By the time I proposed reworking the evaluator, I was not pitching a rewrite, I was just extending a sequence the team had been part of for months.

A 300x speedup looks heroic in a one-line summary. The actual experience was unglamorous and slower than it sounds. Six small-to-medium PRs across the better part of a year, none of which required cleverness, each driven by whatever the profiler had on top and by what the team was ready to merge next. What needed thought was the order, in both senses of the word.