A year into Superlinked, I picked up a customer migration that had been declined twice. The customer ran a recommendation system on a Redis vector database, on GCP credits that expired on November 3. Past that date their service would just go dark. Two earlier internal reviews had each concluded the project could not be done in the time available.
I took it on, shipped it on the deadline, and the customer is still with us. This post is about what actually unblocked it.
The setup
The customer ran a candidate-and-job recommendation platform in the climate-tech space. Their existing stack was a Redis vector database on GCP, ingested from Postgres, with custom application logic on top. The agreed-upon migration plan was that the customer would export their full dataset to JSON files on GCS and then call our ingestion endpoints to populate the new system. We had benchmarked that path already and it would not finish in time.
The hard deadline was the credits expiration. Their engineer was nervous and he had said so. He had every reason to be. The earlier reviews had each looked at the project as a single object, “replicate the existing functionality in four weeks”, and said no. That was actually the right answer to the question they had been asked.
What I gave up trying to fix in time
A couple of real problems were not going to be solvable before the deadline, and pretending otherwise was the trap I had to avoid.
The first one was ingestion throughput. Our deployment had a known issue with large ingestion payloads. Reducing the batch size to 10 made the pipeline stable but slow, slow enough that running the customer’s full migration through it would have missed the deadline. The underlying fix was real work, not a one-week patch.
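For a sense of what “stable but slow” meant, the mitigated path was essentially a serial loop over small batches, roughly like the sketch below. The endpoint URL, payload shape, and helper name are placeholders, not our real API; only the batch size of 10 comes from the actual mitigation.

```python
import requests

# Placeholder endpoint; the real deployment exposed its own ingestion API.
INGEST_URL = "https://customer-deployment.example/ingest"
BATCH_SIZE = 10  # larger payloads tripped the known ingestion issue

def ingest(records: list[dict]) -> None:
    """Push records through the production ingestion path in small, serial batches."""
    for start in range(0, len(records), BATCH_SIZE):
        batch = records[start:start + BATCH_SIZE]
        resp = requests.post(INGEST_URL, json={"records": batch}, timeout=30)
        resp.raise_for_status()
```

Serial batches of ten put a hard ceiling on throughput, and the full historical dataset was never going to fit under that ceiling before November 3.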
The second was the Grafana logging UI. The customer wanted dashboards on top of their ingestion logs in the new deployment. Real value, but not on the critical path for going live. I told them honestly that the UI was not landing that week and offered an interim endpoint they could query for log access. They agreed it was not blocking.
I also negotiated a smaller version of one API. The version they had originally asked for would have taken a week to design properly, and the simpler one covered the case they actually needed. That trade was the easy one. The hard ones were the two above, because telling a customer “we cannot solve the ingestion problem before your deadline” sounds suspiciously like the start of a missed deadline.
What actually got the ingestion done
The unblock was running the migration outside the production deployment.
The customer agreed to give us direct access to their data, instead of going through the export-to-JSON-then-call-our-endpoints path we had originally agreed on. With that access, I wrote one-time migration scripts and deployed them on several GCP instances under our own account, configured at a comfortable rate per instance, running in parallel with retry. The migration ran through that side channel.
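For illustration, one of those workers looked roughly like the sketch below: each GCP instance owns a shard of the source data, reads it directly, and writes it into the new system at a fixed pace with retries. The table name, sharding scheme, environment variables, and the write path are all assumptions; the read side is shown against Postgres, though it could just as easily have been the exported data.

```python
import os
import time
import psycopg2

SHARD_ID = int(os.environ["SHARD_ID"])        # which slice of the data this instance owns
SHARD_COUNT = int(os.environ["SHARD_COUNT"])  # total number of parallel GCP instances
BATCH_SIZE = 500   # the side channel was not bound by the endpoint's payload limits
RATE_DELAY = 0.1   # per-instance pacing, tuned to a comfortable rate
MAX_RETRIES = 5

def fetch_shard(conn):
    """Stream only the rows this instance owns, straight from the customer's source data."""
    with conn.cursor(name="migration") as cur:  # server-side cursor keeps memory flat
        cur.execute(
            "SELECT id, payload FROM candidates WHERE id %% %s = %s",
            (SHARD_COUNT, SHARD_ID),
        )
        while rows := cur.fetchmany(BATCH_SIZE):
            yield rows

def write_batch(rows) -> None:
    """Direct write into the new system; the actual target and client are omitted here."""
    ...  # e.g. bulk-write into the new store, bypassing the slow endpoint path

def write_with_retry(rows) -> None:
    for attempt in range(MAX_RETRIES):
        try:
            write_batch(rows)
            return
        except Exception:
            time.sleep(2 ** attempt)  # simple exponential backoff on transient failures
    raise RuntimeError("batch failed after retries")

if __name__ == "__main__":
    with psycopg2.connect(os.environ["SOURCE_DSN"]) as conn:
        for rows in fetch_shard(conn):
            write_with_retry(rows)
            time.sleep(RATE_DELAY)
```

Nothing clever: shard, pace, retry, and run enough instances in parallel that the whole dataset clears before the deadline.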
The ingestion bug was real and we did not fix it. We just moved the migration off the path that hit it. The production deployment was still responsible for the steady-state ingestion the customer would do after cutover. The one-time historical migration was a separate pipeline that ran once, on infrastructure I controlled, and got thrown away after.
The customer agreeing to give us that data access was the most important conversation in the whole project. Without it the only path was through our endpoints, and that path did not finish in time.
The final week was bugs
By the time the customer started testing in staging, the structural decisions had all been made. The week was small fixes. Several boolean filters had defaults of True instead of None, which produced empty results whenever filters were removed. Two date-filter parameters were typed as strings instead of integers. Production credentials needed rotating and sharing by DM.
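The boolean-default bug is the usual three-state trap: None should mean “no filter”, while a default of True silently applies a filter the caller never asked for. A hypothetical reduction of the mistake (the remote_only field and job shape are made up for illustration):

```python
from typing import Optional

# Buggy: an omitted filter silently behaves like remote_only=True, which is
# how "removing a filter" still applied it and came back with empty results.
def apply_filters(jobs: list[dict], remote_only: Optional[bool] = True) -> list[dict]:
    if remote_only is not None:
        jobs = [j for j in jobs if j["remote"] == remote_only]
    return jobs

# Fixed: None is the explicit "no filter" state; True and False are real choices.
def apply_filters_fixed(jobs: list[dict], remote_only: Optional[bool] = None) -> list[dict]:
    if remote_only is not None:
        jobs = [j for j in jobs if j["remote"] == remote_only]
    return jobs
```

The string-versus-integer date parameters were the same genre of small, cutover-blocking fix.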
Any one of those would have blocked the cutover. They were survivable because the deeper questions had been settled three weeks earlier, and the team was not also rewriting an endpoint or arguing about scope on the same day. We had not promised the deadline would carry every feature. We had promised it would carry the cutover.
What shipped
The customer went live in production on the evening of November 3. The migration completed inside the window we had benchmarked. They sent a thank-you message that night.
The Grafana UI shipped a couple of weeks later. The ingestion bug got the proper fix in the next sprint. The simpler API I had negotiated turned out to be enough, and the more elaborate version never got built.
What I took from it
Two earlier reviews had said the project was infeasible, and they were right about the version of it they had been asked to evaluate. It became feasible because of two things.
The first was deciding, in the first week and in writing, which problems we were not going to solve before the deadline. There is a strong instinct to ship the proper fix and feel bad about proposing a workaround. In our case the workaround was the right answer, and proposing it openly was what actually bought the deadline.
The second was asking the customer to change the agreed-upon migration path so we could run it under our own infrastructure with direct access to their data. That ask is harder to make than it should be, because it sounds like you are asking for a favour. It was not a favour. It was the difference between shipping and not shipping.
There was nothing technically novel in this project, honestly. The hard part was being open about the things I could not do in time, and asking for the access I needed to route around them.