Generally good things are generally good

The project is the Superlinked Platform, the multi-service system that hosts vector-search use cases for our customers. A few services matter for this post: change-api is the ingestion edge, consumers read from Kafka and run customer-specific indexing logic, and query-api serves search. ML engineers write the indexing and query code per customer, integrators wire it into the customer’s existing systems, and most of the design tension on the platform lives between those two roles.

change-api itself is a thin HTTP layer in front of Kafka that integrators use to push items into the platform. From the start we kept it customer-agnostic: the payload was a dict[str, Any] with a type_ field, and adding a new customer never required redeploying change-api. This is the textbook good idea. Generic producers, customer logic in the consumer, fewer moving pieces.
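To make the shape of that design concrete, here is a minimal sketch of what the customer-agnostic handler amounted to. This is illustrative, not our actual code: the function name, the response shape, and the commented-out Kafka call are all invented for the example. The point is what the edge checks, which is almost nothing.

```python
from typing import Any

def handle_push(payload: dict[str, Any]) -> dict[str, str]:
    """Hypothetical customer-agnostic edge handler.

    Accepts any JSON object, checks only that a type_ field is present,
    and forwards the rest to Kafka untouched. No per-customer schema
    lives here, so the OpenAPI spec can only say "dict".
    """
    if "type_" not in payload:
        raise ValueError("payload must carry a type_ field")
    # produce_to_kafka(payload)  # the real handler would publish here
    return {"status": "accepted", "type_": payload["type_"]}
```

Anything with a type_ field sails through: handle_push({"type_": "product", "anything": "goes"}) is accepted, and whether "anything" means something is only discovered downstream.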

And it was wrong for us. Integrators on customer projects kept hitting the same wall. They could not tell from the OpenAPI spec what payload was expected, because there were no schemas at the edge, just a dict. They would send something, the API would accept it, and then the consumer would fail somewhere downstream with a validation error they had no real-time visibility into. To learn the contract they had to either read the consumer source for the exact version we had deployed, or fire payloads at it and reverse-engineer the errors. Neither of those is a thing you should have to do to integrate.

I wrote an RFC that undoes the original design. change-api now imports the customer’s schema and validates at the edge. It is no longer customer-agnostic. Adding a customer means redeploying change-api. We took that cost on purpose, because the alternative was paid by every integrator on every change, and the integrators were the people we were trying to help.
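A sketch of the new direction, again with invented names: in reality the schemas would likely be pydantic models imported from the customer’s package, but a dataclass plays the schema here so the example stays self-contained. ProductEvent, its fields, and the SCHEMAS registry are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ProductEvent:
    """Stand-in for a customer-owned schema imported into change-api."""
    id: str
    name: str
    price: float

# type_ value -> schema class; extended per customer, which is exactly
# why adding a customer now means redeploying change-api.
SCHEMAS: dict[str, type] = {"product": ProductEvent}

def handle_push(payload: dict[str, Any]) -> ProductEvent:
    """Validate at the edge: unknown types and malformed payloads fail
    here, with a clear error, instead of deep in a consumer the
    integrator has no real-time visibility into."""
    event_type = payload.pop("type_", None)
    schema = SCHEMAS.get(event_type)
    if schema is None:
        raise ValueError(f"unknown type_: {event_type!r}")
    # Missing or unexpected fields raise immediately at construction.
    return schema(**payload)
```

The side effect integrators actually wanted falls out for free: once concrete models sit at the edge, the OpenAPI spec can document the real payloads instead of dict[str, Any].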

“Keep producers generic” is correct in most contexts. Ours was not most contexts. We had two customers, a small platform team, and integrators who needed the schema visible at the edge a lot more than we needed deployment independence from them. Adopting the principle without checking which side paid the cost was the mistake.

The same RFC also goes in the other direction in one place. The request-reply piece, where an integrator wants to optionally wait for the result of a push, is the kind of thing a team can talk itself into building from scratch. Correlation IDs, reply topics, timeout handling, ack semantics. A week of work and then maintenance forever. We used FastStream, which has the pattern built in. The example in their docs basically fits our use case as-is. Nothing custom.
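For a sense of what “building it from scratch” means, here is a schematic of the machinery a hand-rolled version would own, which is exactly what FastStream ships. In-memory queues stand in for the request and reply topics; none of this is real Kafka or FastStream code, and every name is invented.

```python
import queue
import uuid

# Stand-ins for the request topic and per-request reply channels.
requests: queue.Queue = queue.Queue()
replies: dict[str, queue.Queue] = {}

def push_and_wait(payload: dict, timeout: float = 5.0) -> dict:
    """Producer side: tag the message with a correlation ID, register a
    reply slot, publish, then block until the reply arrives or the
    timeout fires."""
    correlation_id = str(uuid.uuid4())
    replies[correlation_id] = queue.Queue(maxsize=1)
    requests.put({"correlation_id": correlation_id, "payload": payload})
    try:
        return replies[correlation_id].get(timeout=timeout)
    finally:
        del replies[correlation_id]  # avoid leaking abandoned slots

def consume_one() -> None:
    """Consumer side: process one message and route the result back to
    whichever caller is waiting on that correlation ID."""
    msg = requests.get()
    result = {"indexed": True, "item": msg["payload"]}  # customer logic here
    replies[msg["correlation_id"]].put(result)
```

Even this toy skips the hard parts: reply-topic partitioning, consumer rebalances mid-wait, ack semantics, cleanup of replies to requests that already timed out. That is the maintenance-forever tail the library absorbs.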

The instinct to write it ourselves is always there. “Our case is slightly different” is true and almost never matters. FastStream handles the 95% and we live with the 5%, which is much cheaper than owning 100% of a Kafka RPC abstraction we would then have to debug at 2am.

Same project, opposite calls in the same document. The producer-agnostic principle was generally good and wrong for us; the Kafka library was generally good and right for us. The only way I know to tell those apart is to actually go and look at who pays the cost in your specific situation when the principle holds, and who pays it when it doesn’t. The principle does not tell you that. It cannot.