From Code to Architecture: Lessons from Six Years of Shipping
Five patterns I keep reaching for when systems get real — circuit breakers, idempotency, retries, observability, and the discipline of measuring before you ship.
- #architecture
- #microservices
- #production
- #patterns
The hardest thing about backend engineering isn't writing code. It's deciding what code not to write — and what discipline to add around the code you do.
I've spent six and a half years building production systems across telecom, fintech, govtech and edtech. The languages changed. The frameworks changed. The cloud providers changed. Five patterns kept showing up.
1. The boundary between services is a contract, not a suggestion
Every microservice migration I've seen go badly started with the same mistake: treating service boundaries as a code-organization problem, not a contract problem.
The test is simple. If a downstream team can ship a breaking change without your service knowing about it, you don't have a contract. You have an internal call masquerading as a network call — the worst of both worlds. You pay the latency cost of HTTP and the coupling cost of a shared module.
What it looks like when it's right:
- Schemas live in a versioned, language-agnostic format (OpenAPI, protobuf)
- Breaking changes get a /v2, never a same-version mutation
- Consumers can pin a version; producers can't yank one out from under them
- A test suite proves you serialize what you say you serialize
This is boring. It's also the reason your platform doesn't catch fire on a Friday.
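The serialization test in the last bullet doesn't need a framework. A minimal sketch, assuming a hypothetical `/users` v1 contract expressed as a field map (the field names and types here are illustrative, not from any real schema):

```typescript
// Hypothetical v1 contract for a /users response, as a field-to-type map.
type FieldType = "string" | "number" | "boolean";

const userV1Contract: Record<string, FieldType> = {
  id: "string",
  name: "string",
  createdAt: "number", // epoch millis
};

// Returns every contract violation in a payload: missing fields, wrong types.
// An empty array means the payload honors the contract.
function contractViolations(
  payload: Record<string, unknown>,
  contract: Record<string, FieldType>
): string[] {
  const violations: string[] = [];
  for (const [field, expected] of Object.entries(contract)) {
    if (!(field in payload)) {
      violations.push(`missing field: ${field}`);
    } else if (typeof payload[field] !== expected) {
      violations.push(`wrong type for ${field}: expected ${expected}`);
    }
  }
  return violations;
}
```

In practice you'd generate the contract from the OpenAPI or protobuf schema rather than hand-write it; the point is that the check runs in CI, against real serialized output, on every change.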
2. Idempotency is a feature, not a hope
Networks fail. Retries happen. The question isn't whether your service will see the same request twice. The question is whether the second time hurts.
The cheapest implementation is a request-id header the caller generates; the server caches the result against it. If the same id arrives twice, the server returns the cached response and skips the side effect. Costs you a Redis key and a hash check. Saves you a duplicate charge or a duplicate row.
I've watched teams without this build elaborate compensation flows to undo what the second request did. They never work as well as not doing the work twice in the first place.
3. Retries are a contract too
Retries between services without a budget are how cascading failures start. Service A retries B three times, B retries C three times — that's nine attempts on C from one logical call. C is already struggling; now it's seeing 9× load.
Two rules I won't break:
- Bound the retry budget end-to-end, not per-hop. Pass a deadline header. Each hop checks it before retrying.
- Circuit-break aggressively. After N failures in a window, stop calling. Surface a 503 fast. The system recovers faster from a clean failure than from slow degradation.
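The deadline check is a few lines per hop. A sketch, assuming the first caller sets an absolute deadline (epoch millis) in a header and every hop reads it before retrying; the header name and headroom value are illustrative:

```typescript
// How much of the end-to-end budget is left, given an absolute deadline.
function remainingBudgetMs(deadlineEpochMs: number, now: number = Date.now()): number {
  return deadlineEpochMs - now;
}

// Retry only if enough budget remains to plausibly complete another attempt.
// minHeadroomMs is a guess at the cost of one more call; tune it per hop.
function shouldRetry(
  deadlineEpochMs: number,
  minHeadroomMs = 50,
  now: number = Date.now()
): boolean {
  return remainingBudgetMs(deadlineEpochMs, now) > minHeadroomMs;
}
```

Because the deadline is absolute rather than per-hop, the nine-attempts-on-C problem disappears: once the budget is gone, every hop stops retrying at once.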
Spring Cloud's circuit breaker (Resilience4j under the hood) does this in three lines of config. Most Node teams hand-roll something with a global counter and call it a day. Both work; the question is whether you're explicit about the budget or just hoping.
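For the hand-rolled version, here's roughly what being explicit about the budget looks like. This is a minimal sketch (open after N failures in a rolling window, fail fast during a cooldown, then let one probe through), not a substitute for Resilience4j:

```typescript
// Minimal circuit breaker: opens after maxFailures within windowMs,
// fails fast for cooldownMs, then half-opens to let one request probe.
class CircuitBreaker {
  private failures: number[] = []; // timestamps of recent failures
  private openedAt: number | null = null;

  constructor(
    private maxFailures: number,
    private windowMs: number,
    private cooldownMs: number
  ) {}

  allowRequest(now: number = Date.now()): boolean {
    if (this.openedAt !== null) {
      if (now - this.openedAt < this.cooldownMs) return false; // fail fast: surface the 503
      this.openedAt = null; // half-open: one probe gets through
      this.failures = [];
    }
    return true;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.maxFailures) this.openedAt = now;
  }

  recordSuccess(): void {
    this.failures = [];
  }
}
```

What the global-counter hack usually misses is the rolling window and the half-open probe, which is exactly the part that lets the system recover cleanly instead of flapping.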
4. Observability before incidents, not during
The single biggest predictor of how a system survives an incident is whether the operator can see what's happening right now. Not yesterday's logs. Now.
The minimum viable trio:
- Structured logs with a request-id propagated through every service touched
- One latency dashboard with p50 / p95 / p99 per route, refreshed in seconds
- Error rate per dependency, not just per service — so you know which downstream is the problem before users do
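The first bullet is the one teams skip because it looks trivial. One JSON object per line, with the request id as a first-class field, is all it takes for a log aggregator to stitch a request across services. A sketch, with illustrative field names (`requestId`, `service`, `msg` are my convention here, not a standard):

```typescript
// One structured log entry. The requestId is propagated from the inbound
// request header so every service's lines for one request share it.
interface LogEntry {
  ts: string;
  level: "info" | "warn" | "error";
  requestId: string;
  service: string;
  msg: string;
}

// Emits one JSON object per line so an aggregator can filter by requestId.
function logLine(entry: Omit<LogEntry, "ts">): string {
  return JSON.stringify({ ts: new Date().toISOString(), ...entry });
}
```

Usage is a single call at each interesting point: `logLine({ level: "info", requestId, service: "payments", msg: "charge ok" })`. The discipline is refusing to log anything that lacks the request id.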
Add this before you have the incident. The pattern I've seen kill teams: rolling out fancy distributed tracing two weeks after a P0, while the lessons are still in postmortem format.
5. If it doesn't move a metric, don't ship it
Every system I've owned shipped with a number attached. Throughput +30%. Manual processing −70%. Loan approval time −50%. 35+ enterprise integrations live.
Not because metrics make engineers feel important. Because the discipline of choosing a metric before you start coding forces you to know what "done" means. Every one of those numbers existed in a spec before the first line of code did. The code was the cheap part. The agreement on the metric was the expensive part.
The corollary: if you can't articulate the metric, you don't understand the work yet. Go back and ask.
Six years compresses to this: code is the easy part. The hard part is the discipline you build around the code — boundaries, idempotency, retries, observability, metrics — that lets the code keep working when the world gets noisy.
Everything else is a tool to serve that.