Nam Nguyen

Design Principles for Building Kubernetes Platforms at Scale

(Or: How I Learned to Stop Worrying and Love the Control Plane)

Kubernetes works great.

Then you add more teams.
Then more clusters.
Then more CRDs.
Then everything catches fire — very politely, and with excellent YAML.

This article is a survival manual.
Not a tutorial.
Not a vendor pitch.
Just a list of things I wished I knew before working with K8s in Hana Cloud teams.

1. Scale Objects, Not Just Nodes

Kubernetes can handle thousands of nodes.

What it cannot handle is you treating it like:

a database
a queue
a workflow engine
a diary of your feelings (via Events)

Every Pod, Job, CRD, and ConfigMap is an object. Objects live in etcd. etcd has feelings, and they are fragile.

Rule of thumb:
If your architecture diagram looks like “just add more Pods,” you are already in danger.

2. The Control Plane Is Not Immortal

The control plane is not magic. It is a small number of processes trying very hard not to disappoint you.

Every controller you add:

watches things
reconciles things
writes things back

At scale, this becomes:

“A polite but relentless DDoS you built yourself.”

Stability beats cleverness. Always.

3. One Cluster Will Not Save You

At first, one cluster feels elegant. Later, it feels like putting every organ in one jar.

Eventually:

upgrades take weeks
failures are mysterious
everyone is afraid to touch anything

Multi-cluster isn’t advanced. It’s acceptance.

Design for:

sharding
partial failure
the fact that clusters will drift like continents

4. Kubernetes Is Not Your Platform

Kubernetes is a control plane.

That’s it. That’s the sentence.

If your developers must understand:

Pods
Services
Ingress
RBAC
NetworkPolicies

…then congratulations, you’ve successfully outsourced your platform team to Stack Overflow.

Expose intent, not plumbing. Your users are humans, not controllers.

5. Autoscaling Is Not a Wish Granting System

Autoscaling feels amazing right up until it:

scales too late
scales too much
scales the wrong thing
empties your cloud budget

CPU is a terrible proxy for “user pain.” Reactive scaling is always late. Unbounded scaling is just a slow outage.

Autoscaling needs:

limits
priorities
adult supervision

6. Networking Will Betray You Quietly

Networking problems do not scream.

They whisper. They degrade. They wait for peak traffic.

Then everything times out simultaneously.

Be careful with:

per-service load balancers
“mesh everywhere”
unlimited east-west traffic

Networking is not configuration. It is capacity planning with consequences.

7. Scheduling Is a Physics Problem

The scheduler is smart. Physics is smarter.

At scale you will get:

fragmentation
wasted capacity
weird node shapes
workloads that fit nowhere

Perfect bin-packing is impossible. Accept this. Design for interruption. Make peace with preemption.

8. Observability Can Kill the Patient

Metrics are great. Logs are great. Traces are great.

All of them at once, at full resolution, forever — not great.

High cardinality will bankrupt you. Dashboards will lie to you. Alerts will be ignored.

At scale:

sample aggressively
aggregate early
pick a few signals that actually matter

More data does not mean more understanding. Ask any Mars rover.

9. Governance Must Be Mostly Invisible

If governance slows people down, they will go around it. If it blocks them, they will remove it. If it surprises them, they will hate you.

Good governance feels like:

“this just works”
“why would I do it another way?”

Defaults beat policies. Early failure beats runtime enforcement. Clarity beats cleverness.

10. CI/CD Is Part of Production

Deployments generate load. They contend for resources. They fail in exciting new ways.

At scale:

synchronized deploys are dangerous
rollbacks are slow
GitOps controllers need tuning like any other workload

Your cluster doesn’t care whether load comes from users or CI. Neither should you.

11. Cost Is a Technical Signal

Kubernetes hides cost exceptionally well. This is both impressive and alarming.

Requests, limits, replicas, and autoscaling are economic decisions. If engineers cannot see cost, they cannot optimize it.

Efficiency is not a finance problem. It is a platform design problem.

12. Platform Design Is Organizational Design

Kubernetes will scale your systems faster than your teams.

Without:

clear ownership
good defaults
boring documentation

…your platform team becomes mission control during a meteor shower.

Your platform is a product. Your users are engineers. Design like it matters — because it does.

Final Notes from Orbit

Kubernetes at scale is not about YAML. It is not about tools. It is not about clever operators.

It is about:

respecting limits
designing boundaries
and staying calm while everything is technically on fire

If you do it right, Kubernetes becomes boring.

And boring is success.