Nam Nguyen

Design Principles for Building Kubernetes Platforms at Scale

(Or: How I Learned to Stop Worrying and Love the Control Plane)

Kubernetes works great.

Then you add more teams.
Then more clusters.
Then more CRDs.
Then everything catches fire — very politely, and with excellent YAML.

This article is a survival manual.
Not a tutorial.
Not a vendor pitch.
Just a list of things I wished I knew before working with K8s in Hana Cloud teams.


1. Scale Objects, Not Just Nodes

Kubernetes can handle thousands of nodes.

What it cannot handle is you treating it like:

Every Pod, Job, CRD, and ConfigMap is an object. Objects live in etcd. etcd has feelings, and they are fragile.

Rule of thumb:
If your architecture diagram looks like “just add more Pods,” you are already in danger.


2. The Control Plane Is Not Immortal

The control plane is not magic. It is a small number of processes trying very hard not to disappoint you.

Every controller you add:

At scale, this becomes:

“A polite but relentless DDoS you built yourself.”

Stability beats cleverness. Always.


3. One Cluster Will Not Save You

At first, one cluster feels elegant. Later, it feels like putting every organ in one jar.

Eventually:

Multi-cluster isn’t advanced. It’s acceptance.

Design for:


4. Kubernetes Is Not Your Platform

Kubernetes is a control plane.

That’s it. That’s the sentence.

If your developers must understand:

…then congratulations, you’ve successfully outsourced your platform team to Stack Overflow.

Expose intent, not plumbing. Your users are humans, not controllers.


5. Autoscaling Is Not a Wish Granting System

Autoscaling feels amazing right up until it:

CPU is a terrible proxy for “user pain.” Reactive scaling is always late. Unbounded scaling is just a slow outage.

Autoscaling needs:


6. Networking Will Betray You Quietly

Networking problems do not scream.

They whisper. They degrade. They wait for peak traffic.

Then everything times out simultaneously.

Be careful with:

Networking is not configuration. It is capacity planning with consequences.


7. Scheduling Is a Physics Problem

The scheduler is smart. Physics is smarter.

At scale you will get:

Perfect bin-packing is impossible. Accept this. Design for interruption. Make peace with preemption.


8. Observability Can Kill the Patient

Metrics are great. Logs are great. Traces are great.

All of them at once, at full resolution, forever — not great.

High cardinality will bankrupt you. Dashboards will lie to you. Alerts will be ignored.

At scale:

More data does not mean more understanding. Ask any Mars rover.


9. Governance Must Be Mostly Invisible

If governance slows people down, they will go around it. If it blocks them, they will remove it. If it surprises them, they will hate you.

Good governance feels like:

Defaults beat policies. Early failure beats runtime enforcement. Clarity beats cleverness.


10. CI/CD Is Part of Production

Deployments generate load. They contend for resources. They fail in exciting new ways.

At scale:

Your cluster doesn’t care whether load comes from users or CI. Neither should you.


11. Cost Is a Technical Signal

Kubernetes hides cost exceptionally well. This is both impressive and alarming.

Requests, limits, replicas, and autoscaling are economic decisions. If engineers cannot see cost, they cannot optimize it.

Efficiency is not a finance problem. It is a platform design problem.


12. Platform Design Is Organizational Design

Kubernetes will scale your systems faster than your teams.

Without:

…your platform team becomes mission control during a meteor shower.

Your platform is a product. Your users are engineers. Design like it matters — because it does.


Final Notes from Orbit

Kubernetes at scale is not about YAML. It is not about tools. It is not about clever operators.

It is about:

If you do it right, Kubernetes becomes boring.

And boring is success.