Star Computers

Cloud, systems, and the plumbing in between

A field guide to the seams between cloud providers, networks, Linux hosts, and internal tooling — where operational reliability is actually built.

Most infrastructure problems are not caused by the cloud provider, the container runtime, or the orchestrator. They are caused by the seams between them. This post walks through the layers where real operational reliability is won or lost — from hybrid connectivity down to the tooling your team builds for itself.

Cloud architecture and hybrid/multi-cloud connectivity

A sound cloud architecture starts by accepting that almost nobody runs a single-cloud environment anymore. Production workloads usually sit across a primary cloud, a secondary cloud for redundancy or specialized services, on-premises systems that refuse to retire, and a stack of SaaS platforms wired in through APIs and identity.

Two patterns dominate, and they are not interchangeable:

  • Hybrid cloud blends private infrastructure with public cloud, typically because of data gravity, regulatory pressure, or legacy systems that cannot be lifted cleanly.
  • Multi-cloud spreads workloads across multiple public providers to avoid lock-in, pick best-of-breed services, or meet regional availability needs.

What actually makes connectivity work

The networking layer is where most hybrid and multi-cloud designs quietly fail. A clean design usually includes:

  • Private interconnects or dedicated links instead of public-internet VPNs for anything latency-sensitive.
  • A consistent IP addressing plan across clouds and on-prem, with no overlapping CIDRs — ever (a mechanical overlap check is sketched after this list).
  • Centralized egress and inspection points so traffic flow is predictable and auditable.
  • A transit-hub model (cloud-native or appliance-based) rather than a sprawl of point-to-point tunnels.
  • Identity federation so engineers and workloads do not accumulate duplicate credentials across providers.
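
The CIDR rule is the easiest of these to enforce mechanically. Below is a minimal sketch using Python's standard ipaddress module; the environment names and ranges are made-up examples, not a recommended plan.

```python
# Minimal sketch: detect overlapping CIDRs across an address plan.
# The names and ranges below are hypothetical examples.
from ipaddress import ip_network
from itertools import combinations

address_plan = {
    "aws-prod": ip_network("10.0.0.0/16"),
    "gcp-prod": ip_network("10.1.0.0/16"),
    "on-prem": ip_network("10.2.0.0/16"),
    "azure-dr": ip_network("10.0.128.0/17"),  # overlaps aws-prod: the bug to catch
}

for (a, net_a), (b, net_b) in combinations(address_plan.items(), 2):
    if net_a.overlaps(net_b):
        print(f"OVERLAP: {a} {net_a} overlaps {b} {net_b}")
```

Run it in CI against the plan file and a bad merge never reaches a router.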

Opinionated guidance

Design for the 90% case, not the 10%. If one provider hosts most workloads, put the control plane there and treat the rest as spokes. Symmetrical multi-cloud architectures sound elegant in diagrams and cost a fortune in operational complexity.

Systems administration: Linux, virtualization, and containers

Cloud or not, Linux is still the substrate. Good systems administration has not changed as much as the industry likes to pretend — it has just moved up the stack.

Linux fundamentals that still matter

  • Know your init system, your log paths, and your package manager cold. Debugging starts there.
  • Treat time synchronization, DNS resolution, and certificate trust as first-class concerns. Most “weird” outages trace back to one of the three; a quick check for two of them is sketched after this list.
  • Build images, do not configure servers. A base image plus minimal runtime config is far easier to reason about than a server patched in place for three years.
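
Two of those three concerns are cheap to check continuously. A minimal standard-library sketch that times DNS resolution and reads the TLS certificate expiry for a host; the hostname is a placeholder:

```python
# Minimal sketch: DNS resolution timing and TLS certificate expiry.
import socket, ssl
import time
from datetime import datetime, timezone

HOST = "example.com"  # placeholder; point at your own endpoints

# DNS: resolve through the system resolver and time it
start = time.monotonic()
addr = socket.getaddrinfo(HOST, 443)[0][4][0]
print(f"DNS: {HOST} -> {addr} in {(time.monotonic() - start) * 1000:.1f} ms")

# TLS: complete a verified handshake and read the certificate expiry
ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        not_after = tls.getpeercert()["notAfter"]
expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
print(f"TLS: certificate expires in {(expiry - datetime.now(timezone.utc)).days} days")
```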

Virtualization

Hypervisors are still the foundation of private cloud and most on-prem stacks. Keep clusters boring: consistent hardware generations, predictable storage tiers, and separation between management, storage, and workload networks. Snapshots are not backups, and live migration is not a disaster recovery strategy.

Containers

Containers are excellent at packaging and terrible at pretending to be VMs. Use them for what they are good at:

  • Immutable, reproducible application runtimes.
  • Short-lived workloads that scale horizontally.
  • Clear separation between image (build-time) and configuration (run-time).

Run containers with sensible defaults: non-root users, read-only filesystems where possible, resource limits on every workload, and minimal base images. A container that cannot survive being killed at any moment is not production-ready.
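
The kill-survival test is worth making concrete. Container runtimes typically send SIGTERM, wait out a grace period, then SIGKILL. A minimal sketch of a worker that cooperates: finish the unit of work in flight, then exit cleanly.

```python
# Minimal sketch: a container-friendly worker that shuts down gracefully
# on SIGTERM instead of dying mid-task.
import signal, sys, time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # finish the current unit of work, then exit

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    time.sleep(1)  # stand-in for one unit of idempotent work

print("SIGTERM received, exiting cleanly")
sys.exit(0)
```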

Networking fundamentals for engineers

Cloud engineers who do not understand networking end up building very expensive mistakes. The primitives are the same as they were twenty years ago.

VLANs and segmentation

Flat networks are a liability. Segment by trust level and blast radius, not by org chart. A reasonable baseline:

  • Separate management, production, and user networks.
  • Isolate anything internet-facing from anything internal by default.
  • Put sensitive systems (backups, identity, secrets, observability) on their own segments with tightly controlled ingress.

Microsegmentation is worth the effort where it matters — east-west traffic inside a data center or VPC is where attackers move once they are inside.

DNS

DNS is both the easiest thing to ignore and the most common cause of multi-hour outages. Run internal DNS with redundancy, short-but-not-zero TTLs for things that change, and monitoring on resolution latency and NXDOMAIN rates. Split-horizon setups deserve documentation that a tired on-call engineer can actually follow at 3 a.m. For a deeper take, see the DNS propagation myth.
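
Resolution latency and NXDOMAIN rates are simple to probe from the hosts that actually do the resolving, which measures what applications experience, cache included. A minimal standard-library sketch with placeholder names:

```python
# Minimal sketch of a DNS health probe: resolution latency plus failure
# counting, via the system resolver. Names are placeholders.
import socket, time

def probe(name: str) -> None:
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
        status = "ok"
    except socket.gaierror:
        status = "FAILED"  # gaierror covers NXDOMAIN and other resolution failures
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{name}: {status} in {elapsed_ms:.1f} ms")

for name in ("app.internal.example", "db.internal.example"):
    probe(name)
```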

Reverse proxies

A good reverse-proxy layer does more than route traffic. It centralizes TLS termination, adds consistent logging, enforces rate limits, and gives you a single place to shed load when something downstream is struggling. Keep the configuration declarative and version-controlled.
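
Most proxy rate limits are some variant of a token bucket. The sketch below shows the core algorithm, not any particular proxy's implementation:

```python
# Minimal token-bucket sketch: the core idea behind most proxy rate limits.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate        # tokens refilled per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request

bucket = TokenBucket(rate=100, burst=20)  # ~100 req/s steady, bursts of 20
print("first request allowed:", bucket.allow())
```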

VPNs

Site-to-site VPNs are fine for connecting networks. User VPNs are increasingly a liability — they grant broad network access when what users actually need is access to specific applications. Zero-trust access models replace the flat VPN with per-application authorization and are worth the migration effort.

Infrastructure as code best practices

IaC is now table stakes. Done well, it makes infrastructure reviewable, repeatable, and recoverable. Done poorly, it becomes a second codebase nobody understands.

What “done well” looks like

  • Version control everything. No exceptions, including the small stuff.
  • Modular design. Networks, compute, storage, and identity should be composable modules, not one monolithic template.
  • Separate state per environment. One blast radius per state file, not a single global state.
  • Plan before apply. Every change goes through a reviewed diff.
  • Drift detection. Detect out-of-band changes automatically and reconcile them deliberately (a scheduled-check sketch follows this list).
  • Secrets stay out. Use a secrets manager, never plaintext variables. See a pragmatic approach to AWS IAM policies for a related take on the identity side.
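
Drift detection can start as a scheduled job wrapped around terraform plan, whose -detailed-exitcode flag distinguishes "no changes" (exit 0) from "drift found" (exit 2). A minimal sketch, with the working directory as a placeholder:

```python
# Minimal drift-detection sketch around terraform plan.
# Exit codes with -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes.
import subprocess, sys

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
    cwd="/srv/iac/prod",  # placeholder: one state, one blast radius
    capture_output=True,
    text=True,
)

if result.returncode == 2:
    print("drift detected -- reconcile deliberately:")
    print(result.stdout)
    sys.exit(1)  # fail the scheduled job so a human looks at it
elif result.returncode == 1:
    print(f"plan failed: {result.stderr}", file=sys.stderr)
    sys.exit(2)
print("no drift")
```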

Anti-patterns to avoid

  • Wrapping every resource in a custom abstraction “for flexibility.” You are building a framework nobody asked for.
  • Hand-editing resources and promising to fix the code “later.”
  • Using IaC for one-off imperative tasks it was never meant to do.

The test of good IaC is simple: can a new engineer destroy and rebuild a non-production environment from scratch without calling anyone? If not, the code is not finished.

Observability that surfaces real issues

Most observability setups fail not because they collect too little, but because they collect too much of the wrong things and alert on all of it.

Reduce noise, increase signal

  • Alert on symptoms, not causes. A disk at 80% is not an incident. A user-facing endpoint returning errors is.
  • Every alert must be actionable. If there is no runbook and no action, it is not an alert — it is a dashboard widget.
  • Kill duplicate alerts. One incident should not generate forty notifications across five channels.
  • Set SLOs and alert on burn rate, not on raw thresholds (a worked example follows this list).
  • Review alerts quarterly. Delete the ones nobody acted on.
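
Burn rate is just the observed error ratio divided by the error budget: a burn rate of 1.0 spends the budget exactly over the SLO window, and anything much higher deserves attention. A worked example with made-up numbers:

```python
# Minimal burn-rate sketch for a 99.9% availability SLO.
SLO = 0.999             # availability target
ERROR_BUDGET = 1 - SLO  # 0.1% of requests may fail over the SLO window

def burn_rate(errors: int, requests: int) -> float:
    """How many times faster than 'exactly on budget' we are failing."""
    return (errors / requests) / ERROR_BUDGET

# e.g. 60 failures out of 10,000 requests in the last hour
print(f"burn rate: {burn_rate(errors=60, requests=10_000):.1f}x")
# 6.0x: at this pace a 30-day budget is gone in ~5 days, which merits attention
```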

The three signals

Logs, metrics, and traces each answer a different question: what happened, how much, and where. Treat them as complements, not substitutes. Structured logs, cardinality-conscious metrics, and sampled traces beat firehoses of unstructured data every time.

Operations: backups, patching, change control, incident response

This is the unglamorous work that decides whether your infrastructure is dependable.

Backups

A backup you have not restored is a hope, not a backup. The rules are boring and correct:

  • Follow a 3-2-1 pattern: three copies, two media, one offsite.
  • Test restores on a schedule, not when you need them (a verification sketch follows this list).
  • Protect backups from the same credentials that run production. Ransomware resilience depends on it.
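
A scheduled restore test can be as simple as restoring to a scratch path and verifying the result against a checksum manifest written at backup time. A sketch, with the restore command, paths, and manifest format as placeholders for your own tooling:

```python
# Minimal restore-test sketch: restore to scratch, verify checksums.
import hashlib, json, pathlib, subprocess

SCRATCH = pathlib.Path("/tmp/restore-test")        # placeholder scratch target
MANIFEST = pathlib.Path("/backups/manifest.json")  # assumed {path: sha256} map

# placeholder: substitute your actual backup tool's restore command
subprocess.run(["your-backup-tool", "restore", "--target", str(SCRATCH)], check=True)

manifest = json.loads(MANIFEST.read_text())
failures = 0
for rel_path, expected in manifest.items():
    actual = hashlib.sha256((SCRATCH / rel_path).read_bytes()).hexdigest()
    if actual != expected:
        failures += 1
        print(f"MISMATCH: {rel_path}")
print(f"{len(manifest) - failures}/{len(manifest)} files verified")
```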

Patching

Patching should be routine, not heroic. Tier systems by risk, patch lower tiers automatically, and gate higher tiers through a change window. Track mean time to patch for critical CVEs — it is one of the most honest indicators of operational health.
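
Mean time to patch is easy to compute once two dates are recorded per critical CVE: when a fix became available and when it reached production. A tiny sketch with sample data (not real CVEs):

```python
# Minimal mean-time-to-patch sketch over sample data.
from datetime import date

# (id, fix available, patched in production) -- hypothetical entries
patch_log = [
    ("CVE-2024-0001", date(2024, 3, 1), date(2024, 3, 4)),
    ("CVE-2024-0002", date(2024, 3, 10), date(2024, 3, 18)),
]

days = [(patched - available).days for _, available, patched in patch_log]
print(f"mean time to patch: {sum(days) / len(days):.1f} days over {len(days)} CVEs")
```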

Change control

Heavy-handed change control kills velocity. No change control kills uptime. A reasonable middle is:

  • Lightweight peer review for low-risk, reversible changes.
  • Documented change windows for higher-risk work.
  • Clear rollback plans before anything ships.
  • A post-change verification step that is not “it looked fine.”

Incident response

Good incident response is rehearsed, not improvised. The essentials:

  • A clear on-call rotation with defined escalation paths.
  • A single incident commander per incident — not a committee.
  • Communication cadence separate from technical investigation.
  • Blameless postmortems that produce concrete, assigned action items.
  • A tracked backlog of those action items, actually completed.

Automation without new tech debt

Automation is not inherently good. Bad automation is worse than a manual process because it fails silently, at scale, and at 2 a.m.

Principles that hold up

  • Automate what is repetitive, well-understood, and low-variance. Novel work should stay manual until it is understood.
  • Make automation observable. Every automated action should be logged, attributable, and reviewable.
  • Build in guardrails. Rate limits, dry-run modes, and scoped permissions are not optional (a combined sketch follows this list).
  • Keep automation small and composable. One tool, one job.
  • Document the failure modes, not just the happy path.
  • Retire automation deliberately. Dead scripts become future incidents.
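
The dry-run and guardrail principles combine naturally in one pattern: act only on an explicit flag, and refuse to change more than a capped amount in one run. A minimal sketch, with the inventory and delete calls stubbed out as placeholders:

```python
# Minimal guardrail sketch: dry-run by default, explicit flag to act,
# hard cap on how much one run may change.
import argparse

MAX_DELETIONS_PER_RUN = 10  # guardrail: refuse to mass-delete

def find_stale_resources():
    # placeholder: query your real inventory here
    return ["vm-old-01", "disk-orphan-7"]

def delete(resource):
    # placeholder: call your provider's API here
    pass

parser = argparse.ArgumentParser(description="clean up stale resources")
parser.add_argument("--execute", action="store_true",
                    help="actually delete; the default is a dry run")
args = parser.parse_args()

stale = find_stale_resources()
if len(stale) > MAX_DELETIONS_PER_RUN:
    raise SystemExit(
        f"refusing to touch {len(stale)} resources; cap is {MAX_DELETIONS_PER_RUN}")

for resource in stale:
    if args.execute:
        delete(resource)
        print(f"deleted {resource}")
    else:
        print(f"would delete {resource} (dry run)")
```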

The goal is not to eliminate human judgment. It is to remove the repetitive work so humans can apply judgment where it matters.

Small, focused internal tooling

Every mature platform team eventually builds internal tools. The ones that succeed share a few traits.

What good internal tools look like

  • They solve one problem well. A tool that lists stale resources and offers to clean them up beats a “platform portal” that tries to do everything.
  • They respect existing workflows. Engineers already use terminals, chat, and pull requests. Meet them there.
  • They are boring to operate. Internal tools that require their own on-call rotation are a net loss.
  • They have a clear owner. Orphaned internal tools rot faster than external ones.
  • They are easy to delete. If the tool stops being useful, it should be trivial to remove.

Examples worth building

  • A resource inventory that reconciles what is deployed with what is in IaC (sketched after this list).
  • A cost allocation report that maps spend to teams and services.
  • An on-call handoff summary that pulls recent incidents, alerts, and deploys into one view.
  • A pre-flight checker for common misconfigurations before deploys.
  • A one-command environment teardown for ephemeral test environments.
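
The core of that resource inventory is a set difference between what the provider reports and what the code declares. A minimal sketch with made-up resource IDs:

```python
# Minimal reconciliation sketch. Both sets are made-up IDs standing in for
# a provider inventory API on one side and parsed IaC state on the other.
deployed = {"vm-web-1", "vm-web-2", "vm-debug-tmp", "bucket-logs"}
declared = {"vm-web-1", "vm-web-2", "bucket-logs", "bucket-assets"}

unmanaged = deployed - declared  # running, but owned by nobody in code
missing = declared - deployed    # in code, but not actually running

for r in sorted(unmanaged):
    print(f"unmanaged (import into IaC or clean up): {r}")
for r in sorted(missing):
    print(f"declared but absent (failed apply or manual delete?): {r}")
```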

Resist the urge to turn internal tools into products. They exist to serve the team, not to be demoed.

Closing

Cloud platforms get the spotlight, but the work that keeps real systems running lives in the layers between them — the networks, the automation, the backups, the tooling, the quiet operational discipline that nobody posts about. That middle ground is where reliability is actually built.

This blog will keep digging into that space: cloud, systems, and the plumbing in between.

Frequently asked questions

What is cloud infrastructure "plumbing"?
The layers between the big-ticket cloud services — networking, identity, DNS, TLS, IaC, observability, backups, change control — where most outages actually originate. Cloud providers rarely fail; the seams between them do.
What is the difference between hybrid cloud and multi-cloud?
Hybrid cloud combines private infrastructure (on-premises or colocated) with public cloud, usually because of data gravity, regulation, or legacy systems. Multi-cloud spreads workloads across multiple public providers to avoid lock-in, pick best-of-breed services, or meet regional availability requirements. They solve different problems and are not interchangeable.
What makes hybrid and multi-cloud networking reliable?
Private interconnects instead of public-internet VPNs for latency-sensitive traffic, a consistent non-overlapping IP plan across providers, centralized egress and inspection, a transit-hub topology rather than point-to-point tunnels, and identity federation so workloads and engineers do not accumulate duplicate credentials.
What is the biggest mistake teams make with Infrastructure as Code?
Wrapping every resource in a custom abstraction "for flexibility," which turns IaC into a second codebase nobody understands. Good IaC is modular, version-controlled, has one state file per environment, goes through reviewed plans before apply, and detects drift automatically. If a new engineer cannot destroy and rebuild a non-production environment without calling anyone, the code is not finished.
How should alerts be designed for real observability?
Alert on user-visible symptoms, not on resource-level causes. Every alert must be actionable and have a runbook — otherwise it is a dashboard widget. Prefer SLO burn-rate alerts over raw thresholds, deduplicate alerts so one incident does not generate forty notifications, and review alerts quarterly to delete the ones nobody acted on.
How do you avoid automation that becomes technical debt?
Automate only work that is repetitive, well-understood, and low-variance. Make every automated action logged, attributable, and reviewable. Build in guardrails — rate limits, dry-run modes, scoped permissions. Keep each tool small and single-purpose, document the failure modes, and retire dead automation deliberately before it becomes a future incident.
