What You’ll Learn: How to build a real Kubernetes cluster that can run production workloads, not just hello-world demos.
Prerequisites: You’ll need 7-8 machines (VMs or physical servers). This guide assumes you can provision these yourself—whether that’s creating VMs in Proxmox/ESXi, spinning up cloud instances, or using physical hardware. See specs below.
Why RKE2: It’s Kubernetes without the complexity overhead of managed services. You control everything. You understand everything. When something breaks at 2am, you know exactly where to look.
Time Investment: 2-3 days if you follow along, a week if you include chaos testing. Several weeks of trial and error if you figure it out yourself from scratch.
Table of Contents
Setup & Prerequisites
Building the Cluster
- Part 1: Preparing Your Nodes
- Part 2: Installing HAProxy Load Balancer
- Part 3: Installing RKE2 – First Control Plane Node
- Part 4: Adding Additional Control Plane Nodes
- Part 5: Adding Worker Nodes
- Part 6: Installing Rancher Management Platform
- Part 7: Installing MetalLB for LoadBalancer Services
Deploying Applications
Advanced Topics
Wrap-Up
Let’s build something real.
Prerequisites: What You Need Before Starting
Hardware Requirements:
Absolute Minimum (will work, but tight):
- Control Plane Nodes (3x): 2 vCPU, 4GB RAM, 40GB disk
- Worker Nodes (3x): 4 vCPU, 8GB RAM, 80GB disk
- HAProxy Load Balancer (1x): 1 vCPU, 1GB RAM, 20GB disk
- Total: 7 machines, 19 vCPU, 37GB RAM, 380GB disk
Recommended (what I actually run):
- Control Plane Nodes (3x): 4 vCPU, 8GB RAM, 80GB disk
- Worker Nodes (4x): 6-8 vCPU, 16-24GB RAM, 100-200GB disk
- HAProxy Load Balancer (1x): 2 vCPU, 2GB RAM, 20GB disk
- Total: 8 machines, 38-46 vCPU, 90-122GB RAM, 660-1,060GB disk
Why these specs? Control planes idle at 2-3GB RAM with Rancher installed. Workers consume 8-12GB with monitoring stack running. The recommended specs give you room to grow—you WILL add more workloads over time.
CRITICAL: Storage Performance Matters
Use SSDs or NVMe. Not HDDs.
etcd (the control plane database) is EXTREMELY sensitive to disk latency. On HDDs:
- Control plane nodes can take hours to stabilize (I’ve left clusters overnight)
- API server flickers constantly (nodes show Ready, then NotReady, then Ready again)
- Cluster feels sluggish and unreliable
- You’ll think something is broken (it’s just slow—patience required)
On SSDs/NVMe:
- Control plane nodes stabilize in 2-3 minutes
- Reliable, predictable performance
- Cluster feels responsive
I learned this the hard way. Spent hours debugging a “broken” cluster that was just running on spinning rust. Moved to SSDs, problems vanished.
Give your cluster time to stabilize: Even on SSDs, after first boot, wait 5-10 minutes before panicking. etcd needs to form quorum, CNI needs to initialize, pods need to schedule. Patience pays off.
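Not sure whether your disks are fast enough? The fio-based fdatasync benchmark commonly used to qualify etcd disks is a quick way to check. A sketch, assuming fio is installed and /var/lib/etcd-test sits on the disk that will hold /var/lib/rancher:
# Create a scratch directory on the target disk
mkdir -p /var/lib/etcd-test
# etcd-style small sequential writes with an fdatasync after every write
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd-test --size=22m --bs=2300 --name=etcd-perf
# Rule of thumb: the 99th percentile fdatasync latency should stay well under 10ms
rm -rf /var/lib/etcd-test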
Minimum Non-HA Setup (for learning only):
- Control Plane (1x): 4 vCPU, 8GB RAM, 80GB SSD
- Worker Nodes (2x): 6 vCPU, 16GB RAM, 100GB SSD
- Total: 3 machines, 16 vCPU, 40GB RAM, 280GB disk
What you lose:
- No HAProxy (single control plane = single point of failure)
- Control plane dies → entire cluster dies
- No etcd redundancy (quorum requires 3 nodes)
- Can’t do chaos testing (killing CP1 takes down everything)
- Not production-ready, learning-only setup
This guide assumes you want HA. If you’re just learning Kubernetes basics, start with k3s on 1-2 nodes instead.
Software:
- OS: SUSE Leap 15.6 (officially supported) or Ubuntu 22.04/24.04 LTS
- SSH access to all nodes
- Static IP addresses configured
- Root or sudo access
Note on SUSE Leap versions: Leap 16.0 was released with a significantly improved installation process—much more modern than 15.6’s outdated installer. However, RKE2 doesn’t officially support Leap 16 yet. I tested both in production. Leap 16 worked initially but hit compatibility issues with some RKE2 components during upgrades. Stick with 15.6 for production clusters. The ancient installer is annoying, but cluster stability matters more than installation UX.
Note on Ubuntu: 24.04 LTS works fine and is what I run in production for docker-compose deployments. For RKE2, both 22.04 and 24.04 are solid choices.
Network Setup:
192.168.1.121-123 → Control plane nodes
192.168.1.124-127 → Worker nodes
192.168.1.128 → HAProxy
Critical: Reserve these IPs in your DHCP server. Kubernetes hates it when IPs change.
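If you don't run internal DNS for yourdomain.lan, it helps to put the planned hostnames (you'll set them in Part 1) into /etc/hosts on every node and on your laptop. A sketch matching the IP plan above; the haproxy name is my own placeholder:
# Append to /etc/hosts on each machine
192.168.1.121 rke2-cp-01.yourdomain.lan rke2-cp-01
192.168.1.122 rke2-cp-02.yourdomain.lan rke2-cp-02
192.168.1.123 rke2-cp-03.yourdomain.lan rke2-cp-03
192.168.1.124 rke2-worker-01.yourdomain.lan rke2-worker-01
192.168.1.125 rke2-worker-02.yourdomain.lan rke2-worker-02
192.168.1.126 rke2-worker-03.yourdomain.lan rke2-worker-03
192.168.1.127 rke2-worker-04.yourdomain.lan rke2-worker-04
192.168.1.128 haproxy.yourdomain.lan haproxy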
Setting Up Your Local Machine for Remote Management
Before touching any nodes, set up your local machine so you can manage everything remotely. SSHing into nodes every time is tedious and error-prone.
SSH Key-Based Authentication
Generate SSH keys and copy them to all nodes:
# On your local machine (Mac/Linux)
# Generate key if you don't have one
ssh-keygen -t ed25519 -C "your-email@example.com"
# Copy key to all nodes (repeat for each IP)
ssh-copy-id root@192.168.1.121
ssh-copy-id root@192.168.1.122
ssh-copy-id root@192.168.1.123
ssh-copy-id root@192.168.1.124
ssh-copy-id root@192.168.1.125
ssh-copy-id root@192.168.1.126
ssh-copy-id root@192.168.1.127
ssh-copy-id root@192.168.1.128 # HAProxy
# Test passwordless SSH
ssh root@192.168.1.121 'hostname'
Why this matters: You'll run commands on these nodes hundreds of times. Typing passwords is hell.
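Optional quality-of-life step: short host aliases in ~/.ssh/config so you're not memorizing IPs. A minimal sketch; the alias names (cp1, worker1, ...) are my own convention:
# ~/.ssh/config on your local machine
Host cp1
  HostName 192.168.1.121
  User root
Host cp2
  HostName 192.168.1.122
  User root
Host cp3
  HostName 192.168.1.123
  User root
Host worker1
  HostName 192.168.1.124
  User root
# ...repeat for the remaining workers and the HAProxy node
Now it's just ssh cp1.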
Configure kubectl on Your Local Machine
After your first control plane is up (you’ll do this in Part 3), copy the kubeconfig to your local machine:
# On your local machine
mkdir -p ~/.kube
# Copy kubeconfig from first control plane
scp root@192.168.1.121:/etc/rancher/rke2/rke2.yaml ~/.kube/rke2-config
# Edit the config to use HAProxy IP instead of localhost
sed -i 's/127.0.0.1/192.168.1.128/g' ~/.kube/rke2-config
# Set KUBECONFIG environment variable (add to ~/.bashrc or ~/.zshrc)
export KUBECONFIG=~/.kube/rke2-config
# Test it
kubectl get nodes
Now you control your entire cluster from your laptop. No more SSHing into control plane nodes.
Handy kubectl Commands (Bookmark This)
You’ll use these constantly:
# Node management
kubectl get nodes # List all nodes
kubectl get nodes -o wide # Show IPs and versions
kubectl describe node rke2-cp-01 # Detailed node info
kubectl top nodes # Resource usage per node
# Pod management
kubectl get pods -A # All pods in all namespaces
kubectl get pods -n zantu # Pods in specific namespace
kubectl get pods -o wide # Show which node pods run on
kubectl describe pod <pod-name> -n zantu # Debug pod issues
kubectl logs -f <pod-name> -n zantu # Follow pod logs
kubectl logs <pod-name> -n zantu --previous # Logs from crashed container
# Namespace management
kubectl get namespaces # List all namespaces
kubectl create namespace myapp # Create namespace
kubectl delete namespace myapp # Delete namespace (careful!)
# Service and ingress
kubectl get svc -A # All services
kubectl get ingress -A # All ingresses
kubectl get endpoints -n zantu # See service endpoints
# Deployments and scaling
kubectl get deployments -n zantu # List deployments
kubectl scale deployment zantu --replicas=5 -n zantu # Scale up/down
kubectl rollout restart deployment zantu -n zantu # Restart deployment
kubectl rollout status deployment zantu -n zantu # Check rollout status
# Debugging
kubectl exec -it <pod-name> -n zantu -- /bin/sh # Shell into pod
kubectl port-forward -n zantu svc/zantu 8080:80 # Test service locally
kubectl get events -n zantu --sort-by='.lastTimestamp' # Recent events
kubectl get secrets -n zantu # List secrets in namespace
kubectl describe deployment zantu -n zantu | grep -A5 "Image Pull Secrets" # Check if secret is attached
kubectl get serviceaccount default -n zantu -o yaml # Check default SA for imagePullSecrets
# Cluster info
kubectl cluster-info # Cluster endpoints
kubectl get componentstatuses # Control plane health
kubectl api-resources # All available resources
Debugging ImagePullBackOff errors specifically:
# Check if ghcr-secret exists in namespace
kubectl get secret ghcr-secret -n zantu
# If missing, you'll see: Error from server (NotFound): secrets "ghcr-secret" not found
# Verify deployment references the secret
kubectl get deployment zantu -n zantu -o yaml | grep -A3 imagePullSecrets
# Should show:
# imagePullSecrets:
# - name: ghcr-secret
# Check pod events for pull errors
kubectl describe pod <pod-name> -n zantu | grep -A10 Events
You can also see this in Rancher UI:
- Go to your cluster → Workloads → Deployments
- Click on your deployment
- Scroll to “Image Pull Secrets” section
- If it shows “None” but you’re pulling from private registry → that’s your problem
Pro tip: Create aliases in your shell:
# Add to ~/.bashrc or ~/.zshrc
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgn='kubectl get nodes'
alias kga='kubectl get all -A'
alias kd='kubectl describe'
alias kl='kubectl logs -f'
Part 1: Preparing Your Nodes
System Preparation (All Nodes)
SSH into each node and run these commands. Yes, all of them. I learned the hard way what happens when you skip steps.
# Update system
zypper refresh && zypper update -y # SUSE
# OR
apt update && apt upgrade -y # Ubuntu
# Install required packages
zypper install -y curl wget git vim tmux # SUSE
# OR
apt install -y curl wget git vim tmux # Ubuntu
# Disable firewall (we'll configure it properly later)
systemctl disable --now firewalld # SUSE
# OR
ufw disable # Ubuntu
# Disable swap (Kubernetes requirement - non-negotiable)
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab
# Load required kernel modules
cat <<EOF > /etc/modules-load.d/k8s.conf
br_netfilter
overlay
EOF
modprobe br_netfilter
modprobe overlay
# Verify modules loaded
lsmod | grep br_netfilter
lsmod | grep overlay
# Configure sysctl parameters
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
Why this matters: I skipped the kernel modules once. Spent 4 hours debugging pod networking. Learn from my mistakes.
Set Hostnames (Important for Troubleshooting)
On each node:
# Control plane nodes
hostnamectl set-hostname rke2-cp-01.yourdomain.lan # .121
hostnamectl set-hostname rke2-cp-02.yourdomain.lan # .122
hostnamectl set-hostname rke2-cp-03.yourdomain.lan # .123
# Worker nodes
hostnamectl set-hostname rke2-worker-01.yourdomain.lan # .124
hostnamectl set-hostname rke2-worker-02.yourdomain.lan # .125
hostnamectl set-hostname rke2-worker-03.yourdomain.lan # .126
hostnamectl set-hostname rke2-worker-04.yourdomain.lan # .127
Part 2: Installing HAProxy Load Balancer
The load balancer sits in front of your control plane nodes. This is how you achieve true HA—one control plane dies, HAProxy routes to the others.
Install HAProxy
On your HAProxy node (192.168.1.128):
# Install HAProxy
zypper install -y haproxy # SUSE
# OR
apt install -y haproxy # Ubuntu
# Backup default config
cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.backup
# Create new config
cat > /etc/haproxy/haproxy.cfg << 'EOF'
global
log /dev/log local0
log /dev/log local1 notice
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
# Stats interface (optional but useful)
listen stats
bind *:9000
mode http
stats enable
stats uri /stats
stats refresh 10s
stats admin if TRUE
# RKE2 API Server (6443)
frontend rke2-api
bind *:6443
mode tcp
option tcplog
default_backend rke2-api-backend
backend rke2-api-backend
mode tcp
balance roundrobin
option tcp-check
server rke2-cp-01 192.168.1.121:6443 check
server rke2-cp-02 192.168.1.122:6443 check
server rke2-cp-03 192.168.1.123:6443 check
# RKE2 Registration (9345)
frontend rke2-registration
bind *:9345
mode tcp
option tcplog
default_backend rke2-registration-backend
backend rke2-registration-backend
mode tcp
balance roundrobin
option tcp-check
server rke2-cp-01 192.168.1.121:9345 check
server rke2-cp-02 192.168.1.122:9345 check
server rke2-cp-03 192.168.1.123:9345 check
EOF
# Enable and start HAProxy
systemctl enable haproxy
systemctl start haproxy
systemctl status haproxy
Test HAProxy: Open http://192.168.1.128:9000/stats in your browser. You should see the stats page. All backends will be down (red) until we install RKE2.
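Two quick checks worth knowing here. haproxy -c validates the config file before you (re)start the service; and once RKE2 is up in Part 3, you can poke the API port through HAProxy to prove the TCP path works (expect version JSON or a 401, either is fine):
# Validate config syntax (exits non-zero and points at the offending line on errors)
haproxy -c -f /etc/haproxy/haproxy.cfg
# Later, after Part 3: confirm the Kubernetes API answers through HAProxy
curl -k https://192.168.1.128:6443/version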
Part 3: Installing RKE2 – First Control Plane Node
This is where it gets real. The first control plane node initializes the cluster.
Install RKE2 on First Control Plane (192.168.1.121)
# Download and install RKE2
curl -sfL https://get.rke2.io | sh -
# Create RKE2 config directory
mkdir -p /etc/rancher/rke2
# Create configuration file
cat > /etc/rancher/rke2/config.yaml << 'EOF'
# Cluster configuration
token: YOUR-SECRET-TOKEN-CHANGE-THIS
tls-san:
- 192.168.1.128 # HAProxy IP
- 192.168.1.121 # This node's IP
- rke2-cp-01.yourdomain.lan
# Network configuration
cluster-cidr: 10.42.0.0/16
service-cidr: 10.43.0.0/16
cluster-dns: 10.43.0.10
# Disable components we don't need
disable:
- rke2-ingress-nginx # We'll use Traefik or custom ingress
# Enable components
cni:
- calico # Or canal, cilium - pick one CNI
# Kubelet configuration
kubelet-arg:
- "max-pods=110"
EOF
# Start RKE2
systemctl enable rke2-server.service
systemctl start rke2-server.service
# Wait for it to start (this takes 2-3 minutes)
journalctl -u rke2-server -f
Watch for: "Wrote kubeconfig" in the logs. That means it's ready.
Verify First Node
# Set up kubectl access
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin
# Check node status
kubectl get nodes
# Should see:
# NAME STATUS ROLES AGE VERSION
# rke2-cp-01 Ready control-plane,etcd,master 2m v1.28.x+rke2
If node shows NotReady: Wait 2 more minutes. CNI takes time to initialize.
Quick reference: See the “Handy kubectl Commands” section earlier for debugging commands. Most useful right now:
kubectl describe node rke2-cp-01 # Why is node NotReady?
kubectl get pods -n kube-system # Are core components running?
Get the Join Token
You’ll need this for other nodes:
cat /var/lib/rancher/rke2/server/node-token
Save this token. You'll use it for all other nodes.
Part 4: Adding Additional Control Plane Nodes
Now we add redundancy. This is what makes your cluster production-grade.
On Second Control Plane (192.168.1.122)
# Install RKE2
curl -sfL https://get.rke2.io | sh -
# Create config (NOTE: we point to HAProxy, not first node directly)
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml << 'EOF'
server: https://192.168.1.128:9345 # HAProxy IP!
token: YOUR-TOKEN-FROM-FIRST-NODE
tls-san:
- 192.168.1.128
- 192.168.1.122
- rke2-cp-02.yourdomain.lan
EOF
# Start service
systemctl enable rke2-server.service
systemctl start rke2-server.service
# Monitor logs
journalctl -u rke2-server -f
On Third Control Plane (192.168.1.123)
Same process, just change IPs and hostname in config.yaml:
server: https://192.168.1.128:9345
token: YOUR-TOKEN-FROM-FIRST-NODE
tls-san:
- 192.168.1.128
- 192.168.1.123
- rke2-cp-03.yourdomain.lan
Verify All Control Planes
From your local machine (remember, we set up kubectl earlier):
kubectl get nodes
You should see all 3 control plane nodes in Ready state.
Part 5: Adding Worker Nodes
Workers are where your actual applications run. This is where you need more resources.
On Each Worker Node
# Install RKE2 agent (not server!)
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
# Create config
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml << 'EOF'
server: https://192.168.1.128:9345
token: YOUR-TOKEN-FROM-FIRST-NODE
node-label:
- "node-role.kubernetes.io/worker=true"
EOF
# Start agent
systemctl enable rke2-agent.service
systemctl start rke2-agent.service
# Monitor
journalctl -u rke2-agent -f
Repeat for all worker nodes (.124, .125, .126, .127).
Verify Complete Cluster
kubectl get nodes -o wide
You should see:
- 3 control-plane nodes
- 4 worker nodes
- All showing Ready status
Useful commands for this stage:
kubectl get nodes -o wide # See IPs and Kubernetes versions
kubectl top nodes # Check resource usage
kubectl get pods -A -o wide # See all pods and which nodes they're on
Part 6: Installing Rancher Management Platform
Rancher gives you a UI to manage everything. It’s optional but incredibly useful.
Install Cert-Manager First (Required)
# Add Helm (if not installed)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Wait for it to be ready
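# Optional: block until the cert-manager deployments report Available before continuing
# (a sketch; the 300s timeout is an arbitrary choice)
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s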
kubectl get pods -n cert-manager
Install Rancher
# Add Rancher Helm repo
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update
# Create namespace
kubectl create namespace cattle-system
# Install Rancher
helm install rancher rancher-latest/rancher \
--namespace cattle-system \
--set hostname=rancher.yourdomain.lan \
--set replicas=3 \
--set bootstrapPassword=CHANGE-THIS-PASSWORD
# Wait for deployment
kubectl -n cattle-system rollout status deploy/rancher
Access Rancher UI
- Add DNS entry: rancher.yourdomain.lan → 192.168.1.128 (HAProxy)
- Or add to /etc/hosts:
192.168.1.128 rancher.yourdomain.lan
- Open browser: https://rancher.yourdomain.lan
- Login with bootstrap password
- Set new password when prompted
Note: we disabled rke2-ingress-nginx in the RKE2 config earlier, so the Rancher hostname only answers once an ingress controller (e.g., Traefik) is serving ports 80/443 and traffic can reach it (extra HAProxy frontends, or a MetalLB IP after Part 7).
You now have a Rancher-managed RKE2 cluster!
Part 7: Installing MetalLB for LoadBalancer Services
On bare metal (no cloud provider), Kubernetes LoadBalancer services stay in “Pending” state forever. MetalLB fixes this by providing load balancer functionality using your own IP pool.
Why MetalLB Matters
Without it, the only way to expose services externally is:
- NodePort (ugly, non-standard ports)
- Ingress (adds complexity for simple services)
With MetalLB, you get real LoadBalancer IPs just like cloud providers give you.
Install MetalLB
# Install MetalLB via manifest
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.0/config/manifests/metallb-native.yaml
# Wait for MetalLB pods to be ready
kubectl wait --namespace metallb-system \
--for=condition=ready pod \
--selector=app=metallb \
--timeout=90s
Configure IP Address Pool
You need to give MetalLB a pool of IPs it can assign. These should be:
- On the same network as your cluster
- Not used by DHCP
- Reserved/excluded from your router’s DHCP range
Example: Reserve 192.168.1.200-192.168.1.220
# metallb-config.yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: default-pool
namespace: metallb-system
spec:
addresses:
- 192.168.1.200-192.168.1.220
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
name: default
namespace: metallb-system
spec:
ipAddressPools:
- default-pool
Apply it:
kubectl apply -f metallb-config.yaml
Test MetalLB
Create a test service:
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer
Check the external IP:
kubectl get svc nginx
You should see an IP from your pool (e.g., 192.168.1.200). Open it in your browser—the nginx welcome page should appear.
MetalLB is now providing LoadBalancer IPs on your bare metal cluster.
Clean up test:
kubectl delete svc nginx
kubectl delete deployment nginx
Part 8: Deploying Your First Real Application
Let’s deploy something real—not hello-world. I’ll show you how to deploy a containerized application from GitHub Container Registry.
Full disclosure: The application I’m using as an example is Zantu, the multi-tenant SaaS platform I’m building. It’s my actual production application, not a toy demo.
Why does this matter? Because I’m showing you a real production setup with:
- High-availability database configuration
- Pod anti-affinity rules for resilience
- Actual resource limits based on real usage
- Monitoring and health checks that caught real bugs
Your application will be different, but the patterns are universal. Adapt the specifics to your needs.
⚠️ CRITICAL GOTCHA: If you’re pulling images from ghcr.io (or any private registry), you MUST create an imagePullSecret in each namespace. Without it, pods will show “ImagePullBackOff” errors and you’ll waste hours debugging. I know. I did it.
Example: Deploying a SaaS Application (Zantu)
Prerequisites:
- Docker image built and pushed to GitHub Container Registry (ghcr.io)
- Application requires PostgreSQL database
Create Namespace and Registry Secret
CRITICAL: Before deploying anything from GitHub Container Registry, you need credentials. I spent 3 hours debugging “ImagePullBackOff” errors before realizing this. Don’t be me.
# Create namespace
kubectl create namespace zantu
# Create secret for GitHub Container Registry
# You need a GitHub Personal Access Token (PAT) with read:packages permission
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=YOUR-GITHUB-USERNAME \
--docker-password=YOUR-GITHUB-PAT \
--docker-email=your-email@example.com \
-n zantu
How to get a GitHub PAT:
- GitHub → Settings → Developer Settings → Personal Access Tokens → Tokens (classic)
- Generate new token
- Select scope: read:packages
- Copy the token (you won't see it again!)
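Quick sanity check before wiring the token into Kubernetes, assuming you have Docker installed locally (substitute your own username and token):
# Verify the PAT can authenticate against ghcr.io
echo "YOUR-GITHUB-PAT" | docker login ghcr.io -u YOUR-GITHUB-USERNAME --password-stdin
# "Login Succeeded" means the token and its scope are good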
CRITICAL: Secrets Don’t Cross Namespaces
This secret only exists in the zantu namespace. If you:
- Create a new namespace → Need to recreate the secret
- Delete and recreate a namespace → Need to recreate the secret
- Deploy to multiple namespaces → Need the secret in EACH one
Example – What happens if you forget:
# You deploy to a new namespace without the secret
kubectl create namespace myapp
kubectl apply -f deployment.yaml -n myapp
# Your pods will fail with ImagePullBackOff
kubectl get pods -n myapp
# NAME READY STATUS RESTARTS AGE
# myapp-7d5f8c4b9-abcde 0/1 ImagePullBackOff 0 2m
# Check the error
kubectl describe pod myapp-7d5f8c4b9-abcde -n myapp
# Events:
# Failed to pull image "ghcr.io/user/app:latest":
# Error: pull access denied, authentication required
The fix – recreate the secret in the new namespace:
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=YOUR-GITHUB-USERNAME \
--docker-password=YOUR-GITHUB-PAT \
--docker-email=your-email@example.com \
-n myapp # Note: different namespace!
# Now restart the deployment to pull the image
kubectl rollout restart deployment myapp -n myapp
Pro tip – Create a script for this:
Save this as create-ghcr-secret.sh:
#!/bin/bash
# Usage: ./create-ghcr-secret.sh namespace-name
NAMESPACE=$1
GITHUB_USERNAME="your-username"
GITHUB_PAT="ghp_your_token_here" # Or use: read -sp "GitHub PAT: " GITHUB_PAT
GITHUB_EMAIL="your-email@example.com"
if [ -z "$NAMESPACE" ]; then
echo "Usage: $0 <namespace>"
exit 1
fi
echo "Creating ghcr-secret in namespace: $NAMESPACE"
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=$GITHUB_USERNAME \
--docker-password=$GITHUB_PAT \
--docker-email=$GITHUB_EMAIL \
-n $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
echo "Done! Secret created in $NAMESPACE"Make it executable:
chmod +x create-ghcr-secret.sh
Now whenever you create a new namespace that needs GitHub Container Registry access:
./create-ghcr-secret.sh production
./create-ghcr-secret.sh staging
./create-ghcr-secret.sh development
Why this is so annoying:
Kubernetes doesn’t share secrets across namespaces by design (security isolation). But this means if you delete a namespace to “clean up,” you lose the secret too. When you recreate the namespace, you must recreate the secret or your deployments will fail.
I’ve wasted hours on this. Now you won’t.
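One more shortcut: if the secret already exists in another namespace, you can copy it instead of digging out the PAT again. A rough sketch (assumes the secret lives in zantu and the target namespace is myapp; the sed expressions strip the server-side metadata before re-applying):
kubectl get secret ghcr-secret -n zantu -o yaml \
  | sed -e 's/namespace: zantu/namespace: myapp/' \
        -e '/resourceVersion:/d' -e '/uid:/d' -e '/creationTimestamp:/d' \
  | kubectl apply -f -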
Deploy PostgreSQL (Production HA Setup)
For production, you want high-availability PostgreSQL with streaming replication. This gives you:
- 1 primary (handles writes)
- 2+ replicas (handle reads, can be promoted to primary if primary fails)
Why not just replicas: 3 on a basic StatefulSet? That creates 3 independent databases, not 1 database with replication. Big difference.
Option 1: Use a Postgres Operator (Recommended)
The easiest production-ready approach is Zalando’s Postgres Operator:
# Add Zalando Helm repo
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm repo update
# Install the operator
helm install postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-operator --create-namespace
# Install the UI (optional but useful)
helm install postgres-operator-ui postgres-operator-charts/postgres-operator-ui \
--namespace postgres-operator
Then create a PostgreSQL cluster:
# postgres-cluster.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
name: zantu-postgres
namespace: zantu
spec:
teamId: "zantu"
volume:
size: 10Gi
numberOfInstances: 3 # 1 primary + 2 replicas
users:
zantu: # application user
- superuser
- createdb
databases:
zantu: zantu # database owned by user
postgresql:
version: "15"
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
# Pod anti-affinity - spread across workers
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
application: spilo
cluster-name: zantu-postgres
topologyKey: kubernetes.io/hostname
Apply it:
kubectl apply -f postgres-cluster.yaml
What this gives you:
- Automatic failover (if primary dies, replica promotes)
- Connection pooling (via pgBouncer)
- Backup/restore capabilities
- Read replicas for scaling reads
Get the connection string:
# Primary (read/write)
kubectl get secret zantu.zantu-postgres.credentials.postgresql.acid.zalan.do \
-n zantu -o jsonpath='{.data.password}' | base64 -d
# Connection string:
# postgresql://zantu:<password>@zantu-postgres:5432/zantu
Option 2: Manual StatefulSet with Replication (If You Want Control)
If you want to understand what the operator is doing under the hood:
# postgres-ha.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-config
namespace: zantu
data:
POSTGRES_DB: "zantu"
POSTGRES_USER: "zantu"
PGDATA: "/var/lib/postgresql/data/pgdata"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: zantu
spec:
serviceName: postgres
replicas: 3 # 1 primary + 2 replicas
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
# Pod anti-affinity - spread across workers
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- postgres
topologyKey: "kubernetes.io/hostname"
containers:
- name: postgres
image: postgres:15-alpine
ports:
- containerPort: 5432
name: postgres
envFrom:
- configMapRef:
name: postgres-config
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
- name: POSTGRES_REPLICATION_MODE
value: "master" # Override this for replicas in init script
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
volumeMounts:
- name: postgres-storage
mountPath: /var/lib/postgresql/data
livenessProbe:
exec:
command:
- pg_isready
- -U
- zantu
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
exec:
command:
- pg_isready
- -U
- zantu
initialDelaySeconds: 5
periodSeconds: 5
volumeClaimTemplates:
- metadata:
name: postgres-storage
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
name: postgres
namespace: zantu
spec:
selector:
app: postgres
ports:
- port: 5432
clusterIP: None # Headless service
---
apiVersion: v1
kind: Service
metadata:
name: postgres-primary
namespace: zantu
spec:
selector:
app: postgres
role: primary # Only route to the primary (note: nothing in this manifest sets this label automatically; an operator like Patroni manages it)
ports:
- port: 5432
---
apiVersion: v1
kind: Secret
metadata:
name: postgres-secret
namespace: zantu
type: Opaque
stringData:
password: "CHANGE-THIS-PASSWORD"Apply it:
kubectl apply -f postgres-ha.yaml
Understanding Pod Anti-Affinity:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- postgres
topologyKey: "kubernetes.io/hostname"What this does:
- requiredDuringScheduling → Kubernetes MUST place pods on different nodes
- matchExpressions: app=postgres → Don't put postgres pods together
- topologyKey: kubernetes.io/hostname → "Different nodes" means different hostnames
Result: If you have 3 postgres replicas and 4 workers:
- postgres-0 → worker-01
- postgres-1 → worker-02
- postgres-2 → worker-03
If worker-01 dies, postgres-0 dies, but postgres-1 and postgres-2 keep running on worker-02 and worker-03.
For my Zantu deployment, I use the Zalando operator. It handles replication, failover, backups automatically. The manual approach teaches you the concepts, but the operator is what you run in production.
Connection string for your app:
DATABASE_URL=postgresql://zantu:password@zantu-postgres:5432/zantu # Operator
DATABASE_URL=postgresql://zantu:password@postgres-primary:5432/zantu # Manual
Deploy Application
Now deploy Zantu with proper HA configuration:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: zantu
namespace: zantu
spec:
replicas: 3
selector:
matchLabels:
app: zantu
template:
metadata:
labels:
app: zantu
spec:
# Pod anti-affinity - spread across workers
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- zantu
topologyKey: kubernetes.io/hostname
imagePullSecrets:
- name: ghcr-secret
containers:
- name: zantu
image: ghcr.io/your-username/zantu:latest
ports:
- containerPort: 5000
env:
- name: DATABASE_URL
value: "postgresql://zantu:CHANGE-THIS@zantu-postgres:5432/zantu" # Or postgres-primary
- name: NODE_ENV
value: "production"
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: 5000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /readyz
port: 5000
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: zantu
namespace: zantu
spec:
selector:
app: zantu
ports:
- port: 80
targetPort: 5000
type: ClusterIP
Understanding the anti-affinity rule:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution: # "Try hard, but not required"
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- zantu
topologyKey: kubernetes.io/hostnameWhat this means:
- preferredDuringScheduling → Try to spread pods, but if Kubernetes can't (e.g., only 2 workers available but 3 replicas), place them anyway
- weight: 100 → High priority for spreading (1-100 scale)
- Result: Kubernetes tries to put each zantu pod on a different worker
Why “preferred” vs “required”?
- Required: Strict. If you have 3 replicas but only 2 workers, 3rd pod stays pending forever
- Preferred: Flexible. Spread when possible, but allow bunching if necessary
For application pods, use “preferred.” For stateful services like databases, use “required.”
Test the anti-affinity:
kubectl apply -f deployment.yaml
# Check pod distribution
kubectl get pods -n zantu -o wide
# You should see pods spread across different worker nodes:
# NAME READY STATUS RESTARTS AGE NODE
# zantu-7d5f8c4b9-abcde 1/1 Running 0 2m rke2-worker-01
# zantu-7d5f8c4b9-fghij 1/1 Running 0 2m rke2-worker-02
# zantu-7d5f8c4b9-klmno 1/1 Running 0 2m rke2-worker-03
Now kill a worker and watch:
You have two options:
Option 1: Graceful drain (boring but proper)
# From your local machine
kubectl drain rke2-worker-01 --ignore-daemonsets --delete-emptydir-data
# Then shut down the VM
# In Proxmox: Right-click → Shutdown
This gives pods time to migrate. Takes 30-60 seconds. Very civilized.
Option 2: Brutal kill (what actually happens in real failures)
# In Proxmox: Right-click → Stop (not shutdown, STOP)
# Or just pull the power cord from the physical server
This is what we're doing. We like to be brutal about it and just kill the VM, hoping for the best.
Because in production, servers don’t gracefully drain themselves before the power supply fails or the kernel panics. They just DIE.
Watch what happens:
# Shut down worker-01
# In Proxmox: Stop VM
# Watch pod rescheduling
kubectl get pods -n zantu -o wide -w
# The pod from worker-01 will reschedule to worker-04
# Your application stays up because pods on worker-02 and worker-03 keep running
This is production HA in action.
Apply it:
kubectl apply -f deployment.yaml
Verify Deployment
# Check pods
kubectl get pods -n zantu
# Check logs
kubectl logs -f deployment/zantu -n zantu
# Port-forward to test locally
kubectl port-forward -n zantu service/zantu 8080:80
Open http://localhost:8080 – your app should be running!
Automating Deployments with a Script
SSHing into nodes every time you update your app is tedious. Here’s a deployment script you can run from your local machine.
Create deploy.sh on your local machine:
#!/bin/bash
# Zantu Deployment Script - Run from your local machine
# Usage: ./deploy.sh [tag]
# Example: ./deploy.sh v1.2.3
set -e
# Configuration
IMAGE_NAME="ghcr.io/your-username/zantu"
TAG="${1:-latest}"
NAMESPACE="zantu"
DEPLOYMENT="zantu"
echo "=========================================="
echo " Deploying Zantu ${TAG}"
echo "=========================================="
# Check if kubectl is configured
if ! kubectl cluster-info &> /dev/null; then
echo "ERROR: kubectl is not configured or cluster is unreachable"
exit 1
fi
echo ""
echo "1. Checking current deployment status..."
kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE}
echo ""
echo "2. Updating image to ${IMAGE_NAME}:${TAG}..."
kubectl set image deployment/${DEPLOYMENT} \
${DEPLOYMENT}=${IMAGE_NAME}:${TAG} \
-n ${NAMESPACE}
echo ""
echo "3. Watching rollout status..."
kubectl rollout status deployment/${DEPLOYMENT} -n ${NAMESPACE}
echo ""
echo "4. Verifying pods are running..."
kubectl get pods -n ${NAMESPACE} -l app=${DEPLOYMENT}
echo ""
echo "=========================================="
echo " Deployment Complete!"
echo "=========================================="
echo ""
echo "Check logs: kubectl logs -f deployment/${DEPLOYMENT} -n ${NAMESPACE}"
echo "Rollback: kubectl rollout undo deployment/${DEPLOYMENT} -n ${NAMESPACE}"
echo ""Make it executable:
chmod +x deploy.sh
Usage:
# Deploy latest
./deploy.sh
# Deploy specific version
./deploy.sh v1.2.3
# Rollback if something breaks
kubectl rollout undo deployment/zantu -n zantu
What this does:
- Updates the deployment’s container image
- Triggers rolling update (zero downtime)
- Waits for new pods to be healthy
- Shows you the status
Pro tip: Add this to your CI/CD pipeline (GitHub Actions, GitLab CI) to auto-deploy on git push.
GitHub Actions example:
# .github/workflows/deploy.yml
name: Deploy to Kubernetes
on:
push:
branches: [ main ]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build and push Docker image
run: |
echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .
docker push ghcr.io/${{ github.repository }}:${{ github.sha }}
- name: Deploy to cluster
run: |
echo "${{ secrets.KUBECONFIG }}" > kubeconfig
export KUBECONFIG=./kubeconfig
kubectl set image deployment/zantu zantu=ghcr.io/${{ github.repository }}:${{ github.sha }} -n zantu
kubectl rollout status deployment/zantu -n zantu
Now your deployments are one git push away.
Part 9: Real-World Learnings (The Expensive Lessons)
etcd on NVMe: Worth It?
Short answer: Yes, if you have it. But understand the tradeoffs.
I moved etcd (the control plane database) to NVMe expecting massive performance gains. What I learned:
Consumer NVMe (Samsung 980 Pro, WD Black, etc):
- ✅ Perfect for testing and homelab
- ✅ Massive improvement over SATA SSDs for responsiveness
- ⚠️ Write amplification under constant etcd workload
- ⚠️ Degrades over time with small random writes
- 💰 Cost: $50-150 per drive
Enterprise NVMe (Intel P5800X, Micron 9300, Samsung PM series):
- ✅ Handles etcd write patterns without degradation
- ✅ Power-loss protection (critical for etcd consistency)
- ✅ 3-5x endurance rating vs consumer drives
- 💰 Cost: $400-600 per 480GB drive
Performance gain either way: 20-30% reduction in API response times under load.
My recommendation:
- Homelab/learning: Consumer NVMe is totally fine
- Production with real users: Enterprise NVMe or good SATA SSDs
- Mission-critical production: Enterprise NVMe with proper backups
Cost analysis for production:
- Enterprise 480GB NVMe: ~$400-600
- 3x drives for HA: ~$1,500
- Alternative: Quality SATA SSDs work great and cost 1/3 as much
Verdict: Consumer NVMe for testing/learning, enterprise for production. Or just use good SATA SSDs and skip the cost entirely—they’re fine for most use cases.
Chaos Engineering: What Actually Breaks
I deliberately broke my cluster in various ways. Here’s what I learned:
Test 1: Kill a control plane node
- Expected: Cluster continues normally
- Reality: 40-second hiccup while etcd re-elects leader
- Fix: Reduce etcd heartbeat interval (not recommended for most)
Test 2: Kill a worker node
- Expected: Pods reschedule to other workers
- Reality: 5-minute delay before Kubernetes notices node is down
- Fix: Configure node-monitor-period and node-monitor-grace-period (see the config sketch after these tests)
Test 3: Fill a worker’s disk
- Expected: Graceful degradation
- Reality: Node goes into EvictionHard state, kills ALL pods
- Fix: Monitor disk usage, set proper eviction thresholds (also in the sketch after these tests)
Test 4: Network partition (split-brain scenario)
- Expected: Cluster handles it gracefully
- Reality: Got two separate clusters until network restored
- Fix: Ensure odd number of control planes (3, not 2 or 4)
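The fixes for Tests 2 and 3 both live in RKE2's config.yaml, passed through to the relevant components. A sketch of what that could look like; the values are illustrative, not tuned recommendations:
# /etc/rancher/rke2/config.yaml on control plane nodes (sketch)
kube-controller-manager-arg:
  - "node-monitor-period=5s"                    # how often node status is checked
  - "node-monitor-grace-period=20s"             # how long before a silent node goes NotReady
kube-apiserver-arg:
  - "default-not-ready-toleration-seconds=60"   # reschedule pods after 60s instead of the default 300s
  - "default-unreachable-toleration-seconds=60"
# /etc/rancher/rke2/config.yaml on worker nodes (kubelet eviction thresholds)
kubelet-arg:
  - "eviction-hard=memory.available<500Mi,nodefs.available<10%"
Restart rke2-server / rke2-agent after editing, one node at a time.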
Resource Limits: The Hidden Gotcha
What the documentation says: “Set resource requests and limits on all pods”
What actually happens:
- Set limits too low → pods get OOMKilled constantly
- Set limits too high → waste resources, can’t schedule pods
- Set requests without limits → one pod can starve others
My approach after 100+ restarts:
# For typical web apps:
resources:
requests:
cpu: 100m # Actual average usage
memory: 256Mi # 2x average usage
limits:
cpu: 500m # 5x average, allows bursts
memory: 512Mi # 2x requests, prevents runaway
Monitor actual usage for a week, then adjust. Theoretical planning fails here.
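To gather those actuals, metrics-server's kubectl top is usually enough. A couple of commands I keep handy (the zantu namespace is just the running example):
# Per-container CPU/memory right now
kubectl top pods -n zantu --containers
# Crude but effective: sample it every minute and eyeball the range
watch -n 60 kubectl top pods -n zantu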
Chaos Monkey Exercise: Kill and Replace Control Plane 1
Why do this? Because someday CP1 WILL die. Better to learn how to handle it when you’re awake and caffeinated.
The scenario: Your first control plane node dies catastrophically. Maybe motherboard failure, maybe you accidentally formatted it while half-asleep. It happens.
Step 1: Verify cluster health before chaos
kubectl get nodes
kubectl get pods -A
Everything should be healthy. Take a screenshot—you'll want proof later.
Step 2: Kill CP1 (192.168.1.121)
Shut it down hard:
# On CP1 node
poweroff
Or in Proxmox: Right-click VM → Stop (not shutdown, STOP).
Step 3: Watch what happens
From CP2 or CP3:
# Watch nodes
kubectl get nodes -w
# Watch etcd leader election
kubectl get pods -n kube-system | grep etcd
What you'll observe:
- CP1 goes NotReady after ~40 seconds
- etcd re-elects leader (2-5 second hiccup in API)
- Cluster continues operating normally
- Pods keep running on workers
- Rancher UI might hiccup briefly but recovers
This is HA working correctly.
Step 4: Replace CP1 with fresh node
Now the fun part—rebuilding a control plane node from scratch while cluster is live.
Option A: Rebuild same node
- Reinstall OS on CP1 (192.168.1.121)
- Follow “Part 4: Adding Additional Control Plane Nodes” steps
- Point to HAProxy:
server: https://192.168.1.128:9345
- Use same token from surviving control planes
- Start rke2-server
The new CP1 will join the existing cluster and sync etcd from CP2/CP3.
Option B: Replace with different node
Maybe CP1 hardware is dead. Build a new VM/server:
- Assign NEW IP (e.g., 192.168.1.129)
- Install OS, hostname: rke2-cp-04
- Follow control plane join procedure
- After it’s healthy, update HAProxy config to remove old .121, add new .129
# On HAProxy node, edit config
vi /etc/haproxy/haproxy.cfg
# In both backend sections, replace:
# server rke2-cp-01 192.168.1.121:6443 check
# server rke2-cp-01 192.168.1.121:9345 check
# With:
# server rke2-cp-04 192.168.1.129:6443 check
# server rke2-cp-04 192.168.1.129:9345 check
# Reload HAProxy (no downtime!)
systemctl reload haproxy
Step 5: Clean up old node from cluster
Once new control plane is healthy:
# Delete old node from Kubernetes
kubectl delete node rke2-cp-01
# Verify cluster health
kubectl get nodes
kubectl get componentstatuses
Step 6: Verify etcd cluster
# On any surviving control plane
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
member list
# Should show 3 healthy members (2 old + 1 new)
What you learned:
✅ Control plane nodes are replaceable (they’re cattle, not pets)
✅ etcd quorum (2/3) keeps cluster alive
✅ HAProxy makes the replacement transparent to clients
✅ You can rebuild infrastructure without taking down applications
Do this exercise quarterly. Muscle memory matters when production breaks.
Part 10: Maintenance & Operations
Backing Up etcd (Critical!)
You will lose data if you don’t do this.
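Good news: rke2-server can take scheduled snapshots itself, so your cronjob can be reduced to copying files off the node. A sketch for config.yaml (the schedule and retention values are just examples):
# /etc/rancher/rke2/config.yaml on control plane nodes (sketch)
etcd-snapshot-schedule-cron: "0 */6 * * *"   # snapshot every 6 hours
etcd-snapshot-retention: 10                  # keep the last 10 snapshots
# Snapshots land under /var/lib/rancher/rke2/server/db/snapshots by default
For an on-demand snapshot (say, right before an upgrade), the manual etcdctl route still works: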
# On a control plane node
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key
# Copy snapshot to safe location
# Automate this with a cronjob!
Upgrading RKE2
Never upgrade all nodes at once. Never.
# Upgrade first control plane
systemctl stop rke2-server
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.28.x+rke2r1 sh -
systemctl start rke2-server
# Wait 5 minutes, verify cluster health
kubectl get nodes
# Repeat for other control planes one at a time
# Then upgrade workers one at a time
Monitoring Stack (Prometheus + Grafana)
Don’t skip this. You need metrics to understand what’s actually happening in your cluster.
Easy way (via Rancher UI):
- Open Rancher → Your Cluster
- Go to “Apps” → “Charts”
- Search for “Monitoring”
- Click “Install”
- Set Grafana admin password
- Click “Install”
Done. Rancher configures everything correctly. Access Grafana through Rancher’s UI.
Manual way (via Helm):
If you want to understand what Rancher is doing under the hood:
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install the full stack (Prometheus + Grafana + AlertManager + exporters)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set grafana.adminPassword=CHANGE-THIS-PASSWORD
# Wait for pods to be ready (takes 2-3 minutes)
kubectl get pods -n monitoring -w
Access Grafana:
# Port-forward Grafana to your local machine
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
Open http://localhost:3000
- Username: admin
- Password: Whatever you set in --set grafana.adminPassword
Pre-built dashboards: Grafana comes with dashboards already configured. Check:
- Kubernetes / Compute Resources / Cluster
- Kubernetes / Compute Resources / Namespace (Pods)
- Node Exporter / Nodes
What you’re monitoring:
- Cluster resource usage (CPU, memory, disk)
- Pod health and restarts
- Node health
- etcd performance
- API server latency
Set up alerts:
The stack includes AlertManager but you need to configure it:
# Edit AlertManager config
kubectl -n monitoring edit secret alertmanager-kube-prometheus-stack-alertmanager
Add your notification channels (Slack, email, PagerDuty). This is critical for production—you need to know when things break before your users do.
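For reference, a minimal AlertManager config sketch with a Slack receiver; the webhook URL and channel are placeholders, and with kube-prometheus-stack you'd normally set this through Helm values rather than hand-editing the secret:
# alertmanager.yaml (sketch)
global:
  resolve_timeout: 5m
route:
  receiver: slack-notifications
  group_by: ['alertname', 'namespace']
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#alerts'
        send_resolved: true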
What You’ve Built
You now have:
✅ High-availability Kubernetes cluster (3 control planes)
✅ Load-balanced control plane (HAProxy)
✅ 4 worker nodes ready for workloads
✅ Rancher management UI
✅ MetalLB for LoadBalancer services on bare metal
✅ Real application deployed from GitHub
✅ Understanding of failure modes through chaos testing
✅ Maintenance procedures that actually work
The Real Lessons
Building this cluster taught me more about Kubernetes than any certification ever could.
Not because I followed a perfect plan. Because I:
- Built something I actually needed (my hosting platform)
- Shipped fast, iterated based on reality
- Broke things deliberately to understand failure modes
- Documented the expensive lessons so you don’t repeat them
Total cost: ~$0 if you have hardware, ~$100-200/month if using cloud VMs for learning
Time investment: 2-3 days following this guide for basic deployment. I spent a week doing 12-hour days including all the chaos testing, breaking things deliberately, and documenting every mistake. You get to skip my debugging sessions.
ROI: You now understand infrastructure at a level most developers never reach.
Questions Worth Asking
“Why not just use k3s?”
Valid option. RKE2 is production-hardened for enterprise. k3s is lighter. Pick based on your requirements.
“Why HAProxy instead of MetalLB for control plane?”
HAProxy gives you a single, stable endpoint for the control plane. MetalLB is for application services. Different jobs.
“Isn’t this overengineered?”
Depends on your goal. If you want to run containers, yes. If you want to understand infrastructure deeply enough to troubleshoot production failures, no.
Compare this to running the same stack with docker-compose:
- Setup time: 5 minutes vs 2 days
- What you learn: How to run containers vs How infrastructure actually works
- What breaks at 2am: Everything, and you’re helpless vs Specific component, and you know exactly how to fix it
I wrote a companion article showing the docker-compose version—same Zantu application, 10% of the complexity, 10% of the understanding: "Deploying Zantu with Docker Compose: When 5 Minutes Beats 2 Days"
“Why bare metal instead of managed Kubernetes?”
Control and learning. Managed services hide complexity—that’s a feature until it’s a bug and you’re helpless at 2am.
Next Steps
- Harden security: Configure NetworkPolicies, PodSecurityPolicies
- Add storage: Install Longhorn for persistent volumes
- Setup CI/CD: Deploy ArgoCD or Flux for GitOps
- Monitor costs: Set up resource quotas and limits
- Document everything: Future you will thank present you
Questions? Problems? The official RKE2 docs are actually good: https://docs.rke2.io
But remember: Documentation tells you what to do. Breaking things in production teaches you why.
Build it. Break it. Learn from it.
That’s the system.
For the thought leadership angle on why hands-on learning beats theory, see the companion article: "The Pottery Class Paradox: Why Rapid Iteration + Reflection Beats Both Quantity and Quality"