High Availability
Configure Laminar for high availability in production environments.
Overview
High availability ensures Laminar remains operational during:
- Node failures - Pods automatically reschedule
- Zone outages - Multi-zone deployment
- Component failures - Redundant replicas
- Rolling updates - Zero-downtime deployments
Minimum HA Configuration
    api:
      replicas: 2
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
      podDisruptionBudget:
        enabled: true
        minAvailable: 1
    controller:
      replicas: 2
      leaderElection:
        enabled: true
      podDisruptionBudget:
        enabled: true
        minAvailable: 1
      persistence:
        size: 100Gi
        storageClass: gp3
API Server HA
Multiple Replicas
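As a rough guide to how the replica count and the disruption budget in this subsection interact: Kubernetes permits a voluntary eviction only while the number of healthy pods stays at or above minAvailable. The helper below is a minimal sketch of that arithmetic; the name allowed_disruptions is illustrative and is not part of Laminar or kubectl.

```shell
# Voluntary disruptions a PodDisruptionBudget permits:
# healthy pods minus minAvailable, never below zero.
allowed_disruptions() {
  healthy=$1
  min_available=$2
  allowed=$((healthy - min_available))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # three replicas, minAvailable: 2 -> prints 1
allowed_disruptions 2 1   # two replicas, minAvailable: 1 -> prints 1
```

Setting minAvailable equal to the replica count yields 0, which blocks node drains until the budget is relaxed.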
    api:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
          maxSurge: 1
      podDisruptionBudget:
        enabled: true
        minAvailable: 2
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: laminar-api
                topologyKey: kubernetes.io/hostname
Multi-Zone Distribution
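To make maxSkew concrete: for a zone constraint, skew is the difference between the number of matching pods in the fullest zone and in the emptiest zone. With whenUnsatisfiable: DoNotSchedule the scheduler refuses any placement that would push that difference above maxSkew. The sketch below computes the skew from per-zone pod counts; zone_skew is a hypothetical helper, and the counts are made-up inputs rather than output from a real cluster.

```shell
# Print the skew (largest count minus smallest count) for the given
# per-zone pod counts, one argument per zone.
zone_skew() {
  max=$1
  min=$1
  for count in "$@"; do
    if [ "$count" -gt "$max" ]; then max=$count; fi
    if [ "$count" -lt "$min" ]; then min=$count; fi
  done
  echo $((max - min))
}

zone_skew 1 1 1   # prints 0: one pod per zone
zone_skew 2 1 0   # prints 2: would violate maxSkew: 1
```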
    api:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: laminar-api
Controller HA
The controller uses leader election to ensure only one active instance processes pipelines while others stand by.
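The timing fields in the controller configuration (leaseDuration, renewDeadline, retryPeriod) share their names with the leader-election settings in Kubernetes client-go. Assuming Laminar follows the same semantics, which this page does not confirm, leaseDuration must exceed renewDeadline, renewDeadline must exceed 1.2 times retryPeriod, and a standby takes over roughly leaseDuration plus retryPeriod after the leader disappears. The sketch below checks those rules in whole seconds; leader_election_check is a hypothetical helper.

```shell
# Validate leader-election timings (whole seconds) and print an
# approximate worst-case takeover time after the leader is lost.
# Assumes client-go style semantics; not confirmed for Laminar.
leader_election_check() {
  lease=$1
  renew=$2
  retry=$3
  if [ "$lease" -le "$renew" ]; then
    echo "invalid: leaseDuration must be greater than renewDeadline"
    return 1
  fi
  # renewDeadline > 1.2 * retryPeriod, scaled by 10 to stay in integers
  if [ $((renew * 10)) -le $((retry * 12)) ]; then
    echo "invalid: renewDeadline must be greater than 1.2 x retryPeriod"
    return 1
  fi
  echo "ok: takeover within about $((lease + retry))s"
}

leader_election_check 15 10 2   # prints "ok: takeover within about 17s"
```

Shorter timings speed up failover but make the lease more sensitive to brief API server or network hiccups.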
    controller:
      replicas: 2
      leaderElection:
        enabled: true
        leaseDuration: 15s
        renewDeadline: 10s
        retryPeriod: 2s
      podDisruptionBudget:
        enabled: true
        minAvailable: 1
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: laminar-controller
              topologyKey: kubernetes.io/hostname
      # RocksDB storage
      persistence:
        enabled: true
        size: 100Gi
        storageClass: gp3
RocksDB Storage HA
RocksDB is an embedded database, so there is no separate database cluster to replicate. HA for RocksDB therefore focuses on data durability:
Fast Storage
Use high-performance SSDs for RocksDB:
    controller:
      persistence:
        storageClass: gp3  # AWS
        # storageClass: premium-rwo  # GCP
        # storageClass: managed-csi-premium  # Azure
        size: 100Gi
Backup to Object Storage
Configure checkpoint storage for recovery:
    storage:
      checkpoints:
        url: "s3://my-bucket/laminar/checkpoints"
        interval: "30s"
Load Balancing
Kubernetes Service
    api:
      service:
        type: ClusterIP
        sessionAffinity: None
Ingress Health Checks
    api:
      ingress:
        enabled: true
        annotations:
          # AWS ALB
          alb.ingress.kubernetes.io/healthcheck-path: /health
          alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
          # NGINX
          nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
Health Probes
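The probe values in this section translate into time windows that are worth working out before tuning them. A startup probe allows roughly initialDelaySeconds plus failureThreshold times periodSeconds for the container to come up, and a liveness or readiness probe reacts after about failureThreshold times periodSeconds of consecutive failures. The sketch below does that arithmetic; the function names are illustrative.

```shell
# Approximate time a startup probe allows before the container is
# restarted. Arguments: initialDelaySeconds periodSeconds failureThreshold
startup_window() {
  echo $(($1 + $2 * $3))
}

# Approximate time until consecutive failures trip a liveness or
# readiness probe. Arguments: periodSeconds failureThreshold
failure_window() {
  echo $(($1 * $2))
}

startup_window 10 10 30   # startupProbe: prints 310
failure_window 10 3       # livenessProbe: prints 30
failure_window 5 3        # readinessProbe: prints 15
```

With these values a hung API pod is restarted after about 30 seconds and marked unready after about 15, while a slow first start gets a little over five minutes.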
    api:
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 30
Pod Disruption Budgets
Prevent voluntary disruptions from taking down too many pods:
    api:
      podDisruptionBudget:
        enabled: true
        minAvailable: 1
        # or
        # maxUnavailable: 1
    controller:
      podDisruptionBudget:
        enabled: true
        minAvailable: 1
Graceful Shutdown
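Kubernetes counts the preStop hook against terminationGracePeriodSeconds, so the time left for the API process to finish in-flight requests after it receives SIGTERM is the grace period minus the hook's duration. The sketch below works that out for the values used in this section; drain_budget is a hypothetical helper.

```shell
# Seconds left for the process to shut down once the preStop hook
# has finished. Arguments: terminationGracePeriodSeconds preStopSeconds
drain_budget() {
  grace=$1
  pre_stop=$2
  if [ "$pre_stop" -ge "$grace" ]; then
    echo 0   # the hook consumes the whole grace period
  else
    echo $((grace - pre_stop))
  fi
}

drain_budget 60 10   # prints 50
```

The sleep itself typically exists to give load balancers and endpoint updates time to stop routing traffic to the pod before shutdown begins.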
    api:
      terminationGracePeriodSeconds: 60
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - sleep 10
Disaster Recovery
Object Storage for Checkpoints
Store checkpoints in durable object storage:
    storage:
      checkpoints:
        url: "s3://my-bucket/checkpoints"
      artifacts:
        url: "s3://my-bucket/artifacts"
Cross-Region Replication
Enable cross-region replication for S3/GCS buckets to protect against regional failures.
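For S3, one way to confirm that replication is actually switched on is to read the bucket's replication rules with the AWS CLI. The wrapper below is a minimal sketch: replication_status is a hypothetical helper name, and it assumes the AWS CLI is installed with credentials allowed to read the bucket's replication configuration. GCS is not covered here.

```shell
# Print the status (Enabled or Disabled) of each replication rule on
# an S3 bucket. Returns non-zero if the bucket has no replication
# configuration or the call fails.
replication_status() {
  aws s3api get-bucket-replication \
    --bucket "$1" \
    --query 'ReplicationConfiguration.Rules[].Status' \
    --output text
}
```

Calling it as replication_status my-bucket, with the placeholder bucket name from the storage examples, should print Enabled for every active rule.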
Monitoring HA Status
    # Check pod distribution
    kubectl get pods -n laminar -o wide
    # Check PDB status
    kubectl get pdb -n laminar
    # Check node zones
    kubectl get nodes --label-columns=topology.kubernetes.io/zone
    # Check endpoint health
    kubectl get endpoints -n laminar
    # Check PVC status
    kubectl get pvc -n laminar
HA Checklist
- Multiple API replicas (minimum 2)
- Multiple controller replicas with leader election
- Pod anti-affinity configured
- Multi-zone distribution enabled
- Pod disruption budgets configured
- Fast SSD storage for RocksDB
- Checkpoints stored in durable object storage
- Health probes configured
- Graceful shutdown configured
- Monitoring and alerting enabled
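The anti-affinity item in this checklist can be spot-checked by counting pods per node. The function below is a sketch that assumes the laminar namespace and the app labels used in the examples on this page; pods_per_node is a hypothetical helper.

```shell
# Print "<node> <pod count>" for pods matching a label selector in
# the laminar namespace, one line per node.
pods_per_node() {
  kubectl get pods -n laminar -l "$1" \
    -o custom-columns=NODE:.spec.nodeName --no-headers |
    sort | uniq -c | awk '{print $2, $1}'
}
```

For example, pods_per_node app=laminar-api should ideally list every node with a count of 1; a higher count means two replicas share a node.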
Next Steps
- Security Hardening - Security best practices
- Resource Sizing - Capacity planning
- Backup & Restore - Disaster recovery