High Availability

Configure Laminar for high availability in production environments.

Overview

High availability ensures Laminar remains operational during:

Node failures - Pods are rescheduled automatically onto healthy nodes
Zone outages - Replicas are spread across multiple zones
Component failures - Redundant replicas take over
Rolling updates - Deployments roll out with zero downtime

Minimum HA Configuration

api:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
 
controller:
  replicas: 2
  leaderElection:
    enabled: true
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  persistence:
    size: 100Gi
    storageClass: gp3

API Server HA

Multiple Replicas

api:
  replicas: 3
 
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
 
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
 
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: laminar-api
            topologyKey: kubernetes.io/hostname
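
The rolling update settings above bound how many pods exist during a rollout. As a minimal sketch, assuming the chart passes these values through to a standard Kubernetes Deployment and that absolute numbers are used rather than percentages, the bounds can be worked out like this (`rollout_bounds` is a hypothetical helper, not part of Laminar):

```shell
# rollout_bounds REPLICAS MAX_UNAVAILABLE MAX_SURGE
# Prints the fewest pods that must stay available and the most pods that
# may exist at once during a RollingUpdate.
rollout_bounds() (
  replicas=$1
  max_unavailable=$2
  max_surge=$3
  echo "min_available=$((replicas - max_unavailable)) max_total=$((replicas + max_surge))"
)

# Values from the api block above: 3 replicas, maxUnavailable 0, maxSurge 1
rollout_bounds 3 0 1   # prints: min_available=3 max_total=4
```

With these values the cluster needs spare capacity for one extra API pod during every rollout, since no old pod is removed until its replacement is ready.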

Multi-Zone Distribution

api:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: laminar-api
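
With maxSkew: 1 and whenUnsatisfiable: DoNotSchedule, the scheduler leaves a pod Pending rather than place it where the per-zone pod counts would differ by more than one. The sketch below computes that difference from a list of per-zone counts; `zone_skew` is a hypothetical helper, and the counts would come from your own inspection of where the laminar-api pods landed.

```shell
# zone_skew COUNT...
# Takes the number of matching pods in each zone and prints the difference
# between the most and least populated zones.
zone_skew() (
  min=$1
  max=$1
  for n in "$@"; do
    [ "$n" -lt "$min" ] && min=$n
    [ "$n" -gt "$max" ] && max=$n
  done
  echo $((max - min))
)

zone_skew 1 1 1   # prints 0: three replicas, one per zone
zone_skew 2 1 1   # prints 1: still within maxSkew 1
zone_skew 2 2 0   # prints 2: a placement like this would be rejected
```

Only zones that have eligible nodes count toward the skew, so list just those zones when checking a layout.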

Controller HA

The controller uses leader election to ensure only one active instance processes pipelines while others stand by.

controller:
  replicas: 2
 
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s
 
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
 
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: laminar-controller
          topologyKey: kubernetes.io/hostname
 
  # RocksDB storage
  persistence:
    enabled: true
    size: 100Gi
    storageClass: gp3
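
The three leader election timings interact, so it can help to sanity-check them before deploying. The sketch below assumes the controller follows the same rules as the Kubernetes client-go leader election library (leaseDuration greater than renewDeadline, and renewDeadline greater than 1.2 times retryPeriod); whether Laminar enforces exactly these rules is an assumption. `check_lease` is a hypothetical helper, and the failover figure is a rough estimate, not a guarantee.

```shell
# check_lease LEASE_DURATION RENEW_DEADLINE RETRY_PERIOD   (e.g. 15s 10s 2s)
# Accepts whole-second values with an optional trailing "s".
check_lease() (
  lease=${1%s}
  renew=${2%s}
  retry=${3%s}
  # Rule 1: the lease must outlive the renew deadline.
  [ "$lease" -gt "$renew" ] || { echo "invalid: leaseDuration must exceed renewDeadline"; exit 1; }
  # Rule 2: renewDeadline > retryPeriod * 1.2 (jitter factor), in integer math.
  [ $((renew * 10)) -gt $((retry * 12)) ] || { echo "invalid: renewDeadline must exceed 1.2 x retryPeriod"; exit 1; }
  # A standby retries every retryPeriod, so it should acquire the lease
  # roughly one retry after the old lease expires.
  echo "ok: standby takes over within about $((lease + retry))s of leader loss"
)

check_lease 15s 10s 2s   # prints: ok: standby takes over within about 17s of leader loss
```

Shorter timings reduce failover time but make the leader more sensitive to brief API server slowness.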

RocksDB Storage HA

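RocksDB is an embedded database that lives on the controller's persistent volume rather than running as a replicated service, so HA for it focuses on data durability: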

Fast Storage

Use high-performance SSDs for RocksDB:

controller:
  persistence:
    storageClass: gp3  # AWS
    # storageClass: premium-rwo  # GCP
    # storageClass: managed-csi-premium  # Azure
    size: 100Gi

Backup to Object Storage

Configure checkpoint storage for recovery:

storage:
  checkpoints:
    url: "s3://my-bucket/laminar/checkpoints"
    interval: "30s"

Load Balancing

Kubernetes Service

api:
  service:
    type: ClusterIP
    sessionAffinity: None

Ingress Health Checks

api:
  ingress:
    enabled: true
    annotations:
      # AWS ALB
      alb.ingress.kubernetes.io/healthcheck-path: /health
      alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
      # NGINX: consistent-hash routing by client IP (not a health check)
      nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"

Health Probes

api:
  livenessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
 
  readinessProbe:
    httpGet:
      path: /ready
      port: 8000
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
 
  startupProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 30
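
Each probe tolerates roughly periodSeconds x failureThreshold of consecutive failures before Kubernetes acts on it. The sketch below applies that rule of thumb to the values above; `probe_budget` is a hypothetical helper, and the results are approximations that ignore probe timeouts and initialDelaySeconds.

```shell
# probe_budget PERIOD_SECONDS FAILURE_THRESHOLD
# Prints the approximate number of seconds a probe can keep failing
# before Kubernetes reacts.
probe_budget() (
  echo $(( $1 * $2 ))
)

echo "startup:   $(probe_budget 10 30)s"   # startup:   300s
echo "liveness:  $(probe_budget 10 3)s"    # liveness:  30s
echo "readiness: $(probe_budget 5 3)s"     # readiness: 15s
```

Read this as: the API server gets about 300 seconds to start (after the 10 second initial delay) before the container is killed, a hung container is restarted after about 30 seconds, and an unready pod is removed from Service endpoints after about 15 seconds. Liveness and readiness checks do not begin until the startup probe has succeeded.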

Pod Disruption Budgets

Prevent voluntary disruptions from taking down too many pods:

api:
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
    # or, instead of minAvailable (a PDB accepts only one of the two):
    # maxUnavailable: 1
 
controller:
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
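
A budget with minAvailable permits a voluntary eviction only while the number of healthy pods stays at or above that floor. As a small sketch of the arithmetic (`allowed_disruptions` is a hypothetical helper):

```shell
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Prints how many pods may be evicted voluntarily right now.
allowed_disruptions() (
  n=$(( $1 - $2 ))
  [ "$n" -lt 0 ] && n=0
  echo "$n"
)

allowed_disruptions 2 1   # prints 1: one pod can be drained
allowed_disruptions 1 1   # prints 0: a drain waits until a replacement is ready
```

The same figure appears in the ALLOWED DISRUPTIONS column of kubectl get pdb -n laminar. With a single replica and minAvailable: 1 the result is always 0, which blocks node drains indefinitely, so keep at least two replicas for any component that has a budget.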

Graceful Shutdown

api:
  terminationGracePeriodSeconds: 60
 
  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - sleep 10
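
The preStop sleep delays SIGTERM so that load balancers and endpoints have time to stop routing to the pod, but it also counts against terminationGracePeriodSeconds. The sketch below shows how much time is left for the process to finish in-flight requests; `shutdown_budget` is a hypothetical helper.

```shell
# shutdown_budget GRACE_SECONDS PRESTOP_SECONDS
# Prints the seconds left between SIGTERM and SIGKILL once the preStop
# hook has finished.
shutdown_budget() (
  echo $(( $1 - $2 ))
)

shutdown_budget 60 10   # prints 50
```

If the longest expected request or pipeline handoff can exceed that remainder, raise terminationGracePeriodSeconds rather than shortening the sleep.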

Disaster Recovery

Object Storage for Checkpoints

Store checkpoints and artifacts in durable object storage:

storage:
  checkpoints:
    url: "s3://my-bucket/checkpoints"
  artifacts:
    url: "s3://my-bucket/artifacts"

Cross-Region Replication

Enable cross-region replication on the S3 or GCS buckets that hold checkpoints and artifacts, so that a regional failure does not take the recovery data down with it.

Monitoring HA Status

# Check pod distribution
kubectl get pods -n laminar -o wide
 
# Check PDB status
kubectl get pdb -n laminar
 
# Check node zones
kubectl get nodes --label-columns=topology.kubernetes.io/zone
 
# Check endpoint health
kubectl get endpoints -n laminar
 
# Check PVC status
kubectl get pvc -n laminar

HA Checklist

Next Steps

Security Hardening - Security best practices
Resource Sizing - Capacity planning
Backup & Restore - Disaster recovery