High Availability

Configure Laminar for high availability in production environments.

Overview

High availability ensures Laminar remains operational during:

  • Node failures - pods reschedule automatically to healthy nodes
  • Zone outages - replicas are spread across availability zones
  • Component failures - redundant replicas take over
  • Rolling updates - deployments proceed with zero downtime

Minimum HA Configuration

api:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
 
controller:
  replicas: 2
  leaderElection:
    enabled: true
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  persistence:
    size: 100Gi
    storageClass: gp3

API Server HA

Multiple Replicas

api:
  replicas: 3
 
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
 
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
 
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: laminar-api
            topologyKey: kubernetes.io/hostname

Multi-Zone Distribution

api:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: laminar-api

Controller HA

The controller uses leader election: one replica actively processes pipelines while the others stand by as hot spares. If the leader fails, a standby acquires the lease and takes over within roughly the configured leaseDuration.

controller:
  replicas: 2
 
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s
 
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
 
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: laminar-controller
          topologyKey: kubernetes.io/hostname
 
  # RocksDB storage
  persistence:
    enabled: true
    size: 100Gi
    storageClass: gp3
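Assuming the controller uses the standard Kubernetes Lease mechanism for leader election (the usual implementation in controller-runtime-based operators), you can check which replica currently holds leadership:

```shell
# List leader-election leases; HOLDER shows the active controller pod
kubectl get lease -n laminar

# Inspect the holder identity and renew time of a specific lease
kubectl describe lease <lease-name> -n laminar
```

The holder identity typically embeds the pod name, so you can confirm failover behavior by deleting the leader pod and watching the lease change hands.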

RocksDB Storage HA

RocksDB is an embedded database and cannot be replicated across pods, so HA focuses on data durability and fast recovery:

Fast Storage

Use high-performance SSDs for RocksDB:

controller:
  persistence:
    storageClass: gp3  # AWS
    # storageClass: premium-rwo  # GCP
    # storageClass: managed-csi-premium  # Azure
    size: 100Gi
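If your cluster does not already define the gp3 class, a minimal StorageClass sketch for the AWS EBS CSI driver looks like this (the name and parameters are illustrative; adjust for your CSI driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer  # bind the volume in the pod's zone
allowVolumeExpansion: true               # allow growing the RocksDB volume later
```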

Backup to Object Storage

Configure checkpoint storage for recovery:

storage:
  checkpoints:
    url: "s3://my-bucket/laminar/checkpoints"
    interval: "30s"
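To confirm checkpoints are actually landing in the bucket (assuming the AWS CLI is configured with read access to it):

```shell
# Recent checkpoint objects should appear at roughly the configured interval
aws s3 ls s3://my-bucket/laminar/checkpoints/ --recursive | tail
```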

Load Balancing

Kubernetes Service

api:
  service:
    type: ClusterIP
    sessionAffinity: None

Ingress Health Checks

api:
  ingress:
    enabled: true
    annotations:
      # AWS ALB
      alb.ingress.kubernetes.io/healthcheck-path: /health
      alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
      # NGINX (open-source ingress-nginx performs no active health checks;
      # it relies on readiness probes. This annotation pins clients to backends)
      nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"

Health Probes

Readiness gates traffic, liveness restarts hung containers, and the startup probe (30 failures x 10s = up to 5 minutes) tolerates slow starts before liveness takes over:

api:
  livenessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
 
  readinessProbe:
    httpGet:
      path: /ready
      port: 8000
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
 
  startupProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 30

Pod Disruption Budgets

Prevent voluntary disruptions (node drains, cluster upgrades) from evicting too many pods at once. Keep minAvailable below the replica count, or evictions will be blocked entirely:

api:
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
    # or
    # maxUnavailable: 1
 
controller:
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

Graceful Shutdown

Give in-flight requests time to complete. The preStop sleep delays SIGTERM until load balancers have observed the pod leaving the endpoints list:

api:
  terminationGracePeriodSeconds: 60
 
  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - sleep 10

Disaster Recovery

Object Storage for Checkpoints

Store checkpoints in durable object storage:

storage:
  checkpoints:
    url: "s3://my-bucket/checkpoints"
  artifacts:
    url: "s3://my-bucket/artifacts"

Cross-Region Replication

Enable cross-region replication for S3/GCS buckets to protect against regional failures.
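For S3, a minimal replication sketch looks like the following (the bucket names and IAM role ARN are placeholders, and versioning must be enabled on both buckets first):

```shell
# Enable versioning on source and destination buckets (a replication prerequisite)
aws s3api put-bucket-versioning --bucket my-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket my-bucket-dr \
  --versioning-configuration Status=Enabled

# Replicate the Laminar prefix to the DR bucket
aws s3api put-bucket-replication --bucket my-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "laminar/",
      "Destination": {"Bucket": "arn:aws:s3:::my-bucket-dr"}
    }]
  }'
```

GCS offers equivalent protection through dual-region or multi-region buckets, or scheduled Storage Transfer Service jobs.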


Monitoring HA Status

# Check pod distribution
kubectl get pods -n laminar -o wide
 
# Check PDB status
kubectl get pdb -n laminar
 
# Check node zones
kubectl get nodes --label-columns=topology.kubernetes.io/zone
 
# Check endpoint health
kubectl get endpoints -n laminar
 
# Check PVC status
kubectl get pvc -n laminar

HA Checklist

  • Multiple API replicas (minimum 2)
  • Multiple controller replicas with leader election
  • Pod anti-affinity configured
  • Multi-zone distribution enabled
  • Pod disruption budgets configured
  • Fast SSD storage for RocksDB
  • Checkpoints stored in durable object storage
  • Health probes configured
  • Graceful shutdown configured
  • Monitoring and alerting enabled

Next Steps