Kubernetes Best Practices for Production Deployments
Running applications in production on Kubernetes is challenging, but following proven best practices can make the difference between a reliable system and a maintenance nightmare. This guide covers the essential strategies that successful teams use to deploy and maintain production workloads.
What You'll Learn
By the end of this guide, you'll understand:
- How to design resilient deployments that handle failures gracefully
- Security practices that protect your applications from common threats
- Monitoring strategies that give you visibility into your system's health
- Scaling patterns that keep your applications performant under load
- Backup and recovery procedures that protect your data
1. Resource Management: The Foundation of Stability
Why Resource Management Matters
Resource management is the cornerstone of stable Kubernetes deployments. Without proper resource allocation, your applications can experience:
- Resource starvation when one pod consumes all available CPU/memory
- OOM kills when containers exceed memory limits
- Poor performance due to CPU throttling
- Unpredictable scaling behavior
Best Practice: Always Set Resource Limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: production-app
  template:
    metadata:
      labels:
        app: production-app
    spec:
      containers:
      - name: app
        image: production-app:v1.2.0
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

Key Points:
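- Requests: The CPU and memory the scheduler reserves for the container when placing the pod on a node
- Limits: The ceiling a container can use; going over the CPU limit causes throttling, going over the memory limit gets the container OOM-killed
- Health Checks: A failing liveness probe restarts the container; a failing readiness probe removes the pod from Service endpoints until it recovers
- Realistic Values: Base requests and limits on observed usage patterns, not guesses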
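To keep containers that omit these fields from slipping through, a namespace-level LimitRange can supply defaults. The following is a minimal sketch rather than a recommendation for specific numbers: the name `production-defaults`, the `production` namespace, and every value are assumptions chosen to mirror the Deployment above.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults   # hypothetical name
  namespace: production       # assumed namespace
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container sets no requests
      cpu: "250m"
      memory: "512Mi"
    default:                  # applied when a container sets no limits
      cpu: "500m"
      memory: "1Gi"
    max:                      # assumed upper bound for any single container
      cpu: "2"
      memory: "4Gi"
```

Defaults are applied when a pod is admitted, so pods that already exist keep whatever they were created with.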
Pro Tip: Use Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"

2. Security: Protect Your Applications
The Security-First Approach
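Kubernetes defaults are permissive: containers run as root unless the image or pod spec says otherwise, and pods can reach each other freely until a NetworkPolicy restricts them. The principle of least privilege should guide every security decision.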
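As one illustration of least privilege, an application that only needs to read configuration can run under a dedicated ServiceAccount bound to a narrowly scoped Role. This is a sketch under assumptions: the names `app-reader` and `app-config-reader`, the `production` namespace, and the choice of ConfigMaps as the only permitted resource are all hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-reader            # hypothetical name
  namespace: production       # assumed namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-config-reader     # hypothetical name
  namespace: production
rules:
- apiGroups: [""]             # "" is the core API group
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]   # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-config-reader
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-reader
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: app-config-reader
```

A pod opts in by setting `serviceAccountName: app-reader` in its spec.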
Best Practice: Run as Non-Root
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
spec:
  # selector and matching template labels are required for an apps/v1 Deployment
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
      containers:
      - name: app
        image: secure-app:latest  # pin a specific version tag in production
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: varlog
          mountPath: /var/log
        - name: app-config
          mountPath: /app/config
          readOnly: true
      volumes:
      - name: tmp
        emptyDir: {}
      - name: varlog
        emptyDir: {}
      - name: app-config
        configMap:
          name: app-config

Security Benefits:
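- Non-root execution limits the damage an attacker can do after compromising the container
- Read-only root filesystem prevents malicious file modifications; the paths that must stay writable (/tmp and /var/log) are mounted as emptyDir volumes
- Dropped capabilities remove unnecessary privileges
- ConfigMap mounting keeps configuration separate from the image; credentials belong in Secrets, not ConfigMaps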
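Settings like these can also be checked for a whole namespace with Pod Security Admission labels. The sketch below assumes a namespace named `production` and a cluster where the built-in Pod Security Admission controller is enabled. It enforces the `baseline` profile and only warns and audits against `restricted`, because the stricter profile additionally expects fields such as a seccomp profile that the Deployment above does not set.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production            # assumed namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline    # reject pods that violate baseline
    pod-security.kubernetes.io/warn: restricted     # warn on apply for restricted violations
    pod-security.kubernetes.io/audit: restricted    # record restricted violations in the audit log
```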
Pro Tip: Use Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-network-policy
spec:
  podSelector:
    matchLabels:
      app: production-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # the frontend namespace must carry the label name=frontend for this selector to match
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    # the database namespace must carry the label name=database
    - namespaceSelector:
        matchLabels:
          name: database
    ports:
    - protocol: TCP
      port: 5432
  # allow DNS lookups, which the Egress policy type would otherwise block
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

3. Monitoring and Observability: Know Your System
The Three Pillars of Observability
- Metrics: Quantitative data about your system's performance
- Logs: Detailed records of events and errors
- Traces: Request flow through your distributed system
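For the metrics pillar, a common convention is for pods to opt in to scraping through annotations. The fragment below is a sketch of pod template metadata, assuming the application serves metrics at `/metrics` on port 8080 (both are assumptions) and that Prometheus uses annotation-based relabeling like the configuration that follows.

```yaml
# Fragment of a Deployment's spec.template (not a complete manifest)
metadata:
  labels:
    app: production-app
  annotations:
    prometheus.io/scrape: "true"     # matched by the "keep" relabel rule
    prometheus.io/path: "/metrics"   # assumed metrics path
    prometheus.io/port: "8080"       # assumed metrics port
```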
Best Practice: Comprehensive Monitoring Setup
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      - "alert_rules.yml"

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

Essential Metrics to Monitor
# Example alerting rules
groups:
- name: kubernetes.rules
  rules:
  - alert: HighCPUUsage
    # container_cpu_usage_seconds_total is a counter, so alert on its per-second rate (cores in use)
    expr: rate(container_cpu_usage_seconds_total{container!=""}[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "Container {{ $labels.container }} is using {{ $value }} CPU cores"

  - alert: HighMemoryUsage
    # the "> 0" filter skips containers without a memory limit, whose limit metric is 0
    expr: container_memory_usage_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Container {{ $labels.container }} is using {{ $value | humanizePercentage }} memory"

4. Scaling Strategies: Handle Traffic Spikes
Horizontal vs Vertical Scaling
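Horizontal scaling (adding more pods) is generally preferred in Kubernetes: losing one replica removes only a fraction of capacity, and new pods can be added during a traffic spike without touching the ones already serving requests. Vertical scaling (giving each pod more CPU or memory) typically means recreating the pod with new resource values.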
Best Practice: Implement HPA with Multiple Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: production-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Object metrics need a custom metrics adapter that exposes requests-per-second for the Ingress
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: production-app-ingress
      target:
        type: Value
        value: 1000
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

Key Features:
- Multiple metrics: CPU, memory, and custom metrics
- Stabilization windows: Prevent rapid scaling oscillations
- Conservative scale-down: Avoid scaling down too aggressively
- Aggressive scale-up: Respond quickly to traffic spikes
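For each metric, the autoscaler derives a replica count from the ratio of the observed value to the target, and when several metrics are configured it uses the largest result. The worked example below uses made-up numbers (4 replicas averaging 90% CPU against the 70% target) purely for illustration.

```latex
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\qquad\Rightarrow\qquad
\left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
```

The result is then kept within minReplicas and maxReplicas, and the behavior policies cap how fast the replica count may change in each period.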
Pro Tip: Use VPA for Vertical Scaling
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: production-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  updatePolicy:
    # "Off" only publishes recommendations. Use "Auto" for automatic updates,
    # but not while an HPA scales the same workload on CPU or memory.
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 1
        memory: 500Mi
      controlledValues: RequestsAndLimits

5. Backup and Disaster Recovery: Protect Your Data
The 3-2-1 Backup Rule
- 3 copies of your data
- 2 different storage types
- 1 off-site backup
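With Velero, the off-site copy can be expressed as a BackupStorageLocation that points at object storage outside the cluster's own region or provider. This is a minimal sketch, assuming the AWS object store plugin is installed and Velero runs in the `velero` namespace. The bucket name and region are placeholders, and the name `production-backup` is chosen to match the storage location referenced by the schedule that follows.

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: production-backup
  namespace: velero                   # assumed Velero install namespace
spec:
  provider: aws                       # assumes the AWS object store plugin
  objectStorage:
    bucket: example-offsite-backups   # placeholder bucket name
  config:
    region: eu-west-1                 # placeholder; pick a region away from the cluster
```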
Best Practice: Automated Backup Strategy
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily-backup
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  template:
    includedNamespaces:
    - production
    includedResources:
    - deployments
    - services
    - configmaps
    - secrets
    - persistentvolumeclaims
    - persistentvolumes
    storageLocation: production-backup
    volumeSnapshotLocations:
    - production-snapshot
    ttl: 720h # Keep backups for 30 days

Backup Verification Strategy
# A Schedule only creates backups. Restoring this weekly backup of the
# backup-test namespace is what confirms the backups are usable.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: backup-verification
spec:
  schedule: "0 4 * * 0" # Weekly on Sunday at 4 AM
  template:
    includedNamespaces:
    - backup-test
    includedResources:
    - deployments
    - services
    - configmaps
    - secrets
    - persistentvolumeclaims
    storageLocation: production-backup
    volumeSnapshotLocations:
    - production-snapshot
    ttl: 24h

6. Deployment Strategies: Zero-Downtime Updates
Rolling Updates with Health Checks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  # selector and matching template labels are required for an apps/v1 Deployment
  selector:
    matchLabels:
      app: production-app
  template:
    metadata:
      labels:
        app: production-app
    spec:
      containers:
      - name: app
        image: production-app:v1.2.0
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

Benefits:
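- Zero downtime: With maxUnavailable: 0 and maxSurge: 1, an old pod is removed only after its replacement passes the readiness probe
- Rollback capability: The Deployment keeps earlier ReplicaSets, so reverting to the previous version is straightforward
- Health verification: Only pods that pass their readiness probe receive traffic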
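Rollback depends on the Deployment retaining earlier ReplicaSets, and a stalled rollout is easier to catch when the Deployment reports it. The fragment below is a sketch of the related fields; the `minReadySeconds` value is an assumption, while the other two values are the Kubernetes defaults written out explicitly.

```yaml
# Fragment of the Deployment spec (not a complete manifest)
spec:
  revisionHistoryLimit: 10      # old ReplicaSets kept for rollback (Kubernetes default)
  minReadySeconds: 10           # assumed value: a new pod must stay ready this long to count as available
  progressDeadlineSeconds: 600  # rollout is marked as not progressing after this long (Kubernetes default)
```

With history retained, `kubectl rollout undo deployment/production-app` reverts to the previous revision. Exceeding the progress deadline only updates the Deployment's status; it does not trigger a rollback on its own.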
Key Takeaways
Immediate Actions You Can Take
- Set resource limits on all your containers today
- Implement health checks for every application
- Run containers as non-root users
- Set up basic monitoring with Prometheus
- Configure HPA for your critical applications
- Implement automated backups with Velero
Long-term Strategy
- Gradually implement security policies
- Build comprehensive monitoring dashboards
- Test disaster recovery procedures regularly
- Optimize resource usage based on monitoring data
- Automate everything possible
Most importantly, keep it simple! Overcomplicating your infrastructure will result in unimaginable growing pains.
Remember: Production Kubernetes is a journey, not a destination. Start with these fundamentals and continuously improve based on your specific needs and challenges.