Kubernetes & Cloud Security Hardening Guide
A practical reference for engineering teams securing production Kubernetes clusters and cloud infrastructure. Covers the most common misconfigurations found in real assessments — with specific steps to fix them.
1. Kubernetes Hardening
1.1 RBAC — Least Privilege Access
Role-Based Access Control is your first line of defense inside a cluster. Most misconfigurations come from over-permissive bindings created during initial setup and never reviewed.
Most common finding: ClusterRoleBindings with cluster-admin attached to service accounts used by applications. This gives the workload full control over the entire cluster.
- Audit all ClusterRoleBindings: run `kubectl get clusterrolebindings -o wide` and review anything bound to `cluster-admin`. Every binding should have a documented justification.
- Use namespace-scoped Roles instead of ClusterRoles wherever the workload doesn't need cluster-wide access.
- Avoid wildcard permissions: `verbs: ["*"]` or `resources: ["*"]` in Role specs should be treated as a finding.
- Never bind service accounts to `cluster-admin` unless you have a documented and reviewed reason. Operators (like ArgoCD) often need elevated access; scope it to what they actually use.
- Disable automounting of service account tokens for pods that don't need Kubernetes API access: `automountServiceAccountToken: false`
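As a rough sketch of the namespace-scoped alternative to a `cluster-admin` binding, the manifests below give one workload read-only access to ConfigMaps in its own namespace and turn off token automounting for a second workload that never calls the API. All names (`app-team`, `orders-api`, `report-worker`, `configmap-reader`) are hypothetical placeholders, not taken from any real cluster.

```yaml
# Hypothetical names throughout; adjust namespace, accounts, and resources.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  namespace: app-team
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                        # namespace-scoped, not a ClusterRole
metadata:
  name: configmap-reader
  namespace: app-team
rules:
  - apiGroups: [""]               # "" is the core API group
    resources: ["configmaps"]     # explicit resources, no wildcards
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: orders-api-configmap-reader
  namespace: app-team
subjects:
  - kind: ServiceAccount
    name: orders-api
    namespace: app-team
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
---
# A workload that never talks to the Kubernetes API gets no token at all.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: report-worker
  namespace: app-team
automountServiceAccountToken: false
```

The same `automountServiceAccountToken: false` field can also be set on an individual Pod spec, which takes precedence over the ServiceAccount setting.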
1.2 Pod Security Standards
Kubernetes replaced PodSecurityPolicy (removed in 1.25) with Pod Security Standards (PSS) enforced via the Pod Security Admission controller. Many clusters still have no policy enforced.
| PSS Level | What it prevents | Recommended for |
|---|---|---|
| Restricted | Privileged containers, host namespaces, unsafe sysctls, root users | All application workloads |
| Baseline | Most known privilege escalation paths | Operators and system components |
| Privileged | Nothing — unrestricted | Only trusted system namespaces (kube-system) |
- Label namespaces with `pod-security.kubernetes.io/enforce: restricted` for application workloads.
- Use the `warn` and `audit` modes first (for example `pod-security.kubernetes.io/warn: restricted`) to identify workloads that need remediation before switching to `enforce`.
- Run containers as non-root: `securityContext.runAsNonRoot: true` and set an explicit `runAsUser`.
- Drop all capabilities and add back only what's needed: `capabilities: { drop: ["ALL"], add: ["NET_BIND_SERVICE"] }`
- Set `readOnlyRootFilesystem: true` wherever the app supports it.
- Disable privilege escalation: `allowPrivilegeEscalation: false`
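Put together, the settings above might look like the following sketch: a namespace that reports `restricted` violations through the `warn` and `audit` labels without blocking anything yet, and a pod spec intended to pass the `restricted` level. The namespace, pod name, image, and UID are hypothetical. The `seccompProfile` field is not in the list above, but the `restricted` level also requires it.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app-team                  # hypothetical namespace
  labels:
    # Report violations without rejecting pods while workloads are fixed.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    # Add once the warnings are clean:
    # pod-security.kubernetes.io/enforce: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: orders-api                # hypothetical workload
  namespace: app-team
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001              # any explicit non-zero UID; 10001 is an example
    seccompProfile:
      type: RuntimeDefault        # required by the restricted level
  containers:
    - name: app
      image: registry.example.com/orders-api:1.4.2   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]   # only if the app binds a port below 1024
```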
1.3 Network Policies
By default, all pods in a Kubernetes cluster can communicate with all other pods — including across namespaces. Without NetworkPolicies, a compromised pod can reach your database, your secrets store, and your other services without restriction.
Baseline rule: Every namespace should have a default-deny-all ingress and egress policy. Then explicitly open only what's needed.
- Start with a default deny-all policy in every namespace that hosts workloads.
- Allow only required ingress — e.g., only the frontend can reach the backend, only the backend can reach the database.
- Restrict egress: workloads should not be able to reach arbitrary external IPs unless required.
- Use namespace selectors in addition to pod selectors to prevent cross-namespace lateral movement.
- If using Cilium, leverage CiliumNetworkPolicy for L7-aware policies (HTTP method, path, DNS).
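A minimal sketch of the default-deny pattern, assuming a hypothetical namespace `app-team`, pods labelled `app: frontend` and `app: backend`, a backend port of 8080, and cluster DNS running in `kube-system`. The DNS rule is included because a default-deny egress policy also blocks name resolution unless it is explicitly allowed.

```yaml
# 1. Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app-team
spec:
  podSelector: {}                 # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
---
# 2. Allow only the frontend to reach the backend on its service port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: app-team
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080              # assumed backend port
---
# 3. Allow DNS lookups to kube-system (assumes cluster DNS runs there).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: app-team
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Because egress is denied by default here, the frontend would also need a matching egress rule toward the backend. These policies only take effect if the cluster's CNI plugin enforces NetworkPolicy.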
1.4 Secrets Management
Kubernetes Secrets are base64-encoded, not encrypted — and are often stored in etcd unencrypted. Hardcoded secrets in manifests committed to Git are one of the most common findings in assessments.
- Never commit secrets to Git. Use a secrets management system — Doppler, Vault, AWS Secrets Manager, or GCP Secret Manager — and inject at runtime.
- Enable encryption at rest for etcd. Configure `EncryptionConfiguration` with the AES-GCM or KMS provider.
- Use the External Secrets Operator (ESO) to sync secrets from your external store into Kubernetes Secrets automatically.
- Restrict access to Secrets via RBAC: `get`, `list`, and `watch` on Secrets are often granted too broadly.
- Audit your Git history for accidentally committed secrets. Tools like `trufflehog` and `gitleaks` can scan for them.
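For self-managed control planes, encryption at rest is configured through a file passed to the API server with `--encryption-provider-config`. The sketch below uses the `aesgcm` provider with a placeholder key. The key name `key1` and the key value are assumptions you would replace, and managed services such as EKS or GKE expose this through their own settings instead.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # The first provider in the list is used to encrypt new writes.
      - aesgcm:
          keys:
            - name: key1
              # Placeholder: a base64-encoded random key that you generate
              # and rotate yourself. Never commit the real value.
              secret: <BASE64_ENCODED_32_BYTE_KEY>
      # identity lets the API server still read Secrets that were
      # written before encryption was turned on.
      - identity: {}
```

Secrets that already exist stay unencrypted in etcd until they are rewritten, so a one-time re-save of all Secrets is usually part of the rollout. With a locally stored key, anyone who can read the control-plane host can read the key as well, which is the main argument for a KMS provider.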
1.5 API Server Hardening
- Disable anonymous authentication: `--anonymous-auth=false`
- Enable audit logging: configure `--audit-log-path` and a policy file (`--audit-policy-file`) that captures reads and writes to sensitive resources.
- Restrict access to the API server by network. It should not be publicly reachable from the internet.
- Use `--authorization-mode=Node,RBAC`, never `AlwaysAllow`.
- Disable the insecure port on older clusters: `--insecure-port=0`. The flag was removed in Kubernetes 1.24, where the insecure port no longer exists.
- Set `--profiling=false` to disable API server profiling endpoints.
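An audit policy along the following lines is one way to capture access to sensitive resources without writing Secret contents into the log. It is a starting sketch, not a complete policy; which resources count as sensitive is an assumption to adapt. The file is passed to the API server with `--audit-policy-file`.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage to avoid duplicate entries per request.
omitStages:
  - "RequestReceived"
rules:
  # Rules are evaluated in order; the first match wins.
  # Secrets and ConfigMaps: record who touched what, but never the payload.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # RBAC changes: record full request and response bodies.
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Everything else: metadata only.
  - level: Metadata
```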
1.6 Image Security
- Use specific image tags or SHA digests, never `:latest`. Floating tags can pull different code between deployments.
- Scan images for vulnerabilities in CI before deployment. Trivy and Grype are both free and effective.
- Use a private registry with access controls — don't pull directly from public registries in production.
- Implement an admission controller (OPA/Gatekeeper or Kyverno) to enforce image policies — e.g., only images from your trusted registry are allowed.
- Keep base images minimal. Distroless or Alpine-based images have a significantly smaller attack surface than Ubuntu or Debian full images.
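As an illustration of pinning, the Deployment sketch below pulls from a private registry and references the image by digest. The registry host, image name, pull secret name, and `<digest>` value are all placeholders. The tag is kept only for readability, since the digest is what determines which image is pulled.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                # hypothetical workload
  namespace: app-team
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      imagePullSecrets:
        - name: registry-credentials    # assumed pull secret for the private registry
      containers:
        - name: app
          # <digest> is a placeholder for the sha256 value reported by your
          # registry or CI pipeline; it is intentionally not filled in here.
          image: registry.example.com/orders-api:1.4.2@sha256:<digest>
```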
2. Cloud Configuration
2.1 IAM — Least Privilege
Overly permissive IAM is the single most common cloud finding. The combination of `AdministratorAccess` attached to service accounts and long-lived access keys is a recurring root cause of cloud breaches.
Key principle: Every identity (human or service) should have only the permissions it needs to do its job — nothing more. Review this quarterly.
- Eliminate long-lived access keys for services running in cloud environments — use IAM roles for EC2/GKE Workload Identity/Azure Managed Identity instead.
- Never use root/owner accounts for day-to-day operations. Create a separate admin account with MFA enforced.
- Enable MFA for all human IAM users, especially those with write or admin access.
- Audit and remove unused IAM users, roles, and access keys. AWS IAM Access Analyzer and GCP Policy Analyzer can help.
- Use permissions boundaries in AWS to cap the maximum permissions a user or role can have, even if an attached policy is misconfigured.
- Prefer attached service accounts or Workload Identity over service account keys in GCP. Keys can be exported and exfiltrated; credentials issued to an attached identity are short-lived and never stored as a file.
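To make the "role instead of access key" idea concrete, here is a small sketch written as an AWS CloudFormation template. The choice of CloudFormation, the role name, and the bucket name are assumptions for illustration. The role can be assumed only by EC2 and can only read objects from one bucket.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  OrdersApiRole:                  # hypothetical role name
    Type: AWS::IAM::Role
    Properties:
      # Trust policy: only the EC2 service may assume this role,
      # so no long-lived access key is ever issued.
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: read-orders-bucket
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # One action on one resource, instead of AdministratorAccess.
              - Effect: Allow
                Action:
                  - s3:GetObject
                Resource: arn:aws:s3:::example-orders-bucket/*
```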
2.2 Storage Security
Exposed S3 buckets and GCS buckets have been responsible for some of the largest data breaches on record. "Public" access is often set unintentionally during development and never removed.
- Block all public access at the account level in AWS (S3 Block Public Access settings). Enable this at the organization level in AWS Organizations.
- Enable bucket versioning and object lock for buckets containing critical data.
- Enforce server-side encryption for all buckets — at minimum SSE-S3, preferably SSE-KMS with a customer-managed key.
- Review bucket policies and ACLs regularly. Any policy that contains `"Principal": "*"` should be treated as a critical finding.
- Enable access logging on buckets that contain sensitive data, for example CloudTrail data events for S3 in AWS.
2.3 Network Security
- No security group or firewall rule should allow inbound access from `0.0.0.0/0` except for ports 80 and 443 on public-facing load balancers.
- SSH (port 22) and RDP (port 3389) should never be open to the internet. Use a bastion host, Session Manager (AWS), or Identity-Aware Proxy (GCP) instead.
- Enable VPC Flow Logs (available in both AWS and GCP) for all production networks.
- Use private endpoints / Private Service Connect to access cloud services without traversing the public internet.
- Segment workloads by VPC and use VPC peering or Transit Gateway with explicit route control — don't put everything in one flat network.
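The inbound rules described above could be expressed roughly as follows, again assuming AWS CloudFormation as the tooling. The resource names and the application port 8080 are hypothetical. Only the load balancer accepts traffic from `0.0.0.0/0`, and the application tier accepts traffic only from the load balancer's security group.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id       # supplied at deploy time
Resources:
  PublicLoadBalancerSG:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Public web traffic to the load balancer only
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
  AppSG:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: App tier, reachable only from the load balancer
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        # No CIDR-based rule and no port 22 or 3389; the only source
        # is the load balancer's security group.
        - IpProtocol: tcp
          FromPort: 8080          # assumed application port
          ToPort: 8080
          SourceSecurityGroupId: !GetAtt PublicLoadBalancerSG.GroupId
```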
2.4 Logging and Monitoring
You can't detect what you don't log. The most critical logs are often disabled by default and have cost implications — but the cost of not having them during an incident is much higher.
- Enable CloudTrail (AWS) or Cloud Audit Logs (GCP) for management and data events in all regions and all accounts.
- Enable CloudTrail log file validation so that tampering with delivered log files can be detected.
- Configure alerts on high-risk events: root account login, policy changes, security group modifications, large data exports.
- Enable GuardDuty (AWS) or Security Command Center (GCP) — these are the fastest wins for threat detection with minimal configuration.
- Centralize logs in an account or project that developers cannot modify — an attacker with write access to the account they compromised can delete their own trail.
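A multi-region trail with log file validation might be declared as in the sketch below, assuming AWS CloudFormation. The trail name and the example bucket ARN are hypothetical, and the destination bucket is assumed to already exist in a separate logging account with a bucket policy that lets CloudTrail write to it.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  LogBucketName:
    Type: String                  # existing bucket in the central logging account
Resources:
  AccountTrail:
    Type: AWS::CloudTrail::Trail
    Properties:
      IsLogging: true
      S3BucketName: !Ref LogBucketName
      IsMultiRegionTrail: true            # cover all regions, not just the current one
      IncludeGlobalServiceEvents: true    # IAM, STS, and other global services
      EnableLogFileValidation: true       # digest files to detect tampering
      EventSelectors:
        - ReadWriteType: All
          IncludeManagementEvents: true
          # Data events cost extra, so scope them to sensitive buckets.
          DataResources:
            - Type: AWS::S3::Object
              Values:
                - arn:aws:s3:::example-sensitive-bucket/
```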
2.5 Kubernetes on Cloud (EKS / GKE / AKS)
- Enable Workload Identity (GKE) or IRSA / Pod Identity (EKS) to give pods cloud credentials — never mount service account key files into pods.
- Use private clusters: API server endpoint not publicly accessible, nodes on private subnets.
- Enable managed node upgrades and stay within two minor versions of the current Kubernetes release. EOL Kubernetes versions stop receiving security patches.
- Enable network policy enforcement at the CNI level (Calico, Cilium, or the cloud provider's native network policy).
- Review the cloud provider's security benchmarks — GKE Security Posture, EKS Best Practices Guide, and AKS security baseline all have specific guidance for their platform.
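With IRSA on EKS or Workload Identity on GKE, the link between a pod and a cloud identity is an annotation on the Kubernetes ServiceAccount rather than a mounted key file. The two manifests below are alternatives for the two platforms, not meant to be applied together. The account ID, role name, service account name, and project ID are placeholders, and the cloud-side role or IAM binding must be set up separately.

```yaml
# EKS: IAM Roles for Service Accounts (IRSA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api                # hypothetical workload identity
  namespace: app-team
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>"
---
# GKE: Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  namespace: app-team
  annotations:
    iam.gke.io/gcp-service-account: "<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com"
```

Pods that set `serviceAccountName: orders-api` then receive short-lived cloud credentials at runtime.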
3. Compliance Readiness
3.1 SOC 2 — What engineering teams actually need to do
SOC 2 is the most common compliance requirement for B2B SaaS companies. The Trust Service Criteria (TSC) map to specific technical controls your engineering team owns.
| TSC Category | What you need | Common gap |
|---|---|---|
| Logical Access | MFA everywhere, access reviews, role-based access | Shared accounts, no offboarding process |
| Change Management | Code review process, deployment approvals, audit trail | Direct pushes to main, no PR required |
| Incident Response | Documented IR plan, on-call rotation, post-mortems | No documented plan, no runbooks |
| Monitoring | Alerting on security events, log retention 90+ days | No centralized logging, no alerts on privilege changes |
| Availability | SLOs defined, backup and recovery tested | Backups never tested for restore |
3.2 CIS Kubernetes Benchmark — Priority Controls
The CIS Kubernetes Benchmark has 100+ controls. The ones below give the most risk reduction for the effort involved and should be your first pass.
- CRITICAL: Enable RBAC and disable ABAC (`--authorization-mode` must include `RBAC`)
- CRITICAL: Disable anonymous authentication on API server and kubelet
- CRITICAL: Enable etcd encryption at rest
- HIGH: Enable audit logging with a policy that captures sensitive resource access
- HIGH: Configure NetworkPolicies in all workload namespaces
- HIGH: Enforce Pod Security Standards at the namespace level
- MED: Disable service account token auto-mounting for pods that don't need it
- MED: Set resource requests and limits on all containers
- LOW: Enable the `NodeRestriction` admission plugin
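Two of these controls are easy to express as configuration. The sketch below shows a kubelet configuration file that disables anonymous access, followed by a LimitRange that applies default requests and limits to containers that omit them. These are two separate files with different destinations. The namespace and the CPU and memory values are illustrative assumptions, not recommendations.

```yaml
# Kubelet configuration file (node-level, not applied with kubectl).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false                # reject unauthenticated kubelet API requests
  webhook:
    enabled: true                 # authenticate callers via the API server
authorization:
  mode: Webhook                   # instead of AlwaysAllow
---
# Namespace-level defaults for containers without explicit resources.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: app-team             # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                    # applied when a container sets no limit
        cpu: 500m
        memory: 512Mi
```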
3.3 Secrets and Data Classification
- Classify the data your application stores — not all data requires the same controls. PII, PHI, financial data, and credentials need stricter handling than logging data.
- Document your data flows: where does sensitive data enter, where does it go, who can access it, and how is it deleted? Auditors will ask this.
- Implement a secrets rotation policy. Credentials should have a maximum lifetime and be rotated automatically where possible.
- Never log sensitive data. Review your application logs for accidental PII leakage — full request bodies, auth headers, and error messages are common offenders.
4. Where to Start — Priority Action List
If you're doing a first pass on security hardening, this is the order of operations that gives you the most risk reduction for the least effort.
- Audit RBAC bindings: find and remove any `cluster-admin` bindings that don't have a clear justification.
- Enable MFA for all cloud IAM users and your Kubernetes API access (if using SSO/OIDC).
- Rotate and remove long-lived credentials — access keys, service account key files, hardcoded tokens.
- Enable cloud audit logging — CloudTrail, Cloud Audit Logs, or Azure Monitor. Don't let this stay off.
- Block public access on storage: S3 Block Public Access at the account level; on GCS, public access prevention together with uniform bucket-level access.
- Review security group / firewall rules: remove any rule allowing inbound `0.0.0.0/0` on non-web ports.
- Apply Pod Security Standards: start in `warn` mode, fix issues, then switch to `enforce`.
- Add NetworkPolicies: default-deny in workload namespaces, then open only required traffic.
- Scan your container images — integrate Trivy or Grype into CI and fail the build on critical findings.
- Set up GuardDuty or Security Command Center — one-click enable, immediate threat detection coverage.
Note: This guide covers the highest-impact controls but isn't exhaustive. Production security requires ongoing review, not a one-time pass.