Why This Matters for Batch and ML Workloads
Batch and machine learning jobs often face a fundamental uncertainty: the exact resource requirements—CPU, memory, GPU, or specialized hardware—aren't always known at the moment of job creation. Optimal allocation depends on real-time cluster capacity, queue priorities, and availability of scarce accelerators like GPUs. Prior to Kubernetes v1.36, once a Job's pod template was set, its resource requests and limits were immutable. If a queue controller such as Kueue determined that a suspended Job needed different resources, the only recourse was to delete and recreate the entire Job—a process that wiped out valuable metadata, status history, and any associated annotations. With the promotion of this feature to beta in v1.36, cluster administrators and automated schedulers can now modify container resource specifications on a suspended Job without losing its identity or history.
The Old Way: Inflexible Resource Allocation
In earlier releases, a Job's pod template was carved in stone after creation. For example, consider an ML training Job initially requesting 4 GPUs:
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: trainer
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never
If the cluster only had 2 GPUs available, the queue controller had no choice but to delete and re‑create the Job with reduced requests—a disruptive process that erased the Job’s history. This limitation was especially painful for CronJob-triggered workloads, where a particular instance might need to run with fewer resources rather than failing outright under load.
Real-World Example: Adjusting GPU Count Dynamically
With the new feature, a queue controller can update the suspended Job’s pod template in place. For instance, the controller scans the cluster and finds only 2 GPUs are free. It then adjusts the resource requests and limits to match:
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: trainer
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
      restartPolicy: Never
After the update, the controller sets spec.suspend to false, and the Job springs to life with the revised resource profile. The Job’s name, labels, annotations, and status remain intact. This capability is a game changer for batch systems that require dynamic resource negotiation.
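A minimal client-go sketch of this patch-then-resume pattern (a hypothetical standalone program rather than Kueue's actual implementation; it assumes kubeconfig-based access and reuses the Job and container names from the manifests above):

// resume_with_reduced_resources.go: shrink a suspended Job's trainer
// container in place, then resume it.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    jobs := kubernetes.NewForConfigOrDie(config).BatchV1().Jobs("default")

    // Step 1: while the Job is suspended, shrink the trainer container's
    // requests and limits with a strategic merge patch (the values mirror
    // the second manifest above).
    resourcePatch := []byte(`{"spec":{"template":{"spec":{"containers":[{
      "name":"trainer",
      "resources":{
        "requests":{"cpu":"4","memory":"16Gi","example-hardware-vendor.com/gpu":"2"},
        "limits":{"cpu":"4","memory":"16Gi","example-hardware-vendor.com/gpu":"2"}}}]}}}}`)
    if _, err := jobs.Patch(context.TODO(), "training-job-example-abcd123",
        types.StrategicMergePatchType, resourcePatch, metav1.PatchOptions{}); err != nil {
        panic(err)
    }

    // Step 2: resume the Job; from this point the pod template is immutable
    // again until the Job is re-suspended.
    resumePatch := []byte(`{"spec":{"suspend":false}}`)
    if _, err := jobs.Patch(context.TODO(), "training-job-example-abcd123",
        types.MergePatchType, resumePatch, metav1.PatchOptions{}); err != nil {
        panic(err)
    }
    fmt.Println("Job resumed with the reduced resource profile")
}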
How It Works in Practice
The core change is a targeted relaxation of the immutability constraint on pod template resource fields—but only for Jobs that are suspended. No new API types or breaking changes are introduced; the existing Job and Pod template structures are reused.
Behind the Scenes: API Relaxation
The Kubernetes API server now allows modifications to spec.template.spec.containers[*].resources.requests and .limits when spec.suspend is true. Once the Job is resumed, the pod template becomes immutable again until the Job is re‑suspended. This design ensures that the feature is safe and predictable: resources can only be adjusted while the Job is not actively running.
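As a quick illustration of that window, a controller that sends the same kind of patch to a Job that is not suspended should see the API server reject it. A hedged sketch of what that looks like (the tryShrink helper is illustrative, the assumption is a validation error on the running Job, and the exact error text varies by version):

// immutability_window.go: the patch that succeeds while a Job is suspended
// is rejected once spec.suspend is false.
package example

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    typedbatchv1 "k8s.io/client-go/kubernetes/typed/batch/v1"
)

// tryShrink attempts to lower the trainer container's CPU request in place.
// On a suspended Job this succeeds; on a running Job the API server is
// expected to return a validation error, because the pod template is
// immutable whenever spec.suspend is false.
func tryShrink(ctx context.Context, jobs typedbatchv1.JobInterface, name string) {
    patch := []byte(`{"spec":{"template":{"spec":{"containers":[{"name":"trainer","resources":{"requests":{"cpu":"2"}}}]}}}}`)
    if _, err := jobs.Patch(ctx, name, types.StrategicMergePatchType,
        patch, metav1.PatchOptions{}); err != nil {
        fmt.Printf("patch rejected (Job not suspended?): %v\n", err)
        return
    }
    fmt.Println("patch accepted (Job was suspended)")
}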
Integration with Queue Controllers like Kueue
Queue controllers and batch scheduling frameworks are the primary beneficiaries. Kueue, for example, can now adjust resource requirements for suspended Jobs during the admission or preemption phases without destroying and recreating them. This streamlines the workflow for complex batch pipelines, where multiple Jobs may be queued behind different resource constraints. The controller can also downgrade resource requests for a specific CronJob instance, allowing it to make progress slowly under heavy cluster load instead of failing.
Benefits for Cluster Administrators
Preserving Job Metadata and Status
The most immediate advantage is the preservation of Job identity. When a Job is deleted and re‑created, all metadata—including labels, annotations, creation timestamp, and associated events—is lost. With mutable resources, administrators can fine‑tune allocations without disrupting the Job’s lifecycle. This is essential for auditing, monitoring, and maintaining lineage in production environments.
Graceful Degradation Under Load
Another key benefit is the ability to implement graceful degradation. Suppose a CronJob triggers a resource‑intensive Job during a period of high cluster usage. Instead of the Job failing due to insufficient resources, the queue controller can reduce the resource requests (e.g., shrink memory or drop one GPU) and let the Job run with lower throughput. This keeps the system functioning and reduces the need for manual intervention.
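One way to implement such a policy is a small helper that scales the CPU and memory entries of a ResourceList by a load-dependent factor. A sketch under the assumption that GPU-style extended resources are stepped down separately (Scale is a hypothetical helper, not a Kueue or Kubernetes API):

// degrade.go: scale cpu/memory requests by a factor (e.g. 0.5 under load).
package example

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// Scale returns a copy of rl with cpu and memory multiplied by factor.
// CPU is scaled in milli-units to keep the arithmetic integral; extended
// resources such as GPU counts are left untouched here, since dropping a
// whole device is a separate, coarser decision.
func Scale(rl corev1.ResourceList, factor float64) corev1.ResourceList {
    out := corev1.ResourceList{}
    for name, q := range rl {
        switch name {
        case corev1.ResourceCPU:
            out[name] = *resource.NewMilliQuantity(int64(float64(q.MilliValue())*factor), q.Format)
        case corev1.ResourceMemory:
            out[name] = *resource.NewQuantity(int64(float64(q.Value())*factor), q.Format)
        default:
            out[name] = q.DeepCopy()
        }
    }
    return out
}

While the Job is suspended, the controller can apply Scale to each container's requests and limits, write the Job back, and only then resume it.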
Getting Started with Mutable Pod Resources
The feature is enabled by default in Kubernetes v1.36 (beta). No special feature gates need to be set. To try it out, simply create a suspended Job, then patch its spec.template.spec.containers resource fields before resuming. Queue controllers can be updated to leverage this capability for smarter scheduling. For more details, refer to the How It Works section above and the official Kubernetes documentation.
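For a programmatic starting point, the sketch below creates a Job directly in the suspended state, so its resources remain tunable until something resumes it (the image, namespace, and names are placeholders):

// create_suspended_job.go: create a Job with spec.suspend set to true.
package main

import (
    "context"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    suspend := true
    job := &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{Name: "training-job-example-abcd123"},
        Spec: batchv1.JobSpec{
            Suspend: &suspend, // created suspended: resources stay mutable
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:  "trainer",
                        Image: "registry.example.com/trainer:latest", // placeholder
                        Resources: corev1.ResourceRequirements{
                            Requests: corev1.ResourceList{
                                corev1.ResourceCPU:    resource.MustParse("8"),
                                corev1.ResourceMemory: resource.MustParse("32Gi"),
                                corev1.ResourceName("example-hardware-vendor.com/gpu"): resource.MustParse("4"),
                            },
                            Limits: corev1.ResourceList{
                                corev1.ResourceCPU:    resource.MustParse("8"),
                                corev1.ResourceMemory: resource.MustParse("32Gi"),
                                corev1.ResourceName("example-hardware-vendor.com/gpu"): resource.MustParse("4"),
                            },
                        },
                    }},
                },
            },
        },
    }
    if _, err := clientset.BatchV1().Jobs("default").Create(
        context.TODO(), job, metav1.CreateOptions{}); err != nil {
        panic(err)
    }
}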
Conclusion
Kubernetes v1.36’s mutable pod resources for suspended Jobs address a long‑standing pain point for batch and ML workloads. By allowing in‑place adjustments to CPU, memory, GPU, and extended resources, the platform enables more flexible and resilient scheduling without sacrificing Job history. Whether you’re running a large‑scale training pipeline or a simple data processing job, this feature helps your workloads adapt to changing cluster conditions and optimize resource utilization.