Is your feature request related to a problem? Please describe.
I'm always frustrated when I need to make stack updates, because doing so causes downtime in the agent pool. When I deploy an update, all agents in that group are terminated and replaced, and any jobs running on them fail with `Exited with status -1 (agent lost)`. I then have to restart all of those jobs manually or rely on users to do so.
Describe the solution you'd like
I would like agents to drain their workload before being terminated and replaced during a stack update.
Describe alternatives you've considered
- Performing the stack update during non-peak hours.
- Manually creating an adjacent stack, migrating to it, and then turning off the original stack.
Additional context
Perhaps using AWS lifecycle hooks to put instances in a Terminating:Wait state to allow draining would be helpful.
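
To make the lifecycle hook idea more concrete, here is a rough sketch of what could run on an instance once it enters Terminating:Wait. It assumes a termination lifecycle hook already exists on the ASG and that `autoscaling` is a boto3 Auto Scaling client. `stop_agent` and `agent_is_busy` are hypothetical callables that I made up for illustration (for example, asking the buildkite-agent service to stop gracefully and checking whether it is still running a job). None of this exists in the stack today.

```python
import time
from typing import Any, Callable


def drain_then_release(
    autoscaling: Any,                      # assumed: boto3.client("autoscaling")
    instance_id: str,
    asg_name: str,
    hook_name: str,
    stop_agent: Callable[[], None],        # hypothetical: request a graceful agent stop
    agent_is_busy: Callable[[], bool],     # hypothetical: True while a job is running
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Drain the agent, then let the ASG proceed with termination.

    Returns the number of heartbeats sent while waiting.
    """
    hook = dict(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
    )

    # Ask the agent to stop accepting new work.
    stop_agent()

    heartbeats = 0
    while agent_is_busy():
        # Extend the hook timeout so the instance stays in Terminating:Wait.
        autoscaling.record_lifecycle_action_heartbeat(**hook)
        heartbeats += 1
        sleep(poll_seconds)

    # The agent is idle, so termination can continue.
    autoscaling.complete_lifecycle_action(LifecycleActionResult="CONTINUE", **hook)
    return heartbeats
```

The callables are injected so the waiting logic can be exercised without AWS access. How the agent is actually stopped and polled is left open.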
Alternatively, I could detach all instances from the ASG before the stack update, but then I have the problem of determining when those agents have drained and can be terminated. It would help if the buildkite-agent service could detect the detached state and then drain its workload.
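
For the detach approach, one possible way for an instance to notice that it has left the ASG is to poll its own lifecycle state. The sketch below assumes `autoscaling` is a boto3 Auto Scaling client. `lifecycle_state` and `should_drain` are hypothetical helper names. It also assumes that an instance missing from the response has been detached, which would not hold for an instance that was never in an ASG.

```python
from typing import Any, Optional


def lifecycle_state(autoscaling: Any, instance_id: str) -> Optional[str]:
    """Return the instance's ASG lifecycle state, or None if no ASG lists it."""
    resp = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = resp.get("AutoScalingInstances", [])
    if not instances:
        return None
    return instances[0]["LifecycleState"]


def should_drain(autoscaling: Any, instance_id: str) -> bool:
    """True when the instance is leaving, or has already left, its ASG."""
    state = lifecycle_state(autoscaling, instance_id)
    if state is None:
        # Assumption: not listed means the instance was detached.
        return True
    return state.startswith(("Detaching", "Detached", "Terminating"))
```

Something on the instance could call `should_drain` periodically and, once it returns True, stop the agent gracefully and shut the instance down after the last job finishes.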