Zero downtime deployments for stack updates #1010

**Is your feature request related to a problem? Please describe.**

I'm always frustrated when I need to make stack updates, because doing so incurs downtime for the agent pool. When I deploy an update, all agents in that group are terminated and replaced, so running jobs fail with `Exited with status -1 (agent lost)`. I then have to restart all of those jobs manually or rely on users to do so.

**Describe the solution you'd like**

I would like agents to drain their workload before being terminated and replaced during a stack update. 

**Describe alternatives you've considered**

- Performing the stack update during non-peak hours.
- Manually creating an adjacent stack, migrating to it, and then turning off the original stack.

**Additional context**

Perhaps it would help to use AWS Auto Scaling lifecycle hooks to hold instances in the `Terminating:Wait` state so that they can drain before being terminated.
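
To make the lifecycle hook idea concrete, here is a rough sketch in Python. It assumes a boto3 Auto Scaling client is passed in as `autoscaling`; the hook name `drain-agents`, the timeout, and the `is_drained` callback are placeholders I made up, not anything the stack provides today. How "drained" is detected (for example, the agent process having exited after a graceful stop) is left to the caller.

```python
import time
from typing import Callable

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def register_drain_hook(autoscaling, asg_name: str,
                        hook_name: str = "drain-agents",  # hypothetical name
                        timeout_s: int = 3600) -> None:
    """Ask the ASG to park terminating instances in Terminating:Wait."""
    autoscaling.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleTransition=TERMINATING,
        HeartbeatTimeout=timeout_s,
        DefaultResult="CONTINUE",  # terminate anyway if nobody completes the hook
    )


def drain_then_release(autoscaling, asg_name: str, instance_id: str,
                       is_drained: Callable[[], bool],
                       hook_name: str = "drain-agents",
                       poll_s: float = 15.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Run on the instance (or a helper) once it enters Terminating:Wait."""
    while not is_drained():
        # Extend the wait while jobs are still running.
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
        )
        sleep(poll_s)
    # Tell the ASG it may now finish terminating the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```

With a real client this would be called as `register_drain_hook(boto3.client("autoscaling"), "<asg name>")` once per group, with `drain_then_release` running on each instance once it is told to terminate. In a CloudFormation-managed stack the hook would more likely be declared in the template than created by a script.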

Alternatively, I could detach all instances from the ASG before the stack update, but I would then have the problem of determining when those agents have drained and can be terminated. It would help if the buildkite-agent service could detect the detached state and then drain its workload.
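
For the detach approach, something like the following could run on each instance. This is only a sketch under assumptions: `autoscaling` is a boto3 Auto Scaling client, `stop_agent` is a hypothetical callback that asks the agent to stop gracefully, and I'm assuming that an instance which no longer appears in `describe_auto_scaling_instances` (or is reported as `Detaching`/`Detached`) has been detached.

```python
import time
from typing import Callable

DETACHED_STATES = {"Detaching", "Detached"}


def is_detached(autoscaling, instance_id: str) -> bool:
    """True if the instance is leaving, or no longer part of, any ASG."""
    resp = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = resp.get("AutoScalingInstances", [])
    # Assumption: an empty result means the instance belongs to no ASG.
    return not instances or instances[0]["LifecycleState"] in DETACHED_STATES


def drain_when_detached(autoscaling, instance_id: str,
                        stop_agent: Callable[[], None],
                        poll_s: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the instance is detached, then trigger a graceful stop."""
    while not is_detached(autoscaling, instance_id):
        sleep(poll_s)
    # stop_agent is a placeholder: it should let running jobs finish first.
    stop_agent()
```

Polling is the simplest option here; terminating the instance after the agent has exited would still need to be handled separately.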


