Zero downtime deployments for stack updates #1010

Description

@joeljeske

Is your feature request related to a problem? Please describe.

I'm always frustrated when a stack update incurs downtime for the agent pool. When I deploy an update, all agents in that group are terminated and replaced, so any running jobs fail with "Exited with status -1 (agent lost)". I then have to restart all those jobs manually or rely on users to do so.

Describe the solution you'd like

I would like agents to drain their workload before being terminated and replaced during a stack update.

Describe alternatives you've considered

  • Performing the stack update during non-peak hours.
  • Manually creating an adjacent stack, migrating to it, and then turning off the original stack.

Additional context

Perhaps using AWS lifecycle hooks to put instances in a Terminating:Wait state to allow draining would be helpful.
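
To make the lifecycle hook idea concrete, here is a rough sketch of what could run on an instance once it enters Terminating:Wait. It assumes a termination lifecycle hook already exists on the ASG and that `autoscaling` is a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`). The `agent_is_idle` callback is hypothetical: how to tell that the buildkite-agent has finished its jobs is exactly the open question here. This is not how the stack works today, only an illustration of the flow.

```python
import time
from typing import Callable


def drain_then_continue(
    autoscaling,
    asg_name: str,
    hook_name: str,
    instance_id: str,
    agent_is_idle: Callable[[], bool],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Hold an instance in Terminating:Wait until its agent is idle.

    `agent_is_idle` is a hypothetical callback supplied by the caller;
    it should return True once no jobs are running on this instance.
    Returns the number of heartbeats sent while waiting.
    """
    heartbeats = 0
    while not agent_is_idle():
        # Each heartbeat restarts the hook's timeout so the ASG keeps waiting.
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
        )
        heartbeats += 1
        sleep(poll_seconds)

    # Drained: tell the ASG it may proceed with terminating the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return heartbeats
```

The client and the sleep function are passed in so the loop can be exercised without AWS credentials. In practice `poll_seconds` would need to be shorter than the hook's heartbeat timeout.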

Alternatively, I could detach all instances from the ASG before the stack update, but I then have the problem of determining when those agents are drained and can be terminated. Maybe it would help if the buildkite-agent service could detect the detached state and then drain its workload.
